MiniCPM-o 4.5 is a 9‑billion‑parameter omnimodal model that can handle text, images, audio, and video, including real-time voice conversations and camera input. Users can talk to it by voice, stream their webcam, or upload images and clips while the model responds with synthesized speech.
In demos, MiniCPM-o narrates live scenes from a video feed, reads supermarket price labels aloud, and explains what someone is sketching on a whiteboard in real time. It can also perform high-quality OCR on images and handwritten notes, understanding crossed-out words and choosing the corrected text, and it answers visual reasoning puzzles reliably.
A standout feature is voice transformation: the model can reply in the style of public figures such as “Elon Musk” or fictional voices like “Samantha” from the film Her, showcasing flexible TTS capabilities. On reported benchmarks, MiniCPM-o outperforms other omni models in visual understanding, live streaming, and speech conversation, despite its relatively small size.
The 9B checkpoint is about 23.4 GB, making it feasible to run on a single RTX 4090 (24 GB of VRAM), and future quantized versions will likely target GPUs with less VRAM. Setup instructions are already available on the project page.
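To put the size figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the 23.4 GB checkpoint stores weights at 16 bits each and that quantized versions would shrink linearly with bits per weight. Both are assumptions for illustration, not published details, and the function names are made up for this example. It also ignores activations, the KV cache, and any audio or video buffers, which all need memory on top of the weights.

```python
# Back-of-the-envelope VRAM arithmetic. All inputs other than the 23.4 GB
# checkpoint size and the 24 GB RTX 4090 capacity are assumptions.

CHECKPOINT_GB = 23.4      # checkpoint size quoted in the article
SOURCE_BITS = 16          # ASSUMPTION: weights are stored at 16 bits each
RTX_4090_VRAM_GB = 24.0   # VRAM on a single RTX 4090


def scaled_size_gb(checkpoint_gb: float, source_bits: int, target_bits: int) -> float:
    """Estimate the size after quantization, assuming linear scaling with bit width."""
    return checkpoint_gb * target_bits / source_bits


def headroom_gb(weights_gb: float, vram_gb: float) -> float:
    """VRAM left over after loading the weights (negative means they do not fit)."""
    return vram_gb - weights_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = scaled_size_gb(CHECKPOINT_GB, SOURCE_BITS, bits)
        spare = headroom_gb(size, RTX_4090_VRAM_GB)
        print(f"{bits:>2}-bit: ~{size:.1f} GB weights, ~{spare:.1f} GB spare on a 24 GB card")
```

Under these assumptions the full checkpoint leaves well under 1 GB spare on a 24 GB card, which is why lower-bit quantizations would matter for GPUs with less memory.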