MiniCPM-o 4.5 packs live voice, vision, and video into a 9B open model

Duane Villanueva • Mar 9, 2026 • 1 min read

MiniCPM-o 4.5 is a 9‑billion‑parameter omnimodal model that can handle text, images, audio, and video, including real-time voice conversations and camera input. Users can talk to it by voice, stream their webcam, or upload images and clips while the model responds with synthesized speech.

In demos, MiniCPM-o narrates live scenes from a video feed, reads supermarket price labels aloud, and explains what someone is sketching on a whiteboard in real time. It can also perform high-quality OCR on images and handwritten notes, understanding crossed-out words and choosing the corrected text, and it answers visual reasoning puzzles reliably.

A standout feature is voice transformation: the model can reply in the style of public figures such as Elon Musk or fictional characters like Samantha from the film Her, showcasing flexible TTS capabilities. On benchmarks, MiniCPM-o outperforms other omni models in visual understanding, live streaming, and speech conversation, despite its relatively small size.

The 9B checkpoint is about 23.4GB, which fits just within the 24GB of VRAM on a single RTX 4090, and future quantizations will likely target lower-VRAM GPUs. Setup instructions are already available on the project page.
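For a rough sense of the memory arithmetic, here is a minimal back-of-the-envelope sketch. It takes the 23.4GB checkpoint size and the 9B parameter count from above, and treats the RTX 4090's nominal 24GB as the budget. The bit widths (16, 8, 4) are illustrative assumptions, not announced quantization formats, and the estimate covers weights only: activations, the KV cache, and any audio or vision components are ignored, so real usage will be higher.

```python
# Back-of-the-envelope VRAM arithmetic. Weights only; runtime overhead
# (activations, KV cache, framework buffers) is deliberately ignored.

PARAMS = 9e9              # parameter count quoted in the article
CHECKPOINT_GB = 23.4      # checkpoint size quoted in the article
RTX_4090_VRAM_GB = 24.0   # nominal VRAM of an RTX 4090


def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def headroom_gb(footprint_gb: float, vram_gb: float = RTX_4090_VRAM_GB) -> float:
    """VRAM left over after loading a payload of the given size."""
    return vram_gb - footprint_gb


if __name__ == "__main__":
    print(f"Full checkpoint: {CHECKPOINT_GB:.1f} GB, "
          f"headroom {headroom_gb(CHECKPOINT_GB):.1f} GB")
    # Hypothetical bit widths, chosen only to illustrate the scaling.
    for bits in (16, 8, 4):
        size = weight_footprint_gb(PARAMS, bits)
        print(f"{bits:>2}-bit weights: {size:.1f} GB, "
              f"headroom {headroom_gb(size):.1f} GB")
```

The thin margin on the full checkpoint is why lower-bit quantizations matter for cards with less memory: halving the bits per parameter roughly halves the weight footprint.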
