You shipped an AI feature, users loved it, and then the API bill showed up. Every new user makes it bigger.
Or a serious buyer went quiet the moment you could not promise their data would stay off a third-party server.
Google just released something that speaks to both of those problems. Gemma 4 12B is an open multimodal model small enough to run on a laptop with 16GB of memory. It handles text, images, and audio, and lands close to the performance of Google's much larger 26B model.
It ships under an Apache 2.0 license, so you can use it, modify it, and deploy it commercially.
For a founder deciding how to build, that last line is the one that matters most.
The cost question just moved
Every call to a cloud model costs money per token, and that cost grows with every user you add.
A model running on your own hardware is much closer to a fixed cost. You pay for the machine once, and each request after that costs little beyond power and upkeep.
For an early product burning runway on API bills, that difference can decide whether a feature is worth shipping at all. Gemma 4 12B makes the local option viable for real workloads, not just demos.
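To make that trade-off concrete, here is a rough break-even sketch. The function name and every figure in it are hypothetical placeholders, not real prices; swap in your own hardware quote, your actual cloud cost per request, and an estimate of local running costs such as power.

```python
import math
from typing import Optional


def breakeven_requests(
    hardware_cost: float,
    cloud_cost_per_request: float,
    local_cost_per_request: float = 0.0,
) -> Optional[int]:
    """Return how many requests it takes for a one-time hardware spend
    to beat per-request cloud pricing, or None if it never does."""
    saving_per_request = cloud_cost_per_request - local_cost_per_request
    if saving_per_request <= 0:
        # Local inference is not cheaper per request, so it never pays off.
        return None
    return math.ceil(hardware_cost / saving_per_request)


if __name__ == "__main__":
    # Placeholder numbers for illustration only (assumptions, not quotes).
    requests_needed = breakeven_requests(
        hardware_cost=1500.0,
        cloud_cost_per_request=0.5,
        local_cost_per_request=0.25,
    )
    print(f"Hardware pays for itself after {requests_needed} requests")
```

If the break-even point sits well below the volume you expect over the life of the machine, the local option deserves a serious look. If it sits far above, the cloud bill may be the cheaper problem to have for now.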
What encoder-free buys you
Most multimodal systems bolt a separate vision encoder and a separate audio encoder onto the language model, which adds latency and memory every time you process a screenshot or a voice note.
Gemma 4 12B feeds images and audio straight into the model backbone using a lightweight 35-million-parameter embedder and direct projection for raw audio.
The result is one model that can read a screenshot, hear a voice note, and answer a question, all on a normal machine.
For your product, that means fewer moving parts and faster responses for your users.
The privacy angle founders underrate
When the model runs on the device, customer data never leaves it. No prompt, no document, no recording is sent to a third-party server.
If you sell into healthcare, finance, or legal, this turns a compliance headache into a selling point.
You can tell a buyer their data stays on their own machine and mean it, which is often easier than negotiating data processing agreements with a cloud vendor.
Where the line sits
Gemma 4 12B will not match a frontier cloud model on your hardest reasoning tasks, and pretending otherwise will burn you in production.
The real skill is mapping which parts of your product can run on-device and which still belong in the cloud.
A transcription step or an image-tagging feature can run locally and save you a fortune. A complex multi-step agent might still need the cloud.
Most products end up with a mix, and getting that split right is what protects both your margins and your output quality.
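One way to start that mapping is to write the rule down as code. The sketch below is a hypothetical starting point: the `Feature` fields, the feature names, and the decision rule itself are assumptions made for illustration, and a real product would likely weigh more factors, such as latency and the hardware your users actually have.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    ON_DEVICE = "on-device"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Feature:
    name: str
    handles_sensitive_data: bool
    needs_multistep_reasoning: bool


def choose_target(feature: Feature) -> Target:
    """Assumed rule: privacy wins first, then reasoning difficulty."""
    if feature.handles_sensitive_data:
        # Data that must not leave the machine pins the feature on-device.
        return Target.ON_DEVICE
    if feature.needs_multistep_reasoning:
        # Hard multi-step work goes to the larger cloud model.
        return Target.CLOUD
    # Everything else defaults to the cheaper local model.
    return Target.ON_DEVICE


if __name__ == "__main__":
    # Hypothetical features for illustration.
    features = [
        Feature("voice-note transcription", True, False),
        Feature("image tagging", False, False),
        Feature("research agent", False, True),
    ]
    for feature in features:
        print(f"{feature.name}: {choose_target(feature).value}")
```

Note that this rule keeps a sensitive feature on-device even when it needs heavy reasoning, which trades output quality for privacy. Whether that is the right call depends on your buyers, and it is exactly the kind of decision worth making deliberately rather than by default.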
Figuring out your split
Most founders we work with do not need to pick one path. They need a clear map of which features run on-device, which stay in the cloud, and what each choice does to their costs and their compliance story.
That is the conversation we have on a call. We look at your product, your users, and your data, and we work out where each piece of AI should run.
If you are weighing a local model like Gemma 4 12B, book a free call and we will sketch that map with you, with a straight answer on whether it moves the needle for your situation. No obligation.
