
Multimodal isn’t “a bigger prompt”. It’s a perception + reasoning + UX system with new contracts, new failure modes, and new latency/cost constraints. This month is about designing it so it behaves predictably.
Axel Domingues
In 2023, the hard lesson was: LLMs are probabilistic components.
In early 2024, the next lesson lands:
LLMs are an I/O runtime.
Long context changed how we do retrieval and state.
Multimodality changes something more fundamental:
it changes the interaction surface of the product.
A text-only assistant lives in the world of prompts, tokens, and documents.
A multimodal assistant lives in the world of:
text + vision + audio (input and output), with the option to route between multiple models and tools.
That’s not “prompt engineering”.
That’s systems design.
The goal this month
Turn “it can see and hear” into a designed user experience you can operate.
The mindset shift
Multimodal isn’t a model feature.
It’s a pipeline + UI contract + failure budget.
What changes
You now ingest untrusted media and must turn it into usable, auditable state.
The takeaway
Separate perception from reasoning, and design graceful degradation for every step.
A text chat is roughly:
user text → model → response
A multimodal product is closer to:
capture → normalize → perceive → reason → confirm → act → explain
Where “perceive” might be OCR on a screenshot, ASR on a voice note, or structure extraction from a document.
And “act” might be a tool call or any side effect the assistant triggers on the user’s behalf.
The core design implication is this:
you need to own the intermediate representations, not just the final model output.
If you don’t, you can’t debug. And you can’t build stable UX.
This is the backbone I keep coming back to:
You can implement this in one service or ten — the point is the separation of concerns.
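Here’s a minimal sketch of that backbone in Python. The stage functions are placeholders I made up for illustration, not a prescribed API; the point is that every stage hands the next one state you can inspect and log.

```python
from dataclasses import dataclass, field

# Intermediate representations are the product here: every stage returns
# state you can inspect, log, and re-run, not just a final string.

@dataclass
class PerceivedInput:
    modality: str            # "image", "audio", "document"
    text: str                # OCR / ASR output: a signal, not truth
    confidence: float        # perception confidence, surfaced downstream
    metadata: dict = field(default_factory=dict)

@dataclass
class ProposedAction:
    answer: str              # what the assistant wants to say or do next
    needs_confirmation: bool # side effects always pass through "confirm"

def normalize(raw: bytes, modality: str) -> bytes:
    # Placeholder: resize images, resample audio, strip junk metadata, etc.
    return raw

def perceive(raw: bytes, modality: str) -> PerceivedInput:
    # Placeholder for OCR / ASR / vision models.
    return PerceivedInput(modality=modality, text="<extracted text>", confidence=0.9)

def reason(perceived: PerceivedInput, goal: str) -> ProposedAction:
    # Placeholder for the LLM call that consumes perceived state plus the user goal.
    return ProposedAction(answer=f"Based on your {perceived.modality}: ...", needs_confirmation=True)

def handle(raw: bytes, modality: str, goal: str) -> ProposedAction:
    # capture -> normalize -> perceive -> reason; confirm/act/explain live in the UX layer.
    return reason(perceive(normalize(raw, modality), modality), goal)
```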
Perception is lossy. Reasoning is probabilistic. Tools are dangerous. Design each layer accordingly.
Treat modalities like sensors on a robot:
Sensors produce signals, not truth.
So your product needs to support confirmation of what was perceived, easy correction when it’s wrong, and graceful degradation when a signal is missing or low quality.
This is where classic UX design meets reliability engineering.
A multimodal system often has too much raw material: transcripts, OCR snippets, timestamps, and sources.
If you show everything, the UI becomes a log file.
Instead, design the response as layers:
Default view
Answer + next action, in human language.
Expandable detail
“Show evidence” reveals transcript spans, OCR snippets, timestamps, and sources.
This is the same discipline as RAG citations, but applied to media-derived signals.
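A sketch of what that layered contract can look like as a data structure. The field names are illustrative, not a spec.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Evidence:
    kind: str                        # "transcript_span", "ocr_snippet", ...
    content: str                     # the exact quoted text
    source: str                      # artifact id or filename
    timestamp: Optional[str] = None  # e.g. "00:03:12" for audio/video

@dataclass
class LayeredResponse:
    answer: str                               # default view: the answer itself
    next_action: Optional[str] = None         # ...plus the next action, in human language
    evidence: list[Evidence] = field(default_factory=list)  # revealed by "Show evidence"

    def default_view(self) -> str:
        return self.answer if not self.next_action else f"{self.answer}\nNext: {self.next_action}"
```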
Users don’t know what the model needs.
They know what they can provide: a screenshot, a photo, a voice note.
So the assistant should ask for capture improvements, not vague clarification.
Examples: “Can you capture the whole window, not just the error dialog?” or “The audio cut off, can you re-record the last part?”
This isn’t politeness — it’s accuracy engineering.
If you don’t detect low-quality inputs early, you waste tokens and return confident nonsense.
In January I argued that prompting is not programming — you need contracts.
Multimodal doubles down on this: you need contracts for the inputs too.
Here are contracts that work in real systems.
They store signals, plus enough metadata to audit and re-run perception later.
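Here’s one illustrative shape for a perception artifact, assuming you store the signal alongside the metadata needed to audit it and re-run perception later. The field names are mine, not a standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PerceptionArtifact:
    """A stored perception result: the signal, plus enough metadata
    to audit it and re-run perception later."""
    artifact_id: str
    modality: str             # "image", "audio", "document"
    source_uri: str           # where the original media lives
    media_hash: str           # so re-runs can be compared against the same input
    extractor: str            # e.g. "ocr", "asr"
    extractor_version: str    # the model/version that produced this signal
    text: str                 # the extracted signal itself
    confidence: float
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    spans: list[dict] = field(default_factory=list)  # offsets, timestamps, bounding boxes
```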
When teams skip this separation, they usually do one of two things: paste raw OCR and transcripts straight into the prompt, or let a single end-to-end call handle perception and reasoning with no intermediate state.
Both approaches work in demos.
Both approaches fall apart in production.
The better pattern: store perception outputs as versioned artifacts, then let reasoning work from a summarized state that points back to them.
That’s how you get debuggability, provenance, and the ability to re-run perception without redoing everything.
Multimodality is a token multiplier.
So budgets can’t be a backend-only concern. They must be a product decision:
What does the user get when budgets are tight?
This is where staged responses shine.
If you only ship “correct” without “fast”, the UX feels broken.
If you only ship “fast” without “correct”, you ship lies faster.
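A sketch of what “budgets as a product decision” can look like in code, with a made-up threshold and a crude token estimate; the shape matters here, not the numbers.

```python
def answer_within_budget(perceived_text: str, token_budget: int) -> dict:
    # Crude token estimate (roughly 4 characters per token); a real system
    # would use the tokenizer of whatever model it calls.
    estimated_tokens = len(perceived_text) // 4
    if estimated_tokens <= token_budget:
        return {"mode": "full", "input": perceived_text}
    # Degrade gracefully: work from a truncated signal and say so,
    # instead of going silent or returning an error.
    return {
        "mode": "summary_first",
        "input": perceived_text[: token_budget * 4],
        "notice": "Long input: here's a first pass; ask to go deeper on any section.",
    }
```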
In text systems, we talk about hallucinations.
In multimodal, you also have perception errors delivered with the same confident tone: misread text, mis-heard words, swapped names, wrong numbers.
A useful failure budget isn’t a single number.
It’s per stage.
Perception budget
How often can OCR/ASR be wrong before the feature becomes unusable?
Reasoning budget
How often can summaries be misleading before trust collapses?
Action budget
How often can side effects be wrong?
(Usually: almost never.)
UX budget
How often can the user be confused about “why” before they stop using it?
Your mitigation strategy should match the budget: better capture and quality checks for perception, cited evidence for reasoning, explicit confirmation for actions, and on-demand explanations for UX.
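One way to write that down, with illustrative numbers; the point is one budget and one mitigation per stage, not a single global accuracy target.

```python
# One budget and one mitigation per stage. The numbers are placeholders.
FAILURE_BUDGETS = {
    "perception": {"max_error_rate": 0.05,  "mitigation": "ask for a better capture"},
    "reasoning":  {"max_error_rate": 0.02,  "mitigation": "require cited evidence"},
    "action":     {"max_error_rate": 0.001, "mitigation": "explicit confirmation gate"},
    "ux":         {"max_error_rate": 0.10,  "mitigation": "always show 'why' on demand"},
}

def within_budget(stage: str, observed_error_rate: float) -> bool:
    return observed_error_rate <= FAILURE_BUDGETS[stage]["max_error_rate"]
```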
Text systems already have prompt injection.
Multimodal adds injection through the media itself: instructions embedded in screenshots, spoken in audio, or hidden in documents the assistant is asked to read.
So the safe stance is:
treat all perceived content as untrusted input until verified.
“Looks like the user asked” is not authorization.
Below is a flow that consistently produces good multimodal UX without drowning in complexity.
Symptom: the assistant confidently references UI labels that were not relevant.
Fix: treat OCR as raw signal and force reasoning to cite the exact span/crop used. If it can’t point to evidence, it shouldn’t claim it.
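A sketch of that “no evidence, no claim” check, assuming the reasoning step returns claims with a quoted OCR span; the names and shapes are illustrative.

```python
# Verify that each claim's quoted evidence actually appears in the perceived
# text before presenting it as fact; otherwise downgrade it to "unverified".
def verify_cited_claims(claims: list[dict], ocr_text: str) -> list[dict]:
    verified = []
    for claim in claims:
        quote = claim.get("evidence_quote", "")
        if quote and quote in ocr_text:
            verified.append(claim)
        else:
            verified.append({**claim, "status": "unverified"})
    return verified

claims = [
    {"statement": "The error code is 0x80070057", "evidence_quote": "0x80070057"},
    {"statement": "The user clicked Save", "evidence_quote": ""},
]
print(verify_cited_claims(claims, ocr_text="Error 0x80070057: invalid parameter"))
```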
Symptom: a negation gets lost (“don’t” becomes “do”), names are swapped, numbers are wrong.
Fix: use timestamped transcripts, and for high-stakes fields (amounts, dates), require explicit confirmation and show the exact quoted segment.
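A sketch of that confirmation rule for high-stakes transcribed fields; the field list and message format are illustrative.

```python
from typing import Optional

# Amounts, dates, and similar fields are never acted on directly: the user
# sees the exact quoted segment with its timestamps and must confirm it.
HIGH_STAKES_FIELDS = {"amount", "date", "account_number"}

def confirmation_prompt(field: str, value: str, quote: str, start: str, end: str) -> Optional[str]:
    if field not in HIGH_STAKES_FIELDS:
        return None  # low-stakes fields can flow through without a round trip
    return f'I heard {field} = "{value}" (from "{quote}", {start} to {end}). Is that correct?'

print(confirmation_prompt("amount", "1,450.00",
                          quote="send one thousand four fifty",
                          start="00:01:02", end="00:01:06"))
```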
Symptom: it describes what it sees, but doesn’t solve the problem.
Fix: ask one clarifying question early: “What do you want me to do with this?” Then proceed with staged responses.
Symptom: user uploads media and waits in silence.
Fix: stream progress (“processing audio… extracting text…”) and provide partial output quickly (summary first, details later).
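A sketch of staged output as a stream of events, with made-up stage names and fake latencies.

```python
import time
from typing import Iterator

# The user sees progress immediately, a summary as soon as it exists,
# and the full detail later.
def staged_response(media_id: str) -> Iterator[dict]:
    yield {"type": "progress", "message": "processing audio..."}
    time.sleep(0.1)  # stand-in for ASR latency
    yield {"type": "progress", "message": "extracting text..."}
    time.sleep(0.1)  # stand-in for perception latency
    yield {"type": "partial", "summary": "Voicemail about rescheduling Thursday's call."}
    yield {"type": "final", "details": "Full transcript and extracted action items."}

for event in staged_response("voicemail-123"):
    print(event)
```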
Multimodal systems run on expensive, failure-prone pipelines.
If you don’t log the right things, you will debug by vibes.
What matters:
How often did the user correct the system’s perceived state?
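A sketch of that metric, assuming you log an event when perceived state is shown and another when the user corrects it; the event shapes are illustrative.

```python
from collections import Counter

def correction_rate(events: list[dict]) -> float:
    counts = Counter(e["type"] for e in events)
    shown = counts.get("perception_shown", 0)
    corrected = counts.get("perception_corrected", 0)
    return corrected / shown if shown else 0.0

events = [
    {"type": "perception_shown", "artifact": "ocr-1"},
    {"type": "perception_corrected", "artifact": "ocr-1", "field": "amount"},
    {"type": "perception_shown", "artifact": "asr-2"},
]
print(correction_rate(events))  # 0.5 -> half the perceived states needed fixing
```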
The experience you want is not “wow it described my image”.
It’s “it understood what I showed it, confirmed what it was about to do, and got it right.”
That requires designing multimodality as a product subsystem — not a model checkbox.
If you can afford it, separate perception and reasoning.
A dedicated perception stage gives you inspectable artifacts, provenance, and the ability to re-run or swap models per modality.
End-to-end multimodal models are great for rapid iteration and simpler stacks, but they make debugging and provenance harder.
Assume perceived content is untrusted input.
Only execute side effects after explicit user confirmation of the interpreted request and the specific values involved.
If the action changes money, identity, or access: add a second gate.
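A sketch of that gate, with an illustrative set of sensitive categories.

```python
# Perceived content never triggers an action directly: the interpreted request
# must be explicitly confirmed, and actions touching money, identity, or access
# need a second gate (e.g. re-auth or a second reviewer).
SENSITIVE_CATEGORIES = {"money", "identity", "access"}

def can_execute(action: dict, user_confirmed: bool, second_gate_passed: bool = False) -> bool:
    if not user_confirmed:
        return False  # "looks like the user asked" is not authorization
    if action.get("category") in SENSITIVE_CATEGORIES:
        return second_gate_passed
    return True

print(can_execute({"category": "money", "summary": "send $1,450"}, user_confirmed=True))            # False
print(can_execute({"category": "money", "summary": "send $1,450"}, True, second_gate_passed=True))  # True
```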
Don’t paste raw OCR/transcripts by default.
Store them as artifacts, then summarize into a compact working state: key facts, extracted fields, and pointers back to the raw artifacts.
Only pull full raw text when the task truly needs it.
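A sketch of a compact working state that keeps pointers to raw artifacts instead of inlining them; field names are illustrative.

```python
from dataclasses import dataclass, field

# The model sees key facts and pointers, not raw transcripts; raw text is
# pulled in only when a step actually needs it.
@dataclass
class WorkingState:
    goal: str
    key_facts: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    artifact_refs: list[str] = field(default_factory=list)  # ids of stored raw OCR/transcripts

    def to_prompt(self) -> str:
        return (
            f"Goal: {self.goal}\n"
            f"Known: {'; '.join(self.key_facts) or 'nothing yet'}\n"
            f"Open questions: {'; '.join(self.open_questions) or 'none'}\n"
            f"Raw artifacts available on request: {', '.join(self.artifact_refs) or 'none'}"
        )

state = WorkingState(
    goal="Resolve the billing error from the screenshot",
    key_facts=["Error 0x80070057 on the payment page"],
    artifact_refs=["ocr-artifact-42"],
)
print(state.to_prompt())
```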
April made the point that model selection becomes architecture.
May made the case that open weights are production engineering, not ideology.
June added a new reality:
Multimodal systems generate lots of signals — but signals aren’t truth.
Next month we build the discipline that turns signals into reliable answers:
RAG you can evaluate — retrieval pipelines, reranking, citations, and truth boundaries.
Because once you can see and hear…
the next hard problem is proving you can be right.