There's a comfortable story about where AI is going: the models get bigger, the data centres get bigger, and one day a single colossal "brain" does everything. It's a tidy story. It's also probably wrong — or at least, only half the picture. The more interesting future isn't one enormous mind. It's many small minds, each good at one thing, wired together so the whole is far more than the sum of its parts.
If that sounds speculative, it shouldn't. You're running an existence proof of it right now, between your ears. And the first credible engineering versions are already shipping. This article connects those two things: how your nervous system is a multi-model system, why software is converging on the same shape, and what tools like Claude Code — which run small agents locally while leaning on a big model in the cloud — tell us about the direction of travel. I'll define the jargon as we go.
From "one big brain" to "a society of minds"
The instinct to scale a single monolith is understandable — it has worked astonishingly well for large language models. But intelligence in nature is conspicuously not one giant neuron. Your brain has on the order of 86 billion neurons, organised into dozens of specialised regions that talk to each other. No single region "is" you; the experience of being you emerges from their collaboration.
The computer scientist Marvin Minsky made this his life's argument. In The Society of Mind (1986) he proposed that the mind is built from many small processes — "agents" — each individually dumb, that produce intelligence only through their interaction. Decades later, that framing is turning out to be a remarkably good blueprint for building software, not just for describing biology.
The unit of intelligence may not be the model. It may be the system of models — and the wiring, routing and feedback that connect them.
Your nervous system is already a multi-model system
Before we talk about software, look at the reference design evolution spent a few hundred million years debugging. The human nervous system is not a single processor. It's a distributed, layered, specialised network with several properties that any AI architect would recognise — and envy.
1. It splits "reflex" from "deliberation" by latency
When you touch something hot, your hand pulls back before you consciously feel the pain. That's a reflex arc: the signal goes from sensor to spinal cord and straight back to the muscle, skipping the brain entirely. It's fast because it's local and simple. Meanwhile the slower signal reaches your brain, which does the expensive work of interpreting "that was the kettle, move it, check for a burn."
This is exactly the tiered-latency design modern systems want: handle the urgent, narrow, common case instantly and locally; escalate the rare, hard, ambiguous case to something slower and more powerful. The body doesn't send every stimulus to the cortex, and a good AI system shouldn't send every request to a giant model.
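The tiered-latency idea can be sketched in a few lines. This is a minimal illustration, not any particular product's design: `reflex`, `deliberate`, `handle` and the canned lookup table are all hypothetical names, and the "slow" tier is a stub that makes no network call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Answer:
    text: str
    tier: str  # "reflex" or "deliberate"


def reflex(request: str) -> Optional[str]:
    """Fast local path: answers only the narrow, common cases it knows."""
    canned = {"ping": "pong", "status": "ok"}  # hypothetical lookup table
    return canned.get(request.strip().lower())


def deliberate(request: str) -> str:
    """Stand-in for a slow, capable remote model (no network call is made)."""
    return f"[considered answer to: {request}]"


def handle(request: str,
           fast: Callable[[str], Optional[str]] = reflex,
           slow: Callable[[str], str] = deliberate) -> Answer:
    """Try the cheap local path first; escalate only when it declines."""
    quick = fast(request)
    if quick is not None:
        return Answer(quick, tier="reflex")
    return Answer(slow(request), tier="deliberate")
```

The design choice that matters is that the fast path is allowed to say "I don't know" (here, by returning `None`) rather than guess, which is what makes escalation safe.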
2. It has a fast, cheap "is this dangerous?" model running in parallel
The neuroscientist Joseph LeDoux described two routes for fear. A "low road" runs from the thalamus straight to the amygdala — quick and dirty, enough to flinch at a stick that might be a snake. A "high road" goes thalamus → visual cortex → amygdala — slower, but able to confirm "it's just a stick." A cheap, fast model triggers a cautious default; a slower, accurate model corrects it. The cost of a false alarm (flinching at a stick) is tiny; the cost of a miss (ignoring a snake) is fatal — so the architecture is deliberately biased toward fast-and-cautious, then refined.
If you've read our walkthrough of how CricCuts works, this will feel familiar: a cheap detector casts a wide net and a more careful stage refines it. Biasing toward "catch everything, then filter" is not a hack — it's the same trade-off your amygdala makes.
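Here is a toy version of "catch everything, then filter". The data, the 0.2 threshold and the function names are invented for illustration, and `high_road` cheats by reading the true label where a real system would run a slower, more accurate model.

```python
from typing import List, Tuple

# (what it really is, how snake-like it looks at a glance, from 0 to 1)
Sighting = Tuple[str, float]


def low_road(sighting: Sighting, threshold: float = 0.2) -> bool:
    """Cheap and jumpy: fires on anything even slightly snake-like."""
    return sighting[1] >= threshold  # low threshold on purpose, favouring recall


def high_road(sighting: Sighting) -> bool:
    """Slow and accurate: a stand-in that simply reads the true label."""
    return sighting[0] == "snake"


def cascade(sightings: List[Sighting]) -> Tuple[List[Sighting], List[Sighting]]:
    """Stage 1 flinches at every candidate; stage 2 confirms or dismisses it."""
    flinched = [s for s in sightings if low_road(s)]
    confirmed = [s for s in flinched if high_road(s)]
    return flinched, confirmed


SIGHTINGS = [("stick", 0.6), ("snake", 0.3), ("leaf", 0.05), ("snake", 0.9)]
```

With these made-up numbers the first stage flinches at three of the four sightings, including the stick, and the second stage keeps only the two snakes. A stricter first-stage threshold of 0.5 would have avoided nothing important and missed the snake that scored 0.3.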
3. It routes, it specialises, and it has a co-processor
The thalamus acts like a router, relaying most incoming sensory streams to the right cortical region (smell is the notable exception). The cortex is full of specialists: visual cortex for sight, auditory cortex for sound, and so on. These regions are, in effect, purpose-built models. The cerebellum is a dedicated co-processor for timing and smooth movement; the basal ganglia gate which action actually fires; the prefrontal cortex plans and supervises; the hippocampus writes new memories. None of these is general-purpose. Intelligence is what happens when they coordinate.
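In software, the relay role is often nothing more exotic than a dispatch table. A minimal sketch, with hypothetical specialist names and stub handlers:

```python
from typing import Callable, Dict

# Hypothetical specialists; each stub only reports what it received.
SPECIALISTS: Dict[str, Callable[[bytes], str]] = {
    "image": lambda data: f"vision handled {len(data)} bytes",
    "audio": lambda data: f"hearing handled {len(data)} bytes",
}


def route(kind: str, data: bytes,
          specialists: Dict[str, Callable[[bytes], str]] = SPECIALISTS) -> str:
    """Relay an input to the specialist registered for its kind."""
    handler = specialists.get(kind)
    if handler is None:
        raise KeyError(f"no specialist registered for input kind {kind!r}")
    return handler(data)
```

The router itself knows nothing about vision or hearing. It only knows who does, which is why adding a specialist means adding one entry rather than retraining anything.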
4. It degrades gracefully
Damage one region and you often lose one capability, not the whole person. A distributed system of specialists is inherently more robust than a monolith where one failure takes everything down. That fault-tolerance is a feature of the architecture, not an accident.
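One common way to get this property in software is a fallback chain. The sketch below is an assumption about how you might build it, not a description of any specific system; `first_working` and the specialist names in the usage note are hypothetical.

```python
from typing import Any, Callable, List, Tuple


def first_working(specialists: List[Tuple[str, Callable[[Any], Any]]],
                  payload: Any,
                  default: Any) -> Tuple[Any, str, List[str]]:
    """Try each specialist in order and return the first result that works.

    Returns (result, name of the part that answered, list of failures).
    A crashed specialist costs one capability, not the whole request.
    """
    failures: List[str] = []
    for name, run in specialists:
        try:
            return run(payload), name, failures
        except Exception as exc:  # deliberately broad: any failure means "try the next"
            failures.append(f"{name}: {exc}")
    return default, "default", failures
```

For example, if a hypothetical `pose` specialist raises an error and an `audio` specialist is next in line, the caller still gets an answer from `audio`, plus a record of what failed so the degradation is visible rather than silent.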
The same shape is appearing in AI
Now the engineering. Three trends, each independently pushing away from "one model does everything" toward "many models cooperate." They map onto the brain more closely than their inventors usually admit.
Mixture-of-Experts: specialists inside one model
Several of today's largest models are Mixture-of-Experts (MoE) designs. Instead of one dense network where every parameter fires for every input, an MoE model contains many "expert" sub-networks plus a small router (also called a gating network) that, for each input token, activates only the few experts best suited to it.
The parallel to the brain is close: specialised circuits, a routing mechanism, and sparse activation, where only the relevant parts do most of the work. Listening to music leans on your auditory cortex far more than on your visual cortex. The analogy is loose in one respect: MoE experts are learned, and they rarely line up with tidy human-readable topics.
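To make "a router activates only a few experts" concrete, here is a toy top-k gate in plain Python. It is a sketch of the general idea only: real MoE layers sit inside a neural network, learn their gate weights during training and usually add load-balancing tricks. The experts and gate weights below are invented.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[Sequence[float]], float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: Sequence[float],
                gate_weights: Sequence[Sequence[float]],
                experts: Sequence[Expert],
                k: int = 2) -> Tuple[float, List[int]]:
    """Score every expert with a linear gate, run only the top k, mix their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])  # renormalise over the chosen experts
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top


# Four toy "experts" and a hand-written gate (one row of weights per expert).
EXPERTS: List[Expert] = [sum, max, min, lambda v: sum(v) / len(v)]
GATE = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0], [-1.0, 0.0]]
```

Calling `moe_forward([1.0, 0.0], GATE, EXPERTS)` runs only experts 0 and 2. The other two are never evaluated, which is where the compute saving comes from.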
Agent orchestration: specialists as separate programs
MoE puts the specialists inside one model. The other approach puts them outside, as separate programs that a coordinator directs. This is the world of agents.
An orchestrator decomposes a big problem ("refactor this module and add tests"), hands pieces to sub-agents or specialised tools, and integrates what comes back. That is roughly the role the prefrontal cortex plays in marshalling the rest of the brain, and the role of a conductor in front of an orchestra, where no single player carries the symphony.
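Stripped to its skeleton, the decompose, delegate, integrate cycle looks something like this. It is a sketch under heavy assumptions: the plan is hard-coded (a real orchestrator would typically ask a model to produce it), and the role names and sub-agents are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

SubAgent = Callable[[str], str]


def plan(task: str) -> List[Tuple[str, str]]:
    """Hypothetical fixed plan: split one task into (role, sub-task) pairs."""
    return [("coder", f"refactor: {task}"), ("tester", f"write tests for: {task}")]


def orchestrate(task: str, agents: Dict[str, SubAgent]) -> str:
    """Delegate each planned sub-task to its sub-agent, then merge the reports."""
    reports: List[str] = []
    for role, subtask in plan(task):
        agent = agents.get(role)
        if agent is None:
            reports.append(f"{role}: skipped (no agent registered)")
            continue
        reports.append(f"{role}: {agent(subtask)}")
    return "\n".join(reports)
```

Each sub-agent sees only its own sub-task string, not the whole conversation, which mirrors the "separate context" idea: specialists stay focused, and the orchestrator is the only part that holds the full picture.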
Neuro-symbolic systems: neural nets plus old-fashioned logic
A third strand mixes kinds of models, not just instances. A neuro-symbolic system pairs neural networks (great at perception, fuzzy pattern-matching) with classical, rule-based logic (great at being exact, fast and explainable). You use a neural model only where its judgement is irreplaceable, and let transparent logic do the rest. The brain does something similar: fast intuitive pattern recognition feeding slower, structured reasoning.
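A small sketch of that division of labour: exact rules decide the cases they can, and a model is consulted only for the rest. The audio-frame setting, the `voicedness` feature, the thresholds and `neural_stub` (a placeholder, not a trained network) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Frame:
    energy: float      # loudness, from 0 to 1
    voicedness: float  # hypothetical feature, from 0 to 1


def rule_layer(frame: Frame) -> Optional[str]:
    """Exact, explainable logic. Returns a verdict, or None if no rule applies."""
    if frame.energy < 0.05:
        return "silence (rule: energy below floor)"
    return None


def neural_stub(frame: Frame) -> float:
    """Placeholder for a trained classifier's speech probability."""
    return frame.voicedness


def classify(frame: Frame,
             model: Callable[[Frame], float] = neural_stub,
             threshold: float = 0.5) -> str:
    """Let the rules answer first; ask the model only for ambiguous frames."""
    verdict = rule_layer(frame)
    if verdict is not None:
        return verdict
    p = model(frame)
    label = "speech" if p >= threshold else "noise"
    return f"{label} (model: p={p:.2f})"
```

Every answer carries its own explanation: either the rule that fired or the model score behind the call. That is the explainability benefit in miniature.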
The present already shows it: Claude Code as a worked example
This isn't only a forecast. Look at how an agentic coding tool like Claude Code actually runs, and you'll see a small society of cooperating parts, deliberately split between your machine and the cloud.
- A big model in the cloud does the heavy thinking. The expensive, general reasoning — planning, writing code, understanding your request — runs on powerful servers. This is the "deliberation" tier: slow(er), costly, but very capable. Call it the cortex.
- Agents and tools execute locally, on your device. Reading your files, running your tests, editing code, searching your repository — these happen on your machine, where the data already lives and the action needs to take effect. That's fast, private, and grounded in your actual context. Call it the peripheral nervous system: the senses and the hands.
- It spawns sub-agents for sub-tasks. A lead agent can delegate a focused job — "explore this part of the codebase," "review this change" — to a separate sub-agent with its own context, then fold the result back in. Specialists, summoned on demand, exactly like recruiting a brain region for a task.
- It keeps and recalls memory. Durable notes persist across sessions and are pulled back in when relevant — a hippocampus for the workflow.
- It reaches out to servers and the web when needed. Fetching documentation, querying an API, running a cloud task — escalating beyond what's local when the local context isn't enough.
Notice this is the complement to pure on-device AI, not a contradiction of it. Edge AI argues for doing work locally; this argues for dividing work across local and remote by what each does best. The brain does both at once: reflexes at the edge, deliberation at the centre, constant traffic between them.
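The general shape of such a loop can be sketched as follows. To be clear, this is not Claude Code's implementation or API: `scripted_model` is a hard-coded stand-in for the remote model, the tools are fakes that touch neither disk nor network, and every name here is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name or "done", argument)


def scripted_model(history: List[str]) -> Action:
    """Stand-in for the remote model: picks the next action from what it has seen."""
    if not any(h.startswith("read_file") for h in history):
        return ("read_file", "app.py")
    if not any(h.startswith("run_tests") for h in history):
        return ("run_tests", "")
    return ("done", "tests inspected")


# Fake local tools: they return canned strings instead of touching your machine.
LOCAL_TOOLS: Dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda _: "2 passed",
}


def agent_loop(goal: str,
               model: Callable[[List[str]], Action] = scripted_model,
               tools: Dict[str, Callable[[str], str]] = LOCAL_TOOLS,
               max_steps: int = 10) -> List[str]:
    """Ask the model for an action, run it locally, feed the result back, repeat."""
    history = [f"goal: {goal}"]
    for _ in range(max_steps):
        name, arg = model(history)
        if name == "done":
            history.append(f"done: {arg}")
            return history
        observation = tools[name](arg)
        history.append(f"{name}({arg}) -> {observation}")
    history.append("stopped: step limit reached")
    return history
```

The split is visible in the code: deciding what to do next happens in `model`, while everything that touches files or tests happens in `tools`. The step limit is the simplest possible guard against a loop that never decides it is finished.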
Brain ↔ distributed AI, side by side
| In the nervous system | In a multi-model AI system | What it buys you |
|---|---|---|
| Reflex arc (spinal cord) | Small on-device model / fast local check | Instant response, no round-trip |
| Amygdala "low road" | Cheap detector biased to catch everything | Never miss the rare, costly event |
| Cortex deliberation | Large model in the cloud | Deep, general reasoning when it matters |
| Thalamus (relay) | Router / gating network / dispatcher | Send each input to the right specialist |
| Specialised cortical regions | Experts (MoE) or specialist models/agents | Better & cheaper than one generalist |
| Prefrontal cortex | Orchestrator / lead agent | Plan, delegate, integrate results |
| Cerebellum (co-processor) | Dedicated tool or accelerator | Offload a narrow job efficiently |
| Hippocampus | Memory store / retrieval | Carry context across time |
| Graceful degradation | Fallbacks & redundancy | One part fails, the system survives |
Why a combination beats a monolith
🗿 One giant model for everything
- Pays its full cost on every request, trivial or hard
- One failure mode can affect everything
- Hard to update one skill without touching all
- Sensitive data must go wherever the model lives
- Opaque — difficult to see why it decided
🤝 Many small models cooperating
- Right-sized — cheap path for easy cases, escalate only when needed
- Robust — a failed specialist degrades gracefully, with fallbacks
- Composable — swap or upgrade one model without retraining the rest
- Private — keep sensitive steps on-device, send only what's necessary
- Explainable — you can inspect each part's contribution
These advantages compound. A society of models can put the privacy-sensitive, latency-critical work at the edge and reserve the cloud for genuinely hard reasoning — getting the privacy and speed of local with the power of remote, and paying the big cost only when the problem deserves it.
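The "pay the big cost only when the problem deserves it" claim is simple arithmetic, sketched below. The cost units and the escalation rate in the usage note are illustrative, not measurements from any real system.

```python
def expected_cost(small_cost: float, large_cost: float, escalation_rate: float) -> float:
    """Average cost per request when every request pays for the small model
    and only the escalated fraction also pays for the large one."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return small_cost + escalation_rate * large_cost


def break_even_rate(small_cost: float, large_cost: float) -> float:
    """Escalation rate at which the cascade costs the same as always
    calling the large model (small + rate * large == large)."""
    return 1.0 - small_cost / large_cost
```

If the small path costs 1 unit, the large path 100 units, and one request in ten escalates, the average is 11 units per request instead of 100. The flip side is the overhead on hard cases: an escalated request pays 101 units, so the cascade only wins while the escalation rate stays below the break-even point (0.99 in this example).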
The honest part: cooperation is hard
None of this is free, and it would be dishonest to pretend otherwise. Coordinating many models introduces problems a monolith doesn't have:
- Orchestration overhead. Splitting and recombining work has its own cost; for easy tasks, one model is simply simpler.
- Communication. The parts need a shared "language" — protocols, schemas, well-defined interfaces — or they misunderstand each other.
- Error propagation. A confident-but-wrong specialist can mislead the orchestrator. Systems need verification and skepticism, much as the cortex overrides a jumpy amygdala.
- Control and coherence. Who arbitrates when specialists disagree? Neuroscience even has a name for the deep version of this — the binding problem: how does distributed processing produce one unified experience? Multi-agent systems face their own milder version of it every day.
The reason to be optimistic is that biology solved versions of all of these, which tells us they're solvable — and that the answers tend to look like routing, hierarchy, feedback loops and graceful fallback, exactly the tools software is now reaching for.
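Two of the problems above, communication and error propagation, have a small and familiar software answer: a fixed message schema plus an independent check before the orchestrator trusts a claim. The field names, the 0.6 confidence floor and the `verify` callback are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Report:
    """The shared 'language': every specialist must report in this shape."""
    specialist: str
    claim: str
    confidence: float  # self-reported, from 0 to 1

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def accept(report: Report,
           verify: Callable[[str], bool],
           min_confidence: float = 0.6) -> bool:
    """Trust a claim only if it is confident AND an independent check agrees."""
    return report.confidence >= min_confidence and verify(report.claim)
```

A specialist that reports "tests pass" with 0.95 confidence is still rejected if `verify` (for instance, actually re-running the tests) says otherwise. Self-reported confidence is treated as a filter, never as proof.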
CricCuts: a tiny society of models you can hold
You don't need a data centre to see the pattern. CricCuts — a free, on-device cricket video editor — is a small working instance of it. It doesn't throw one giant model at "find the highlights." It assembles a little team: fast classical signal processing to spot candidate moments, a tiny neural voice-activity model to tell a shouted "shot!" from a bat-crack, an optional pose model and a speech recogniser for when you want more, and a coordinator that weighs their evidence — with you closing the loop by confirming a few clips. Cheap specialists, summoned only when they earn their place, cooperating to do one job well, entirely on your phone.
It's the same philosophy as the brain and the same philosophy as a well-built agent system, just at pocket scale: prefer the smallest capable part, keep each decision explainable, escalate only when needed, and let the whole be smarter than any piece. If you'd like the full walkthrough, our interactive course on how a phone watches cricket and cuts the highlights itself takes you through every specialist in the team.
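What might "a coordinator that weighs their evidence" look like? One plausible shape is a weighted average over whichever detectors actually ran. This is an illustration only and not CricCuts' actual code: the detector names, weights and threshold are invented.

```python
from typing import Dict, Optional

# Hypothetical trust weights for each specialist's score.
WEIGHTS: Dict[str, float] = {"signal": 0.5, "voice": 0.3, "pose": 0.2}


def fuse(scores: Dict[str, Optional[float]],
         weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average over the detectors that ran (None means 'not run')."""
    present = {k: v for k, v in scores.items() if v is not None and k in weights}
    total = sum(weights[k] for k in present)
    if total == 0:
        return 0.0
    return sum(weights[k] * v for k, v in present.items()) / total


def is_highlight(scores: Dict[str, Optional[float]], threshold: float = 0.6) -> bool:
    """Keep a clip when the combined evidence clears the threshold."""
    return fuse(scores) >= threshold
```

Because missing detectors are dropped and the remaining weights renormalised, an optional specialist such as the pose model can be switched off without changing what the score means.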
Where this is heading
Picture your personal AI not as a single app you talk to, but as a distributed nervous system spread across your devices and the cloud: instant reflexes on your phone and watch, specialists for vision, speech and health, memory that persists, and a coordinator that pulls in a powerful remote model only for the genuinely hard problems — keeping the private, urgent, everyday work local. Less "one oracle in a warehouse," more "a calm, well-organised team that mostly works quietly in the background, on your side and on your hardware."
The future of AI probably isn't a bigger brain. It's a better-organised one — many small models, each excellent at its job, wired together with the same elegant division of labour that's been running in your head all along.
Glossary
- Model: A program trained on data to turn inputs into outputs (e.g. audio in, "is this speech?" out).
- Training vs inference: Training is the expensive one-time process of teaching a model from data; inference is running the finished model to get an answer, the part that happens every time you use it.
- Latency: How long you wait for a response. Local work has low latency; a round-trip to a server adds more.
- Edge / on-device: Running a model on the user's own hardware (phone, watch, camera) rather than on a remote server.
- Mixture-of-Experts (MoE): One model built from many specialist sub-networks plus a router that activates only a few per input, giving big-model capacity while spending closer to small-model compute on each input.
- Router / gating network: The component that decides which expert(s) or specialist should handle a given input; the system's "thalamus."
- Agent: A model given a goal and the ability to take actions (run tools, edit files, search) in a loop until the goal is met.
- Orchestration: A lead agent planning a task, delegating sub-tasks to other agents/tools, and combining the results.
- Neuro-symbolic: Combining neural networks (perception, fuzzy matching) with classical rule-based logic (exact, fast, explainable).
- Reflex arc: A fast, hard-wired sensor→spinal-cord→muscle loop that acts without involving the brain; biology's "tiny model at the edge."
- Graceful degradation: When one part fails, the overall system loses a little capability instead of collapsing entirely.
- Binding problem: The neuroscience puzzle of how many distributed brain processes combine into a single, unified experience; the hardest form of "how do the parts cooperate?"
See a small society of models in action
CricCuts puts a cooperating team of small, on-device models in your hand — free, private and offline. It's edge AI and multi-model design you can actually hold.
More on the CricCuts blog: start with why the future of AI is small models on the edge.