Why the Future of AI Might Be Many Small Models Working Together

There's a comfortable story about where AI is going: the models get bigger, the data centres get bigger, and one day a single colossal "brain" does everything. It's a tidy story. It's also probably wrong — or at least, only half the picture. The more interesting future isn't one enormous mind. It's many small minds, each good at one thing, wired together so the whole is far more than the sum of its parts.

If that sounds speculative, it shouldn't. You're running an existence proof of it right now, between your ears. And the first credible engineering versions are already shipping. This article connects those two things: how your nervous system is a multi-model system, why software is converging on the same shape, and what tools like Claude Code — which run small agents locally while leaning on a big model in the cloud — tell us about the direction of travel. I'll define the jargon as we go.

First, three words we'll use constantly. A model is a program that's been trained on data to map inputs to outputs (sound → "is this speech?"). Inference is the act of running a trained model to get an answer (cheap and fast per call, compared with the expensive up-front training). Latency is simply how long you wait for that answer, and it is often the difference between a tool that feels alive and one that feels broken.

From "one big brain" to "a society of minds"

The instinct to scale a single monolith is understandable — it has worked astonishingly well for large language models. But intelligence in nature is conspicuously not one giant neuron. Your brain has on the order of 86 billion neurons, organised into dozens of specialised regions that talk to each other. No single region "is" you; the experience of being you emerges from their collaboration.

The computer scientist Marvin Minsky made this his life's argument. In The Society of Mind (1986) he proposed that the mind is built from many small processes — "agents" — each individually dumb, that produce intelligence only through their interaction. Decades later, that framing is turning out to be a remarkably good blueprint for building software, not just for describing biology.

The unit of intelligence may not be the model. It may be the system of models — and the wiring, routing and feedback that connect them.

Your nervous system is already a multi-model system

Before we talk about software, look at the reference design that evolution spent a few hundred million years debugging. The human nervous system is not a single processor. It's a distributed, layered, specialised network with several properties that any AI architect would recognise, and envy.

1. It splits "reflex" from "deliberation" by latency

When you touch something hot, your hand pulls back before you consciously feel the pain. That's a reflex arc: the signal goes from sensor to spinal cord and straight back to the muscle, skipping the brain entirely. It's fast because it's local and simple. Meanwhile the slower signal reaches your brain, which does the expensive work of interpreting "that was the kettle, move it, check for a burn."

Reflex arc: a short, hard-wired loop (sensory neuron → a relay neuron in the spinal cord → motor neuron) that produces a protective movement without waiting for the brain. It is nature's version of running a tiny model right at the sensor because a round-trip to headquarters would be too slow.

This is exactly the tiered-latency design modern systems want: handle the urgent, narrow, common case instantly and locally; escalate the rare, hard, ambiguous case to something slower and more powerful. The body doesn't send every stimulus to the cortex, and a good AI system shouldn't send every request to a giant model.
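As a minimal sketch of that tiered design, assume a tiny local handler that either answers or declines, and a stand-in for a slow remote model. The names local_reflex, remote_model and handle are hypothetical, invented here for illustration.

```python
from typing import Callable, Optional, Tuple

def local_reflex(request: str) -> Optional[str]:
    """Hypothetical 'reflex' tier: answers only the narrow cases it knows."""
    canned = {"ping": "pong", "version": "1.0"}
    return canned.get(request.strip().lower())  # None means "not my job"

def remote_model(request: str) -> str:
    """Hypothetical 'deliberation' tier: stands in for a slow, capable model."""
    return f"[remote answer to: {request}]"

def handle(request: str,
           fast: Callable[[str], Optional[str]] = local_reflex,
           slow: Callable[[str], str] = remote_model) -> Tuple[str, str]:
    """Try the cheap local tier first; escalate only if it declines."""
    answer = fast(request)
    if answer is not None:
        return "local", answer
    return "remote", slow(request)
```

The design choice worth noticing is that the fast tier is allowed to say "I don't know". Escalation is driven by that refusal, not by a separate classifier guessing how hard the request is.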

2. It has a fast, cheap "is this dangerous?" model running in parallel

The neuroscientist Joseph LeDoux described two routes for fear. A "low road" runs from the thalamus straight to the amygdala — quick and dirty, enough to flinch at a stick that might be a snake. A "high road" goes thalamus → visual cortex → amygdala — slower, but able to confirm "it's just a stick." A cheap, fast model triggers a cautious default; a slower, accurate model corrects it. The cost of a false alarm (flinching at a stick) is tiny; the cost of a miss (ignoring a snake) is fatal — so the architecture is deliberately biased toward fast-and-cautious, then refined.

If you've read our walkthrough of how CricCuts works, this will feel familiar: a cheap detector casts a wide net and a more careful stage refines it. Biasing toward "catch everything, then filter" is not a hack — it's the same trade-off your amygdala makes.
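Here is a rough sketch of "catch everything, then filter", with made-up scores and thresholds. The first stage uses a deliberately low bar so it rarely misses; the second stage is stricter and runs only on the candidates. Nothing here is taken from LeDoux or from CricCuts; the field names and numbers are assumptions for illustration.

```python
def cheap_detector(coarse_score: float, threshold: float = 0.3) -> bool:
    """Low road: a low threshold, so false alarms are common but misses are rare."""
    return coarse_score >= threshold

def careful_check(fine_score: float, threshold: float = 0.6) -> bool:
    """High road: a stricter test, assumed to be too expensive to run on everything."""
    return fine_score >= threshold

def cascade(events):
    """Return (candidates, confirmed): stage one casts the net, stage two filters it."""
    candidates = [e for e in events if cheap_detector(e["coarse_score"])]
    confirmed = [e for e in candidates if careful_check(e["fine_score"])]
    return candidates, confirmed

# Illustrative inputs only.
EVENTS = [
    {"name": "stick", "coarse_score": 0.5, "fine_score": 0.2},
    {"name": "snake", "coarse_score": 0.6, "fine_score": 0.9},
    {"name": "leaf", "coarse_score": 0.1, "fine_score": 0.0},
]
```

With these inputs the stick and the snake both pass the cheap stage, and only the snake survives the careful one. The expensive check never sees the leaf at all.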

3. It routes, it specialises, and it has a co-processor

The thalamus acts like a router, relaying incoming sensory streams to the right cortical region. The cortex is full of specialists: visual cortex for sight, auditory cortex for sound, and so on — regions that are, in effect, purpose-built models. The cerebellum is a dedicated co-processor for timing and smooth movement; the basal ganglia gate which action actually fires; the prefrontal cortex plans and supervises; the hippocampus writes new memories. None of these is general-purpose. Intelligence is what happens when they coordinate.

A quick neuro-glossary

Thalamus: the central relay/router for sensory signals.
Amygdala: fast threat detection.
Cerebellum: timing and motor coordination (a specialised co-processor).
Basal ganglia: action selection ("which move actually fires?").
Prefrontal cortex: planning and executive control (the orchestrator).
Hippocampus: forming long-term memories.
Myelin: fatty insulation on nerve fibres that speeds signals up; nature optimising latency.

4. It degrades gracefully

Damage one region and you often lose one capability, not the whole person. A distributed system of specialists is inherently more robust than a monolith where one failure takes everything down. That fault-tolerance is a feature of the architecture, not an accident.
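In software, the same property might look like the sketch below: each specialist runs independently, and a failure in one is recorded, not propagated. The function name and the example specialists are hypothetical.

```python
def run_with_fallbacks(specialists, x):
    """Collect results from the specialists that succeed; note the ones that fail."""
    results, failed = {}, []
    for name, fn in specialists.items():
        try:
            results[name] = fn(x)
        except Exception:
            failed.append(name)  # lose one capability, keep the rest
    return results, failed

def broken_pose_model(x):
    """Stand-in for a specialist that is unavailable on this device."""
    raise RuntimeError("pose model unavailable")

SPECIALISTS = {
    "audio": lambda x: f"audio features of {x}",
    "pose": broken_pose_model,
    "speech": lambda x: f"transcript of {x}",
}
```

Calling run_with_fallbacks(SPECIALISTS, "clip.mp4") still returns the audio and speech results and reports "pose" as failed. What the caller does with a partial answer is a separate design decision.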

The same shape is appearing in AI

Now the engineering. Three trends, each independently pushing away from "one model does everything" toward "many models cooperate." They map onto the brain more closely than their inventors usually admit.

Mixture-of-Experts: specialists inside one model

Many of today's largest models are quietly Mixture-of-Experts (MoE). Instead of one dense network where every parameter fires for every input, an MoE model contains many "expert" sub-networks plus a small router (also called a gating network) that, for each input, activates only the few experts best suited to it.

Mixture-of-Experts (MoE): a model made of many specialist sub-networks and a router that picks a handful to run per input. You get the capacity of a huge model at a per-input compute cost closer to a small one's, because most of it stays inactive at any moment (although every expert still has to be held in memory). The router is doing a job your thalamus would recognise: send this input to the right specialist.

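The parallel to the brain is close: specialised circuits, a routing mechanism, and sparse activation, where only the relevant parts do most of the work. Your visual cortex is not doing the heavy lifting when you're listening to music. The analogy is looser in one respect: the experts in a trained MoE model tend not to divide up as tidily as "sight" and "sound".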
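To make the routing concrete, here is a toy version of top-k gating in plain Python. It is a sketch of one common variant (softmax over the k highest router scores), not any particular model's implementation. In a real MoE layer the router scores come from a small learned layer applied to each token; here they are simply passed in, and the four "experts" are toy functions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a short list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and blend their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_scores, k))

# Toy experts and assumed router scores, for illustration only.
EXPERTS = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
SCORES = [0.1, 2.0, -1.0, 2.0]
```

With these scores only experts 1 and 3 run, each with weight 0.5. The other two are never called, which is where the compute saving comes from.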

Agent orchestration: specialists as separate programs

MoE puts the specialists inside one model. The other approach puts them outside, as separate programs that a coordinator directs. This is the world of agents.

Agent & orchestration: an agent is a model given a goal plus the ability to take actions (call a tool, run a command, search the web) in a loop until the goal is met. Orchestration is one "lead" agent planning a task, delegating sub-tasks to other agents or tools, and combining their results. Think project manager and team, not lone genius.
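Stripped of the model, the agent loop itself is small. The sketch below assumes a toy policy in place of a real model and two toy tools; run_agent, toy_policy and TOOLS are hypothetical names invented for illustration.

```python
def run_agent(state, goal_reached, choose_action, tools, max_steps=10):
    """Loop: check the goal, pick an action, run the tool, repeat (with a step cap)."""
    for step in range(max_steps):
        if goal_reached(state):
            return state, step
        tool_name, arg = choose_action(state)
        state = tools[tool_name](state, arg)
    return state, max_steps  # gave up: the cap stops a confused agent looping forever

# Toy tools that act on a number; real tools would read files, run tests and so on.
TOOLS = {
    "add": lambda s, n: s + n,
    "double": lambda s, _: s * 2,
}

def toy_policy(state, target=20):
    """Stand-in for the model: double while that stays under the target, else add 1."""
    return ("double", None) if state * 2 <= target else ("add", 1)
```

Starting from 5 with a goal of 20, the loop doubles twice and stops. The step cap matters more than it looks: it is the simplest guard against an agent that never decides it is done.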

An orchestrator decomposes a big problem ("refactor this module and add tests"), hands pieces to sub-agents or specialised tools, and integrates what comes back. That is precisely the role of the prefrontal cortex marshalling the rest of the brain — and precisely the role of a conductor in front of an orchestra, where no single player carries the symphony.
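A bare-bones orchestrator, under heavy assumptions, might look like this. The plan is hard-coded where a real lead agent would ask a model to decompose the request, and the three sub-agents are plain functions standing in for separate programs. All names are hypothetical.

```python
def explore(task):
    """Stand-in sub-agent: would read the code and report back."""
    return f"notes on {task}"

def refactor(task):
    """Stand-in sub-agent: would edit the code."""
    return f"refactored {task}"

def write_tests(task):
    """Stand-in sub-agent: would add tests."""
    return f"tests for {task}"

SUB_AGENTS = {"explore": explore, "refactor": refactor, "test": write_tests}

def plan(request):
    """Hypothetical fixed plan; a real lead agent would generate this per request."""
    return [("explore", request), ("refactor", request), ("test", request)]

def orchestrate(request):
    """Delegate each planned step to its sub-agent, then integrate the results."""
    steps = [SUB_AGENTS[role](subtask) for role, subtask in plan(request)]
    return {"request": request, "steps": steps}
```

The lead agent never does the work itself here. Its whole job is the plan and the integration, which is the project-manager role described above.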

Neuro-symbolic systems: neural nets plus old-fashioned logic

A third strand mixes kinds of models, not just instances. A neuro-symbolic system pairs neural networks (great at perception, fuzzy pattern-matching) with classical, rule-based logic (great at being exact, fast and explainable). You use a neural model only where its judgement is irreplaceable, and let transparent logic do the rest. The brain does something similar: fast intuitive pattern recognition feeding slower, structured reasoning.
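One way to picture that split is a classifier where exact rules go first and a fuzzy model sees only what the rules cannot settle. In the sketch below the "neural" part is faked with a keyword count so the example stays self-contained; the allowlist, thresholds and function names are all assumptions.

```python
def fuzzy_score(text: str) -> float:
    """Stand-in for a neural classifier: returns a spam-likeness in [0, 1]."""
    hits = sum(word in text.lower() for word in ("free", "winner", "urgent"))
    return min(1.0, hits / 2)

def classify(sender: str, text: str,
             allowlist=frozenset({"boss@example.com"})):
    """Return (decision, reason); the reason names the rule or score that decided."""
    # Symbolic rules: exact, cheap, and explainable.
    if sender in allowlist:
        return "allow", "rule: sender on allowlist"
    if not text.strip():
        return "block", "rule: empty message"
    # Only the ambiguous remainder reaches the (stand-in) model.
    score = fuzzy_score(text)
    decision = "block" if score >= 0.5 else "allow"
    return decision, f"model: score={score:.2f}"
```

Every decision comes back with a reason, and for the rule-based ones that reason is complete. That is the explainability benefit in miniature.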

The present already shows it: Claude Code as a worked example

This isn't only a forecast. Look at how an agentic coding tool like Claude Code actually runs, and you'll see a small society of cooperating parts, deliberately split between your machine and the cloud.

A big model in the cloud does the heavy thinking. The expensive, general reasoning — planning, writing code, understanding your request — runs on powerful servers. This is the "deliberation" tier: slow(er), costly, but very capable. Call it the cortex.
Agents and tools execute locally, on your device. Reading your files, running your tests, editing code, searching your repository — these happen on your machine, where the data already lives and the action needs to take effect. That's fast, private, and grounded in your actual context. Call it the peripheral nervous system: the senses and the hands.
It spawns sub-agents for sub-tasks. A lead agent can delegate a focused job — "explore this part of the codebase," "review this change" — to a separate sub-agent with its own context, then fold the result back in. Specialists, summoned on demand, exactly like recruiting a brain region for a task.
It keeps and recalls memory. Durable notes persist across sessions and are pulled back in when relevant — a hippocampus for the workflow.
It reaches out to servers and the web when needed. Fetching documentation, querying an API, running a cloud task — escalating beyond what's local when the local context isn't enough.


The pattern, stated plainly: cognition where it's most capable (a large model in the cloud), action and context where they're most grounded (small agents and tools on your device), specialists summoned on demand, memory persisted, and escalation to bigger resources only when the cheap, local path isn't enough. That is a nervous system's division of labour, rebuilt in software.

Notice this is the complement to pure on-device AI, not a contradiction of it. Edge AI argues for doing work locally; this argues for dividing work across local and remote by what each does best. The brain does both at once: reflexes at the edge, deliberation at the centre, constant traffic between them.

Brain ↔ distributed AI, side by side

In the nervous system	In a multi-model AI system	What it buys you
Reflex arc (spinal cord)	Small on-device model / fast local check	Instant response, no round-trip
Amygdala "low road"	Cheap detector biased to catch everything	Never miss the rare, costly event
Cortex deliberation	Large model in the cloud	Deep, general reasoning when it matters
Thalamus (relay)	Router / gating network / dispatcher	Send each input to the right specialist
Specialised cortical regions	Experts (MoE) or specialist models/agents	Better & cheaper than one generalist
Prefrontal cortex	Orchestrator / lead agent	Plan, delegate, integrate results
Cerebellum (co-processor)	Dedicated tool or accelerator	Offload a narrow job efficiently
Hippocampus	Memory store / retrieval	Carry context across time
Graceful degradation	Fallbacks & redundancy	One part fails, the system survives

Why a combination beats a monolith

🗿 One giant model for everything

Pays its full cost on every request, trivial or hard
One failure mode can affect everything
Hard to update one skill without touching all
Sensitive data must go wherever the model lives
Opaque — difficult to see why it decided

🤝 Many small models cooperating

Right-sized — cheap path for easy cases, escalate only when needed
Robust — a failed specialist degrades gracefully, with fallbacks
Composable — swap or upgrade one model without retraining the rest
Private — keep sensitive steps on-device, send only what's necessary
Explainable — you can inspect each part's contribution

These advantages compound. A society of models can put the privacy-sensitive, latency-critical work at the edge and reserve the cloud for genuinely hard reasoning — getting the privacy and speed of local with the power of remote, and paying the big cost only when the problem deserves it.
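A back-of-envelope calculation shows why the cost argument can work, and when it doesn't. The sketch assumes every request pays for the small model and only the hard share also pays for the large one. The cost units and the 75% figure are invented for illustration.

```python
def cascade_cost(p_easy: float, cost_small: float, cost_large: float) -> float:
    """Expected cost per request when the small model runs first on everything."""
    return cost_small + (1 - p_easy) * cost_large

def cascade_is_cheaper(p_easy: float, cost_small: float, cost_large: float) -> bool:
    """The cascade wins only if the small model costs less than the large calls it saves."""
    return cascade_cost(p_easy, cost_small, cost_large) < cost_large

# Assumed numbers: 75% of requests are easy, small model costs 1 unit, large costs 20.
EXAMPLE = cascade_cost(p_easy=0.75, cost_small=1.0, cost_large=20.0)
```

Under these assumptions the cascade averages 6 units against 20 for sending everything to the large model. If almost nothing is easy, the small model becomes pure overhead, which is the "orchestration overhead" caveat in numbers.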

The honest part: cooperation is hard

None of this is free, and it would be dishonest to pretend otherwise. Coordinating many models introduces problems a monolith doesn't have:

Orchestration overhead. Splitting and recombining work has its own cost; for easy tasks, one model is simply simpler.
Communication. The parts need a shared "language" — protocols, schemas, well-defined interfaces — or they misunderstand each other.
Error propagation. A confident-but-wrong specialist can mislead the orchestrator. Systems need verification and scepticism, much as the cortex overrides a jumpy amygdala.
Control and coherence. Who arbitrates when specialists disagree? Neuroscience even has a name for the deep version of this — the binding problem: how does distributed processing produce one unified experience? Multi-agent systems face their own milder version of it every day.
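Two of these problems, communication and error propagation, have a simple first line of defence: a fixed report shape and an orchestrator that refuses to take confidence on trust. The sketch below is one assumed way to do it; the Report fields and the 0.7 threshold are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Report:
    """Assumed shared 'language' between a specialist and the orchestrator."""
    claim: str
    confidence: float
    evidence: List[str] = field(default_factory=list)

def accept(report, min_confidence: float = 0.7) -> bool:
    """Sceptical check: right shape, sane confidence, and evidence actually attached."""
    if not isinstance(report, Report):
        return False  # wrong shape: the parts are not speaking the same language
    if not 0.0 <= report.confidence <= 1.0:
        return False  # malformed confidence value
    return report.confidence >= min_confidence and len(report.evidence) > 0
```

A specialist that reports "tests pass" at 0.99 confidence with no evidence is rejected, while a more modest claim that includes the test runner's output is accepted. Real verification would go further and re-check the evidence itself.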

The reason to be optimistic is that biology solved versions of all of these, which tells us they're solvable — and that the answers tend to look like routing, hierarchy, feedback loops and graceful fallback, exactly the tools software is now reaching for.

CricCuts: a tiny society of models you can hold

You don't need a data centre to see the pattern. CricCuts — a free, on-device cricket video editor — is a small working instance of it. It doesn't throw one giant model at "find the highlights." It assembles a little team: fast classical signal processing to spot candidate moments, a tiny neural voice-activity model to tell a shouted "shot!" from a bat-crack, an optional pose model and a speech recogniser for when you want more, and a coordinator that weighs their evidence — with you closing the loop by confirming a few clips. Cheap specialists, summoned only when they earn their place, cooperating to do one job well, entirely on your phone.
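To give a feel for what "a coordinator that weighs their evidence" could mean, here is a generic weighted-average sketch. It is not CricCuts's actual code: the specialist names, weights and scores are invented, and optional specialists that did not run are represented by None.

```python
def fuse(scores, weights):
    """Weighted average over the specialists that actually reported a score."""
    used = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(weights[name] for name in used)
    if total_weight == 0:
        return 0.0  # nobody reported: no evidence for a highlight
    return sum(weights[name] * s for name, s in used.items()) / total_weight

# Hypothetical weights: how much the coordinator trusts each specialist.
WEIGHTS = {"signal": 1.0, "voice": 2.0, "pose": 1.0}

# The optional pose model was not run for this clip, so it reports None.
CLIP_SCORES = {"signal": 0.9, "voice": 0.6, "pose": None}
```

Skipping the missing specialist and renormalising the weights is what lets the optional models stay optional: the clip still gets a score when the pose model is switched off.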

It's the same philosophy as the brain and the same philosophy as a well-built agent system, just at pocket scale: prefer the smallest capable part, keep each decision explainable, escalate only when needed, and let the whole be smarter than any piece. If you'd like the full walkthrough, our interactive course on how a phone watches cricket and cuts the highlights itself takes you through every specialist in the team.

Where this is heading

Picture your personal AI not as a single app you talk to, but as a distributed nervous system spread across your devices and the cloud: instant reflexes on your phone and watch, specialists for vision, speech and health, memory that persists, and a coordinator that pulls in a powerful remote model only for the genuinely hard problems — keeping the private, urgent, everyday work local. Less "one oracle in a warehouse," more "a calm, well-organised team that mostly works quietly in the background, on your side and on your hardware."

The future of AI probably isn't a bigger brain. It's a better-organised one — many small models, each excellent at its job, wired together with the same elegant division of labour that's been running in your head all along.

Glossary

Model: A program trained on data to turn inputs into outputs (e.g. audio in, "is this speech?" out).
Training vs inference: Training is the expensive one-time process of teaching a model from data; inference is running the finished model to get an answer — the part that happens every time you use it.
Latency: How long you wait for a response. Local work has low latency; a round-trip to a server adds more.
Edge / on-device: Running a model on the user's own hardware (phone, watch, camera) rather than on a remote server.
Mixture-of-Experts (MoE): One model built from many specialist sub-networks plus a router that activates only a few per input — big-model knowledge at small-model running cost.
Router / gating network: The component that decides which expert(s) or specialist should handle a given input — the system's "thalamus."
Agent: A model given a goal and the ability to take actions (run tools, edit files, search) in a loop until the goal is met.
Orchestration: A lead agent planning a task, delegating sub-tasks to other agents/tools, and combining the results.
Neuro-symbolic: Combining neural networks (perception, fuzzy matching) with classical rule-based logic (exact, fast, explainable).
Reflex arc: A fast, hard-wired sensor→spinal-cord→muscle loop that acts without involving the brain — biology's "tiny model at the edge."
Graceful degradation: When one part fails, the overall system loses a little capability instead of collapsing entirely.
Binding problem: The neuroscience puzzle of how many distributed brain processes combine into a single, unified experience — the hardest form of "how do the parts cooperate?"

See a small society of models in action

CricCuts puts a cooperating team of small, on-device models in your hand — free, private and offline. It's edge AI and multi-model design you can actually hold.


More on the CricCuts blog — start with why the future of AI is small models on the edge.