CricCuts is a free, intelligent cricket video editor for Android that automatically turns raw net and match footage into a clean highlights reel. It finds every bat-on-ball moment, scores the best shots and exports a shareable reel — entirely on your phone, with no cloud upload.

What AI models does CricCuts use?

CricCuts runs small, open-source models on-device: MediaPipe Pose Landmarker for optional body/pose checks, Vosk for offline speech recognition (commentary keyword tagging), and Silero VAD (via ONNX Runtime) for voice-activity detection. Most of the pipeline is classical signal processing; the neural models are used surgically where they add the most value.

Does CricCuts upload my videos to the cloud?

No. All analysis and rendering happens on-device. Your footage never leaves your phone, which keeps it private, works offline, and avoids data-centre emissions.

Is CricCuts free?

Yes. Because everything runs on-device there are no server bills to pass on, and there is no account wall to try it.

How CricCuts Works — An Interactive Course on On-Device Cricket AI

Module 00 · 5 min · Orientation

The big picture: AI is moving to the edge

You hand CricCuts one or more phone videos of a session. Minutes later you get a tidy reel of your actual shots — the bat-on-ball moments — with the gear-up, the chatter and the dead air trimmed away. No upload, no editor, no waiting on a server. The interesting part isn't just that it works; it's where it works: every step happens on the device in your pocket.

For a decade, "AI" mostly meant "send your data to a giant model in a data centre and wait for an answer." That era is ending. The most exciting shift in technology right now is the rise of small, specialised models that run directly on your own device — your phone, your watch, your camera. This is often called edge AI, and CricCuts is built squarely on it.

🔒 Privacy by default. Your footage never leaves your phone. Nothing is uploaded, stored on a server, or used to train anyone's model.

⚡ Instant & offline. No round-trip to a server means results in seconds, even in airplane mode and even from a basement ground with no signal.

💸 Genuinely free. No cloud GPUs to rent means no bill to pass on. On-device is what lets a tool like this stay free for everyone.

🌱 Lighter on the planet. No data-centre compute, no uploading hundreds of megabytes. The work is tiny and local instead of large and remote.

The old trade-off was that on-device models were too weak to be useful. That's no longer true. Modern phones are remarkably capable, and a thoughtfully chosen small model — built to answer a single, narrow question — can out-perform a giant general model at that one job while running in milliseconds. CricCuts is a working proof of that idea applied to cricket: intelligent results, on hardware you already own, for free.

The future of AI isn't only bigger models in bigger data centres. It's the right-sized model running right where the data is — on the edge.

The whole pipeline on one screen

We'll spend a module (sometimes two) on each stage.

1. Listen. It analyses the soundtrack and isolates the distinctive acoustic signature of bat meeting ball, the cleanest fingerprint of a real shot.
2. Find candidate impacts. Measure loudness over time, set a smart, self-adjusting bar, and pick the spikes that rise above it.
3. Read the context. Around each candidate: was there a build-up before? A reaction after? Plus a small neural voice-activity model that tells real talking from a bat-crack.
4. Score the moment. Combine the evidence into one confidence number, so strong shots rise and noise sinks.
5. Learn from you. You confirm a handful of clips. That re-tunes the whole session to your footage, with no model retraining.
6. (Optional) watch the video. Opt in and the app confirms a body actually moved in your batting area at the moment of impact.
7. Build clips & reel. Pad each impact into a watchable clip, remove duplicates, fit your target length, export a branded MP4.

🔑

One idea governs the whole design: the engine enriches, the app filters. No stage ever throws a moment away — weaker ones just rank lower. That's why you can always overrule CricCuts: everything it found is still there for you to keep or drop.

Quick check: why doesn't CricCuts just use a big AI model to "recognise a cover drive"?

Recognising shot types would need a big labelled dataset and a heavy model. CricCuts only needs to know that a shot happened, and the acoustic fingerprint of bat-on-ball is far cheaper to detect, and more robust, than classifying the stroke.


Module 01 · 3 min · Foundations

How sound becomes numbers

Everything in the audio half of CricCuts is arithmetic on a long list of numbers. Before any of it makes sense, you need to know what those numbers are.

A microphone measures air pressure. As a bat cracks a ball, the air wobbles; the mic turns those wobbles into a voltage; the phone measures that voltage many thousands of times a second. Each measurement is a sample — a single number. String them together and you have the waveform: a "loudness over time" curve sampled finely enough to reconstruct any sound a human can hear.

📐

Why sample so fast? The Nyquist theorem says to capture a frequency f you must sample at least 2f times a second. Human hearing tops out near 20 kHz, so a CD-style rate captures everything we can hear — including the sharp, high-frequency "crack" of a middled shot. CricCuts mixes stereo down to one mono channel (direction doesn't matter for "did a hit happen") and works from there.

Stitching many videos into one timeline

A real session is often two or three separate recordings. Rather than process each in isolation, CricCuts asks you to mark the batting range in each (the actual playing window), then joins just those ranges into a single virtual timeline — one continuous mono track. Every later step reasons in this virtual time; only playback and export ever map a moment back to "which file, which offset." One coordinate system keeps target-length and learning signals consistent across the whole session.
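To make the virtual timeline concrete, here is a minimal Python sketch of the mapping from virtual time back to "which file, which offset." The `Range` and `VirtualTimeline` names, the file names and the numbers are all hypothetical; the course only says that batting ranges are joined into one timeline, not how CricCuts stores them.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    file: str       # hypothetical file name
    start_s: float  # batting range start inside the file
    end_s: float    # batting range end inside the file


class VirtualTimeline:
    """Joins batting ranges end to end into one continuous time axis."""

    def __init__(self, ranges):
        self.ranges = list(ranges)
        self.starts = []  # virtual start time of each range
        t = 0.0
        for r in self.ranges:
            self.starts.append(t)
            t += r.end_s - r.start_s
        self.duration_s = t

    def locate(self, virtual_s):
        """Map a virtual time to (file, offset inside that file)."""
        if not 0.0 <= virtual_s < self.duration_s:
            raise ValueError("time outside the virtual timeline")
        i = bisect_right(self.starts, virtual_s) - 1
        r = self.ranges[i]
        return r.file, r.start_s + (virtual_s - self.starts[i])


timeline = VirtualTimeline([Range("a.mp4", 10.0, 70.0), Range("b.mp4", 5.0, 35.0)])
print(timeline.locate(65.0))  # ('b.mp4', 10.0)
```

Only `locate` knows about files; everything upstream can reason in plain seconds on one axis, which is the point of a single coordinate system.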

If a sound contains energy at 3 kHz, what's the minimum sample rate that can capture it?

Nyquist: at least 2× the highest frequency, so 6 kHz. In practice CricCuts works at a much higher, CD-style rate to comfortably cover the whole audible band with margin.


Module 02 · 5 min · Signal processing

Isolating the impact — band-pass filtering

A net session is acoustic chaos: wind, traffic rumble, distant chatter, other nets, your own breathing. A big part of finding bat-on-ball is to throw away the frequencies it doesn't live in before you even start looking.

Bat-on-ball impact is a short, sharp transient whose energy concentrates in the mid frequencies. Low rumble (wind, footsteps, handling noise) sits below it; a lot of hiss sits above. So CricCuts runs the audio through a band-pass filter — a circuit-in-software that lets a chosen band through and attenuates everything outside it. The steeper the filter's "skirts," the more cleanly the noise above and below is rejected.
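As a rough illustration of a "circuit-in-software," the sketch below builds a single second-order (biquad) band-pass section using the widely published Audio EQ Cookbook formulas. It is an assumption-laden example, not CricCuts' filter: the 500 to 3500 Hz band, the 44.1 kHz rate and every function name are made up for illustration, and one biquad has fairly gentle skirts (steeper roll-off would typically mean cascading several sections).

```python
import math


def bandpass_coeffs(low_hz, high_hz, sample_rate):
    """Biquad band-pass coefficients (Audio EQ Cookbook form, 0 dB peak gain)."""
    centre = math.sqrt(low_hz * high_hz)   # geometric centre of the band
    q = centre / (high_hz - low_hz)        # narrower band means higher Q
    w0 = 2.0 * math.pi * centre / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def bandpass(samples, low_hz, high_hz, sample_rate):
    """Run one second-order band-pass section over a list of samples."""
    (b0, b1, b2), (a1, a2) = bandpass_coeffs(low_hz, high_hz, sample_rate)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1, y2, y1 = x1, x, y1, y
        out.append(y)
    return out


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz, sample_rate, seconds=1.0):
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


RATE = 44_100  # assumed CD-style rate
in_band = rms(bandpass(tone(1300, RATE), 500, 3500, RATE))  # inside the band
rumble = rms(bandpass(tone(60, RATE), 500, 3500, RATE))     # wind-like rumble
print(f"in-band RMS {in_band:.2f}, rumble RMS {rumble:.2f}")
```

The in-band tone passes almost untouched while the 60 Hz rumble comes out far quieter, which is the whole job of this stage.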

The demo below shows the idea: drag the cutoffs and watch the green passband move, and notice how everything outside it gets pushed down. This is a generic, illustrative band-pass — the exact band CricCuts uses for impact detection is tuned separately from the ones it uses for the low-frequency build-up and the voice range.

Live demo (illustrative): band-pass filter frequency response. Sliders: low cutoff (shown at 500 Hz), high cutoff (shown at 3500 Hz).

The x-axis is frequency (log scale, 50 Hz → 20 kHz); the y-axis is how much each frequency survives (0 dB = untouched, lower = attenuated). Grey markers show typical sources. Keeping only the band where impacts live is what stops wind below and hiss above from drowning the signal.

Why does a steeper filter "skirt" help?

A gentle filter lets a lot of nearby noise bleed through. A steeper roll-off acts more like a cliff at the band edges — exactly what keeps wind and hiss out of the signal you're about to search.


Module 03 · 6 min · Detection

Finding the hit — energy, an adaptive bar, and peak-picking

After filtering we have a clean-ish signal. Now: where, in twenty minutes, are the actual impacts? Three steps — measure loudness, set a smart threshold, pick the spikes.

Step 1 — a loudness curve

We slide a small window across the signal and measure how much in-band energy it contains at each moment. The result is a compact loudness curve — a summary of how loud things are over time, computed efficiently in a single pass.
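A loudness curve of this kind can be sketched in a few lines. The version below is one plausible reading, not the app's code: it measures mean-square energy per window in dB, and uses a cumulative sum so the signal is only walked once. The function name, window size and hop are illustrative assumptions.

```python
import math
from itertools import accumulate


def loudness_curve_db(samples, window, hop):
    """Mean-square energy per window, in dB, from one cumulative-sum pass."""
    csum = [0.0] + list(accumulate(s * s for s in samples))
    curve = []
    for start in range(0, len(samples) - window + 1, hop):
        mean_sq = (csum[start + window] - csum[start]) / window
        curve.append(10.0 * math.log10(mean_sq + 1e-12))  # epsilon avoids log(0)
    return curve


quiet, loud = [0.0] * 100, [1.0] * 100
print([round(v) for v in loudness_curve_db(quiet + loud + quiet, window=100, hop=100)])
# [-120, 0, -120]
```

Each output value summarises `window` samples, so twenty minutes of audio shrinks to a short list that later steps can scan cheaply.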

Step 2 — an adaptive threshold

A fixed loudness bar fails instantly: phones auto-adjust their gain, sessions get louder as shots cluster, and every venue is different. Instead CricCuts tracks a rolling baseline of recent loudness and sets the bar a little above it. Using a robust baseline (a typical, middle value rather than a simple average) means a couple of loud spikes don't drag the bar around — so it tracks the genuine background, not the shots themselves.
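Here is a small sketch of a self-adjusting bar, assuming the "typical, middle value" is a rolling median (the course does not name the exact statistic). The function name, the `history` length and the 6 dB margin are placeholders, not the app's settings.

```python
from statistics import median


def adaptive_threshold(curve_db, history, margin_db):
    """Bar for each frame = median of the previous `history` frames + margin."""
    bars = []
    for i in range(len(curve_db)):
        past = curve_db[max(0, i - history):i] or [curve_db[i]]
        bars.append(median(past) + margin_db)
    return bars


curve = [-40] * 10 + [-10] + [-40] * 10   # steady background with one loud spike
bars = adaptive_threshold(curve, history=8, margin_db=6)
print(bars[10], bars[12])  # bar at the spike, and bar just after it
```

The bar two frames after the spike is the same as the bar at the spike: the median ignores the single outlier, where a plain average of the same eight frames would have been pulled upward.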

Step 3 — peak-picking with non-maximum suppression

Every local maximum above the bar is a candidate. But one impact "rings" for a moment and can cross the bar several times. So we apply non-maximum suppression (NMS): accept the strongest peak, then suppress weaker ones too close to it in time, so a single hit registers once rather than five times.
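The peak-picking rule above can be sketched as a greedy pass: sort candidates by strength, keep each one only if nothing already kept sits inside the suppression window. The names and the sample numbers are illustrative, and the app's exact tie-breaking is not described in the course.

```python
def pick_peaks(curve, bars, times_s, nms_window_s):
    """Local maxima above the bar, strongest first, suppressing close neighbours."""
    candidates = [
        i for i in range(1, len(curve) - 1)
        if curve[i] > bars[i] and curve[i] >= curve[i - 1] and curve[i] > curve[i + 1]
    ]
    kept = []
    for i in sorted(candidates, key=lambda i: curve[i], reverse=True):
        if all(abs(times_s[i] - times_s[k]) >= nms_window_s for k in kept):
            kept.append(i)
    return sorted(kept)


curve = [0, 9, 2, 7, 1, 0, 0, 8, 0]
bars = [5] * len(curve)
times = [i * 0.2 for i in range(len(curve))]
print(pick_peaks(curve, bars, times, nms_window_s=0.8))  # [1, 7]
```

The peak at index 3 clears the bar but sits 0.4 s from the stronger peak at index 1, so it is treated as the same hit ringing.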

Play with it. The demo below is an illustrative loudness curve with real impacts buried in noise. Drag the threshold and the NMS window and watch recall (real shots found) trade off against false positives. This tension is the heart of the whole detector.

Live demo (illustrative): adaptive threshold & peak-picking. Sliders: threshold above baseline (shown at +6.0 dB), NMS window (shown at 800 ms). Readouts: real shots found, false positives, recall.

Green dots = true impacts caught · hollow = missed · red = noise picked up by mistake. Lower the bar and you catch everything plus junk; raise it and you get clean picks but miss soft shots. CricCuts deliberately leans toward recall — the app sorts the junk to the bottom rather than risk losing a real shot.

Why use a robust baseline (a typical middle value) rather than a plain average for the bar?

A handful of loud impacts would pull an average upward — raising the bar so the next shot is missed. A robust baseline is unmoved by those outliers, so the threshold tracks the genuine background level.


Module 04 · 5 min · Feature engineering

Reading the context around each candidate

A loud spike isn't proof of a shot — a dropped bat, a gate clang, or the next net's hit all spike too. What separates a real shot is the story around it in time.

Think of a genuine shot as a little three-act play: a build-up before (footsteps, a run-up, the ambient lift), the impact itself, and a reaction after ("shot!", "ohh", a laugh). Each act lives in a different part of the sound and a different slice of time around the moment of impact. CricCuts measures a small set of cheap context cues per candidate and uses them as evidence.

Cue	What it captures
Build-up	The low-frequency lift before a shot — footsteps, a run-up, the ambient rise. A real build-up is consistent and rising; random rumble flickers.
Reaction	A short burst of voice just after the shot. A genuine "shot!" peaks sharply; continuous commentary stays flat.
Surrounding chatter	How much sustained talking surrounds the impact — used to keep chatter-heavy clips out of the way.
Lead-in speech	Whether the clip would open mid-word. A high value = poor viewing, so it's discouraged.
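To show how cheap a cue like this can be, here is one hypothetical way to score "peaks sharply vs stays flat" for the reaction window. The `peakedness` function and its formula (one minus mean over peak) are an assumption for illustration; the course does not give the actual measure.

```python
def peakedness(energy):
    """0 for perfectly flat energy, approaching 1 for a single sharp burst."""
    peak = max(energy, default=0.0)
    if peak <= 0.0:
        return 0.0
    return 1.0 - (sum(energy) / len(energy)) / peak


print(peakedness([1, 1, 1, 1]))  # 0.0   flat commentary
print(peakedness([0, 0, 4, 0]))  # 0.75  one sharp exclamation
```

It is a handful of additions and one division per candidate, and the result can be explained to anyone who asks why a clip scored the way it did.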

🗣️

All of these read voice-range energy — and energy alone cannot tell a shouted word from a bat-crack (both are loud in the voice range). So CricCuts also runs a small neural voice-activity model that answers the one question energy can't: "is this spike actually someone talking, or a hit?" We give it a full teardown in Module 9.

🔑

Why this beats a big neural net here: each cue is a few cheap reads of the waveform — microseconds of work, fully explainable. For dozens of candidate impacts the whole context pass finishes in well under a second on a mid-range phone, and you can always reason about why a clip scored the way it did.

Why does a "reaction" cue help separate real shots from background noise?

A sharp burst of voice just after impact is a strong, cheap signal that something worth keeping just happened — quite different from the flat drone of constant chatter.


Module 05 · 5 min · Scoring logic

Scoring a shot — strict beats forgiving

We now have, per candidate, a few pieces of evidence — an impact strength, a build-up cue, a reaction cue (each between 0 and 1). How do we fuse them into one confidence? The way you combine them encodes a whole philosophy.

A forgiving average vs a strict combine

A plain average is forgiving: a brilliant build-up can paper over a missing reaction. A strict combine (think multiplying the ingredients rather than adding them) is unforgiving — if any ingredient is near zero, the whole score is near zero. For "are all the signs of a real shot present?", strictness is exactly what we want: a moment with no impact isn't a shot, full stop.
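A short sketch makes the difference tangible. Here the strict combine is a geometric mean (multiply, then take the n-th root so the result stays on the same 0 to 1 scale); that choice, and the idea of passing `None` for a cue that is legitimately absent, are assumptions for illustration rather than CricCuts' actual formula.

```python
import math


def plain_average(cues):
    """Forgiving: a strong cue can compensate for a weak one."""
    present = [c for c in cues if c is not None]
    return sum(present) / len(present)


def strict_combine(cues):
    """Strict: geometric mean, so any near-zero cue drags the score to zero.

    Cues given as None (legitimately missing) are left out, not treated as 0.
    """
    present = [c for c in cues if c is not None]
    return math.prod(present) ** (1.0 / len(present))


cues = [0.9, 0.9, 0.0]  # impact, reaction, build-up
print(f"strict {strict_combine(cues):.2f}, average {plain_average(cues):.2f}")
# strict 0.00, average 0.60
```

Leaving a missing cue out, rather than scoring it as zero, is one simple way to avoid punishing a self-feed net that never had a bowler's run-up.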

CricCuts leans on a strict combine for its core confidence, with sensible fallbacks for the cases where one cue is genuinely, legitimately missing (a self-feed net with no bowler run-up, say). Drive it yourself: set the three ingredients and watch how much harsher the strict combine is than a plain average.

Live demo (illustrative): strict combine vs forgiving average. Sliders: impact strength (shown at 0.80), build-up (shown at 0.60), reaction (shown at 0.70). Readouts: strict combine, plain average.

Try this: set impact 0.9, reaction 0.9, build-up 0.0. The plain average stays comfortable, but the strict combine craters toward 0 — because one missing ingredient should make the whole "is-this-a-shot?" answer doubtful. That strictness is the point; CricCuts then handles the genuinely-missing-cue cases separately so real shots aren't punished unfairly.

Impact 0.9, reaction 0.9, build-up 0.0. Why does the strict combine come out near 0?

A strict combine multiplies the ingredients, so any near-zero value drags the result to zero. That's deliberate — and it's exactly why CricCuts has dedicated handling for the cases where a cue is legitimately absent.


Module 06 · 5 min · Active learning

Learning from you — curation in a few taps

No fixed setting is right for every session. So CricCuts asks you to label a handful of carefully chosen clips, then propagates that judgement across the whole session — without retraining any model. This is active learning: the machine picks the most informative questions; you answer a few.

Which clips does it ask about?

Labelling random clips wastes your taps. CricCuts deliberately picks ones that resolve the most uncertainty — for example the most confident pick (a sanity check), a borderline one right on the fence (where your verdict settles the most ambiguity), one from a dense cluster, and an odd-one-out. A few well-chosen taps teach it far more than a dozen random ones.

How your verdict spreads

Each moment can be described by a few of its characteristics, which place it as a point in an abstract "similarity space" — clips that sound alike sit close together. When you mark a clip, your verdict ripples outward to its neighbours by closeness: confirm a great shot and acoustically similar ones get a lift; flag a false alarm and look-alikes get pushed down. One tap can rescue every similar borderline shot, or quash a whole class of background noise, across the session.
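The "ripples outward by closeness" idea can be sketched with a distance-weighted nudge. The Gaussian falloff, the `radius` and `delta` values and the 2-D points below are all illustrative assumptions; the course only says the effect scales with closeness in similarity space.

```python
import math


def propagate(points, confidences, labelled, delta, radius):
    """Shift each confidence by delta * exp(-d^2 / (2 * radius^2)), clamped to [0, 1].

    `labelled` is the index of the clip the user just marked; use a positive
    delta for a boost and a negative delta for a suppress.
    """
    out = []
    for p, c in zip(points, confidences):
        d2 = sum((a - b) ** 2 for a, b in zip(p, points[labelled]))
        weight = math.exp(-d2 / (2.0 * radius ** 2))
        out.append(min(1.0, max(0.0, c + delta * weight)))
    return out


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]   # two look-alikes and one outlier
boosted = propagate(points, [0.5, 0.5, 0.5], labelled=0, delta=0.3, radius=0.5)
print([round(c, 3) for c in boosted])
```

The labelled clip gets the full boost, its near neighbour gets almost all of it, and the distant moment is left essentially where it was. No model is retrained; only the scores move.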

Below is a flattened picture of that idea. Each dot is a detected moment; nearer dots are more alike. Pick Boost then click a dot to lift its neighbours, or Suppress then a dot to push its neighbours down. The effect fades with distance — far-away, dissimilar moments are barely touched.

Live demo (illustrative): curation in similarity space. Pick a mode, then click any dot.

Brightness = current confidence. One label radiates a boost (or penalty) to nearby, similar moments — so a single tap can lift every look-alike shot, or quash a whole class of background noise, at once.

You mark one clip as a great shot. What happens to a far-away, acoustically different moment?

Boosts and penalties scale with closeness in similarity space. A distant, dissimilar moment is essentially untouched; only look-alikes near the labelled point feel the effect.


Module 07 · 5 min · Computer vision

Seeing the shot — the optional video pass

Audio alone is fast and usually enough. But when you want extra certainty, CricCuts can look at the frames and confirm a body actually moved in your batting area at the moment of impact. This pass is opt-in, because it's heavier — and because which visual approach works depends on the camera angle.

Distance changes everything

You tell the app where the camera is, and that picks the strategy. From the bowler's end or side-on, the whole body is in frame, so a pose model can help. From the batter's end, the camera is close and only feet, pads and the bat's downswing show — so pose is off the table and CricCuts leans on motion instead.

Camera	What's in frame	Video signal used
Bowler's end / Side-on	full body, far	Motion + presence; optional deep pose (needs the body visible)
Batter's end	close — feet, pads, bat downswing	Motion + presence only (pose is off — the upper body is out of frame)

Motion plus a presence check

You draw a box around the batter once. For each impact, CricCuts samples a few frames around the moment, measures how much the pixels inside the box change (a shot is a burst of change), and separately checks that the framing still matches your reference — that the camera is genuinely still on the batter. Motion says "something moved"; presence says "and it was still the batter." Both must agree, which is what stops a whole-frame camera pan from faking a shot.
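Here is a toy version of "motion plus presence," using tiny grayscale frames stored as nested lists. It is only a sketch under stated assumptions: motion is the mean pixel change inside the box, and presence is approximated by how far the whole frame has drifted from a reference frame. The thresholds, function names and the drift-based presence test are hypothetical; the course does not say how CricCuts checks framing.

```python
def mean_abs_diff(a, b, box=None):
    """Mean absolute pixel difference between two frames, optionally inside a box.

    box is (x0, y0, x1, y1) with x1 and y1 exclusive.
    """
    rows = range(len(a)) if box is None else range(box[1], box[3])
    cols = range(len(a[0])) if box is None else range(box[0], box[2])
    total = sum(abs(a[y][x] - b[y][x]) for y in rows for x in cols)
    return total / (len(rows) * len(cols))


def batter_moved(reference, before, after, box, motion_min=0.1, drift_max=0.2):
    """True only if the box changed AND the overall framing still matches."""
    motion = mean_abs_diff(before, after, box)   # change inside the batter box
    drift = mean_abs_diff(reference, after)      # whole-frame change vs reference
    return motion >= motion_min and drift <= drift_max


reference = [[0.0] * 6 for _ in range(6)]
swing = [row[:] for row in reference]
for y in range(2, 4):
    for x in range(2, 4):
        swing[y][x] = 1.0                        # movement only inside the box
pan = [[1.0] * 6 for _ in range(6)]              # the whole frame changed

box = (2, 2, 4, 4)
print(batter_moved(reference, reference, swing, box))  # True
print(batter_moved(reference, reference, pan, box))    # False
```

The pan produces just as much motion inside the box as the swing does, but it fails the framing test, so it cannot fake a shot.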

The deep path: pose landmarks

If you opt into deep verification at a far angle, a pose model finds the body's key joints in each frame, and CricCuts derives swing-like cues from the wrists and shoulders. We meet that model — MediaPipe Pose — properly in the next module.

Why combine motion with a presence check instead of using motion alone?

Motion alone confirms something moved — including the whole frame during a pan. The presence check ensures the motion happened while the framing still matched the batter, so only real in-box movement counts.


Module 08 · 4 min · Decision logic

Fusing every signal into one decision

Audio confidence, optional video confidence, spoken keywords — each moment collects its evidence, and one final number decides whether it gets auto-ticked onto your reel.

The fusion is built on a few deliberate principles, without giving away the dials:

Audio leads. The acoustic evidence is the backbone; a strong, clean audio shot can stand on its own.
Video can pull a score down, not just up. Because the video signal measures "did the batter actually move," a loud audio spike with a motionless batter is demoted below the auto-select line — a likely false positive caught.
But video can't erase a strong audio event. One occluded or missed frame can soften a great audio shot, never delete it. Evidence degrades gracefully; it doesn't veto catastrophically.
Keywords nudge. If a word from your commentary list ("four!", "out!", "shot!") lands near a moment, it gets a small bump — that's the speech model's job, next module.
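The four principles can be sketched as a toy fusion function. The course deliberately keeps the real dials private, so every number here (the video weight, the keyword bump, the 0.6 auto-select line) is invented purely to show the shape of the logic: audio is the base, video shifts it by a bounded amount in either direction, and a keyword adds a small bump.

```python
AUTO_SELECT_LINE = 0.6  # hypothetical threshold for the default tick


def fuse(audio, video=None, keyword_hit=False, video_weight=0.4, keyword_bump=0.05):
    """Hypothetical fusion: audio is the backbone, other evidence adjusts it."""
    score = audio
    if video is not None:
        # video is in [0, 1] with 0.5 neutral, so the shift is bounded
        # to +/- video_weight / 2 and can never wipe out strong audio
        score += video_weight * (video - 0.5)
    if keyword_hit:
        score += keyword_bump
    return min(1.0, max(0.0, score))


print(fuse(0.7) >= AUTO_SELECT_LINE)             # True: audio alone is enough
print(fuse(0.7, video=0.0) >= AUTO_SELECT_LINE)  # False: motionless batter demotes it
print(fuse(0.9, video=0.0) >= AUTO_SELECT_LINE)  # True: softened, not erased
```

Because the video term is bounded, a single occluded frame can lower a great shot's score but cannot take it to zero, and nothing here ever removes a moment from the list.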

🔑

Auto-select, not auto-delete. Fusion only decides the default tick. Every moment still exists; you can flip any of them. The engine enriches; you decide.

A loud spike, but the batter is motionless on video. What happens?

Video can demote. A motionless batter drops the fused score under the auto-select line, so the likely false positive isn't pre-ticked — yet the enrich-not-filter rule means the moment is never destroyed.


Module 09 · 10 min · The ML models

The models up close — pose, speech & voice activity

Most of CricCuts is classical signal processing — deliberately, because it's fast, explainable and needs no training data. But three genuine, open-source machine-learning models ride along: a body model, a words model, and a voice model. Here's how each is built, what it can do (including things CricCuts doesn't use), and where it breaks.

① MediaPipe Pose Landmarker (the body model)

Google's open-source MediaPipe Pose Landmarker — a small neural network that finds the body's key landmarks (eyes, shoulders, elbows, wrists, hips, knees, ankles…) in an image.

How it works

Two-stage design. A lightweight detector first finds the person's bounding box; a landmark network then locates the joints inside it. Splitting the job keeps each network small and fast.
Convolutional backbone trained on a large, diverse human-pose dataset. For each landmark it outputs a position plus a visibility score — how confident it is the joint is actually in frame.
On-device. It runs locally, costing a few seconds per event on a phone — which is why CricCuts keeps it opt-in rather than always-on.

What it can do — including things CricCuts doesn't use

A full 3-D skeleton (a depth estimate per joint), multi-person tracking, segmentation masks, and real-time tracking on mid-range phones.
It's the backbone for fitness rep-counters, gesture interfaces, AR avatars and physiotherapy tools — anything that needs a cheap skeleton. CricCuts taps only a sliver: a few wrist/shoulder cues and a tidy bounding box.

Limitations (why it's optional, not core)

Needs the body in frame. At the batter's end only feet/pads show, so pose can't help there.
Small, distant or occluded figures (the norm in nets) give noisy landmarks; low-visibility joints are discarded to avoid garbage.
No cricket knowledge. It knows "wrist," not "cover drive." All cricket meaning is added by CricCuts' own logic on top.

② Vosk speech recognition (the words model)

Vosk — a compact, offline Automatic Speech Recognition (ASR) toolkit. CricCuts uses a small English model to power on-demand commentary-keyword tagging: transcribing the words around a shot so it can tag the clip "boundary," "wicket," and so on.

How it works

Kaldi lineage. Vosk wraps the well-known Kaldi speech toolkit. It's a classic ASR stack: an acoustic model (audio → speech-sound probabilities), a pronunciation lexicon, and a language model (which word sequences are plausible), decoded together to produce text.
Compact & offline. The "small" models trade a little accuracy for size and speed, so the whole thing runs on a phone with no network.
Keyword-focused. Rather than transcribe everything, CricCuts points Vosk at the active keyword list. Listening for a short, known set of words is faster and more accurate than open dictation.

What it can do — including unused capabilities

Full continuous dictation, word-level timestamps, alternative hypotheses, streaming partial results, and dozens of languages via swappable models.
CricCuts uses only word-spotting plus timestamps for tags today; the broader dictation power sits in reserve (a full, searchable match transcript is a natural future feature).

Limitations

Noisy outdoor audio — wind, crowd, distance — degrades recognition; the keyword focus helps a lot.
Accent- and vocabulary-bound to its training; a small model is the accuracy floor.
Compute cost — transcribing a whole match on-device is slow, which is exactly why tagging is opt-in and scoped to the clips you pick, not run during the first pass.

③ Silero VAD (the voice model), via ONNX Runtime

Silero VAD — a tiny, open-source neural Voice-Activity-Detection model, run through ONNX Runtime (Microsoft's cross-platform engine for running models in the open ONNX format). Where Vosk asks the expensive question "which words?", VAD asks the cheap one: "is someone speaking right now — yes or no?" Because that question is so much smaller, the model is small and fast enough to run over the whole recording in the very first pass, producing a complete speech timeline.

🔑

The one problem only this model can solve. Every energy-based speech cue (Module 4) just measures loudness in the voice range — and a shouted "shot!" and a bat-on-ball crack are both loud there. Energy genuinely cannot tell them apart. A model trained on the structure of the human voice (its pitch and harmonics) can: at a real crack its speech probability is near zero; during talking it's near one.

How it works

A tiny recurrent network. It takes a short audio chunk and emits one number — speech probability — while carrying a memory of recent context forward. So each decision uses what came just before, not only the instant: it "knows" it's mid-sentence.
Runs through ONNX Runtime on the CPU. Shipping in the open ONNX format means the same model file runs across platforms through one well-optimised engine.

Interesting ideas it shows off

VAD vs ASR. Recognising words means searching a huge space of word sequences — expensive. Detecting voice is a tiny yes/no — cheap enough to be always-on. Picking the smallest model that answers your actual question is the whole game.
Smoothing jitter into segments. A raw per-frame probability is jittery, so it's smoothed — using a little hysteresis (a "dead band," the same trick a thermostat uses) — into clean speech segments, so one sentence's natural pauses don't shatter it into fragments.
Language-agnostic. It detects voice, not English — so it works on any commentary, any accent, any language, with no extra model.
Fail-safe. Any hiccup just yields an empty timeline, and the pipeline quietly falls back to its energy-only behaviour. The model can only help, never break analysis.
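The thermostat-style smoothing described above can be sketched with two thresholds: speech starts only when the probability climbs above a high bar, and ends only when it falls below a lower one. The 0.6 and 0.4 bars, the frame length and the function name are illustrative assumptions, not Silero VAD's or CricCuts' actual settings.

```python
def speech_segments(probs, frame_s, enter=0.6, leave=0.4):
    """Turn per-frame speech probabilities into (start_s, end_s) segments."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= enter:
            start = i                                    # begin above the high bar
        elif start is not None and p < leave:
            segments.append((start * frame_s, i * frame_s))
            start = None                                 # end below the low bar
    if start is not None:                                # speech runs to the end
        segments.append((start * frame_s, len(probs) * frame_s))
    return segments


probs = [0.1, 0.7, 0.5, 0.7, 0.2, 0.1]   # a brief dip to 0.5 mid-sentence
print(speech_segments(probs, frame_s=0.5))  # [(0.5, 2.0)]
```

With a single threshold at 0.6 the dip to 0.5 would have split this into two fragments; the dead band between the two bars keeps it as one sentence.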

Play with it below. The scene has a stretch of conversation, a bat-crack, and an exclamation. The loud talk and the crack both trip the energy detector — but watch the VAD curve tell them apart. Drag the threshold to see where it decides "speech."

Live demo (illustrative): speech vs a bat-crack, what the VAD hears. Slider: speech threshold (shown at 0.55). Readouts: loud-talk spike, bat-crack spike.

Top track = impact energy (what the peak-picker sees) — it spikes on the loud talking and the crack. Bottom track = the VAD speech probability; shaded bands are detected speech. The crack lands in a gap (probability ≈ 0) → kept as a likely hit; the talk spike sits inside a speech run → recognised as talking, not a shot.

Limitations

It hears voice, not cricket. Loud singing, a PA announcement or a crowd chant can read as "speech" — it's a voice detector, not a commentary understander.
Very short shouts can fall below its minimum and be dropped as blips.
Runtime weight. The runtime adds some native code per device architecture, so the app ships only the slice each phone needs.

🧠

The deliberate philosophy: heavy models are used surgically — pose only when you ask, word-recognition only on the clips you pick. The one model cheap enough to run always-on (VAD) does so precisely because it answers the smallest possible question. The rest of the always-on path is plain, transparent signal processing. That's what keeps the default analysis fast, fully offline, on a phone that might be five years old.

A loud shout and a bat-crack are both loud in the voice range. How does the VAD model tell them apart when energy can't?

Loudness and band don't separate them — both are loud and in-range. The model recognises the shape of voiced speech, which a percussive transient simply doesn't have. That's the one question energy heuristics can't answer.

Why point Vosk at a keyword list instead of full dictation?

Vosk is perfectly capable of full dictation — but focusing it on the active keyword list narrows the search massively, making tagging both faster and more robust to net-session noise. The broader capability is held in reserve.


Module 10 · 4 min · Synthesis

Noise reduction, everywhere

"Noise reduction" in CricCuts isn't one filter — it's defence in depth. A net session is one of the most hostile audio environments imaginable: multiple nets, wind, traffic, constant chatter. Several independent layers each fight a different part of it.

Layer	Removes
Band-pass filter	wind, rumble and hiss — physically attenuated before detection (Module 2).
Adaptive baseline threshold	slow level drift and the ambient floor — steady noise never trips the bar, only spikes above it do (Module 3).
NMS peak-picking	an impact's "ringing" duplicates — one hit registers once (Module 3).
Reaction / peakedness cues	continuous commentary — flat talking scores low where a sharp burst scores high (Module 4).
Neural VAD model	vocal-only spikes & pure conversation — a trained voice detector catches what energy alone can't (Module 9 ③).
Lead-in discouragement	clips that would open mid-word (Module 4).
Curation similarity	a whole class of recurring noise — mark one, every look-alike is suppressed (Module 6).
Presence check (video)	camera pans / framing loss — motion only counts while the batter is still framed (Module 7).

🛡️

No single layer is perfect — together they're robust. Wind beats the threshold? The band-pass already killed it. Chatter sneaks through the band-pass? The reaction cue and VAD catch it. The next net spikes? Curation similarity suppresses it. Each layer covers another's blind spot.

Constant crowd chatter survives the band-pass (it's in the voice range). What stops it being scored as shots?

Voice-range chatter does pass the filter — that's why the reaction/peakedness cue exists (continuous talking is flat, not bursty) and why the VAD model is there to recognise sustained speech for what it is.


Module 11 · 5 min · From signals to a shareable reel

Clips → reel → a lighter planet

A scored list of impacts isn't a highlights reel. The last mile turns moments into watchable clips, removes duplicates, fits your chosen length, and exports a branded MP4 — and, as a bonus, saves a measurable amount of carbon.

From an instant to a clip

An impact is a point in time; a clip is a window. CricCuts pads each impact bias-aware — a little more time after the hit (to catch the ball flight and the shout) than before — and keeps that feel even when a clip is stretched to a minimum length, so the impact stays where it belongs rather than drifting to the centre.
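Bias-aware padding is easy to picture in code. In this sketch the one second before and two seconds after are made-up defaults; what matters is that stretching to a minimum length scales both sides by the same factor, so the impact keeps its relative position in the clip.

```python
def clip_window(impact_s, pre_s=1.0, post_s=2.0, min_len_s=0.0):
    """Pad an impact into (start_s, end_s), keeping the pre/post ratio when stretched."""
    length = pre_s + post_s
    if length < min_len_s:
        scale = min_len_s / length
        pre_s, post_s = pre_s * scale, post_s * scale
    return max(0.0, impact_s - pre_s), impact_s + post_s


print(clip_window(10.0))                  # (9.0, 12.0)
print(clip_window(10.0, min_len_s=6.0))   # (8.0, 14.0)
```

In both clips the impact lands one third of the way in. Centring the stretched clip instead would have given (7.0, 13.0) and moved the hit to the middle.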

Clusters — one moment, one row

Overlapping nets and impact-ringing can produce several events for one physical shot. CricCuts groups nearby events into a cluster, picks the strongest as the anchor, and shows one merged row — so the timeline reflects shots, not duplicates.
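Grouping nearby events can be sketched as a single pass over the events in time order. The one-second gap and the `(time_s, score)` tuple layout are assumptions for illustration; the course does not state how close events must be to merge.

```python
def cluster_events(events, gap_s):
    """Group (time_s, score) events closer than gap_s; return (anchor, members) pairs."""
    clusters = []
    for event in sorted(events):
        if clusters and event[0] - clusters[-1][-1][0] <= gap_s:
            clusters[-1].append(event)    # close to the previous event: same shot
        else:
            clusters.append([event])      # far enough away: a new shot
    return [(max(c, key=lambda e: e[1]), c) for c in clusters]


events = [(10.0, 0.6), (10.4, 0.9), (25.0, 0.7)]
for anchor, members in cluster_events(events, gap_s=1.0):
    print(anchor, len(members))
```

The two events around the ten-second mark become one row anchored on the stronger one, and the members are kept rather than discarded, in line with the enrich-not-filter rule.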

Fitting your target length

Ask for a 60-second reel and CricCuts packs the best clusters until the budget is spent, then trims to hit your target exactly. You choose the feel:

Equal — every clip shares one ceiling; the longest lose the most.
Intelligent (default) — trims the biggest clips first and protects short shots, so a crisp 2-second shot is never sliced to fit a sprawling one.

The same logic powers the preview and the final render — what you preview is what you export.
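One way to picture length fitting is a shared ceiling with a protective floor: lower the ceiling until the total fits, so the longest clips give up time first, and never let the ceiling drop below a watchable minimum. This is a sketch of that general idea only. The function name, the 2-second floor and the bisection search are assumptions, and the course does not spell out exactly how the two trim modes differ internally.

```python
def trim_to_target(lengths, target_s, floor_s=2.0):
    """Shorten the longest clips first until the total fits target_s.

    Clips at or below floor_s are never trimmed. If the target cannot be met
    without going below the floor, the result stays over budget and the
    caller would need to drop clips instead.
    """
    if sum(lengths) <= target_s:
        return list(lengths)
    lo, hi = floor_s, max(max(lengths), floor_s)
    for _ in range(60):                       # bisection on the shared ceiling
        mid = (lo + hi) / 2.0
        if sum(min(l, mid) for l in lengths) > target_s:
            hi = mid
        else:
            lo = mid
    return [min(l, lo) for l in lengths]


print(trim_to_target([2.0, 4.0, 10.0], target_s=10.0))
```

Here the 10-second clip absorbs the whole cut, and the crisp 2-second shot is untouched.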

Export — branded, upright, intact

The reel is encoded with Google's hardware-accelerated Media3 Transformer (no heavyweight transcoder), stitching the clips, stamping the CricCuts watermark (and optional QR / handle), and rotating portrait phone video so it comes out upright with the branding in the right corner. Capture date and location metadata are carried onto the output so your highlights stay organised in the gallery.

The carbon bonus

Because every frame is processed on your phone, you never upload a multi-hundred-megabyte raw session to a cloud editor. CricCuts estimates the emissions you avoid — the network transfer and data-centre compute you didn't spend, minus the tiny on-device cost — and shows the grams of CO₂ saved on every export. A small, quantified reminder that edge AI is lighter on the planet, printed right on the share caption.

"Intelligent" trim mode protects short shots by…

Intelligent trim takes time from the largest clips first, and a floor stops any clip being chopped below a watchable minimum — so short, crisp shots survive intact. "Equal" is the make-them-all-the-same alternative.

🎓 Course complete

You've followed a cricket highlight from air-pressure wobbles to a shareable reel: sampling, band-pass filtering, loudness and adaptive thresholding, peak-picking, temporal context, strict scoring, active-learning curation, pose & motion video confirmation, multi-signal fusion, layered noise reduction, and length-aware export — every bit of it running offline on the phone in your pocket.

The throughline: prefer the simplest signal that does the job, keep every decision explainable, and never throw the user's data away. That's how a phone watches cricket and cuts the highlights itself.

🏏

Want to see it on your own footage? Grab the app and run a net session through it — free, on your phone, no cloud. Everything in this course happens in the few seconds after you tap analyse.

