How a phone watches cricket and cuts the highlights itself
A guided, interactive tour through the signal processing and small on-device AI models that turn a raw
net or match recording into a clean reel — entirely on your phone, no cloud. Built for the curious, with demos you can
poke and a quick quiz at the end of each module. No maths degree required.
You hand CricCuts one or more phone videos of a session. Minutes later you get a tidy reel of your
actual shots — the bat-on-ball moments — with the gear-up, the chatter and the dead air trimmed away. No upload, no
editor, no waiting on a server. The interesting part isn't just that it works; it's where it
works: every step happens on the device in your pocket.
For a decade, "AI" mostly meant "send your data to a giant model in a data centre and wait for an answer." That
era is ending. The most exciting shift in technology right now is the rise of small, specialised models that
run directly on your own device — your phone, your watch, your camera. This is often called edge
AI, and CricCuts is built squarely on it.
🔒 Privacy by default. Your footage never leaves your phone. Nothing is uploaded, stored on a server, or used to train anyone's model.
⚡ Instant & offline. No round-trip to a server means results in seconds, even in airplane mode, even from a basement ground with no signal.
💸 Genuinely free. No cloud GPUs to rent means no bill to pass on. On-device is what lets a tool like this stay free for everyone.
🌱 Lighter on the planet. No data-centre compute, no uploading hundreds of megabytes. The work is tiny and local instead of large and remote.
The old trade-off was that on-device models were too weak to be useful. That's no longer true. Modern phones are
remarkably capable, and a thoughtfully chosen small model — built to answer a single, narrow question — can
out-perform a giant general model at that one job while running in milliseconds. CricCuts is a working proof of that
idea applied to cricket: intelligent results, on hardware you already own, for free.
The future of AI isn't only bigger models in bigger data centres. It's the right-sized model running
right where the data is — on the edge.
The whole pipeline on one screen
Tap any stage to highlight it. We'll spend a module (sometimes two) on each.
1. Listen. It analyses the soundtrack and isolates the distinctive acoustic signature of bat meeting ball, the cleanest fingerprint of a real shot.
2. Find candidate impacts. Measure loudness over time, set a smart, self-adjusting bar, and pick the spikes that rise above it.
3. Read the context. Around each candidate: was there a build-up before? A reaction after? Plus a small neural voice-activity model that tells real talking from a bat-crack.
4. Score the moment. Combine the evidence into one confidence number, so strong shots rise and noise sinks.
5. Learn from you. You confirm a handful of clips. That re-tunes the whole session to your footage, with no model retraining.
6. (Optional) watch the video. Opt in and the app confirms a body actually moved in your batting area at the moment of impact.
7. Build clips & reel. Pad each impact into a watchable clip, remove duplicates, fit your target length, export a branded MP4.
🔑
One idea governs the whole design:
the engine enriches, the app filters. No stage ever throws a moment away — weaker ones just rank lower. That's
why you can always overrule CricCuts: everything it found is still there for you to keep or drop.
Quick check: why doesn't CricCuts just use a big AI model to "recognise a cover drive"?
Recognising shot types would need a big labelled dataset and a heavy model. CricCuts only needs to know that a shot happened, and the acoustic fingerprint of bat-on-ball is far cheaper and more robust to detect than classifying the stroke.
Everything in the audio half of CricCuts is arithmetic on a long list of numbers. Before any of it
makes sense, you need to know what those numbers are.
A microphone measures air pressure. As a bat cracks a ball, the air wobbles; the mic turns those wobbles into a
voltage; the phone measures that voltage many thousands of times a second. Each measurement is a sample, a
single number. String them together and you have the waveform: a pressure-over-time curve sampled
finely enough to reconstruct any sound a human can hear.
📐
Why sample so fast? The
Nyquist theorem says that to capture a frequency f you must sample at least 2f times a second. Human
hearing tops out near 20 kHz, so a CD-style rate (44.1 kHz, a little over twice that) captures everything we can hear, including the sharp,
high-frequency "crack" of a middled shot. CricCuts mixes stereo down to one mono channel (direction doesn't
matter for "did a hit happen") and works from there.
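To make those two ideas concrete, here is a minimal sketch in Python. The function names are made up for illustration and are not CricCuts internals; the only facts used are the 2f rule and a plain per-sample average of the two channels.

```python
def min_sample_rate(highest_hz: float) -> float:
    """Nyquist bound: sample at least twice the highest frequency you want to keep."""
    return 2.0 * highest_hz


def mix_to_mono(left: list[float], right: list[float]) -> list[float]:
    """Average the two channels sample by sample (direction is thrown away)."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


print(min_sample_rate(3_000))    # 6000.0
print(min_sample_rate(20_000))   # 40000.0, which a 44,100 Hz rate clears with margin
print(mix_to_mono([0.25, -0.5], [0.75, 0.0]))   # [0.5, -0.25]
```

Averaging (rather than summing) keeps the mono track inside the same amplitude range as the original channels.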
Stitching many videos into one timeline
A real session is often two or three separate recordings. Rather than process each in isolation, CricCuts asks
you to mark the batting range in each (the actual playing window), then joins just those ranges into a single
virtual timeline — one continuous mono track. Every later step reasons in this virtual time; only
playback and export ever map a moment back to "which file, which offset." One coordinate system keeps target-length
and learning signals consistent across the whole session.
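One way to picture the virtual timeline is as a list of ranges laid end to end, plus a lookup that maps a virtual moment back to its file. The sketch below is an assumption about how such a mapping could look; `Range`, `build_offsets` and `to_source` are hypothetical names, as are the file names.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Range:
    file: str        # hypothetical source file name
    start_s: float   # batting range start inside that file, in seconds
    end_s: float     # batting range end inside that file, in seconds


def build_offsets(ranges: list[Range]) -> list[float]:
    """Virtual start time of each range when they are joined end to end."""
    offsets, total = [], 0.0
    for r in ranges:
        offsets.append(total)
        total += r.end_s - r.start_s
    return offsets


def to_source(t_virtual: float, ranges: list[Range], offsets: list[float]) -> tuple[str, float]:
    """Map a virtual-timeline moment back to (file, offset inside that file)."""
    i = max(0, bisect_right(offsets, t_virtual) - 1)
    return ranges[i].file, ranges[i].start_s + (t_virtual - offsets[i])


ranges = [Range("a.mp4", 30.0, 90.0), Range("b.mp4", 10.0, 40.0)]
offsets = build_offsets(ranges)            # [0.0, 60.0]
print(to_source(75.0, ranges, offsets))    # ('b.mp4', 25.0)
```

Detection, scoring and curation can then work purely in virtual seconds, and only playback or export needs to call the lookup.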
If a sound contains energy at 3 kHz, what's the minimum sample rate that can capture it?
Nyquist: at least 2× the highest frequency, so 6 kHz. In practice CricCuts works at a much higher, CD-style rate to comfortably cover the whole audible band with margin.
A net session is acoustic chaos: wind, traffic rumble, distant chatter, other nets, your own
breathing. A big part of finding bat-on-ball is to throw away the frequencies it doesn't live in
before you even start looking.
Bat-on-ball impact is a short, sharp transient whose energy concentrates in the mid frequencies.
Low rumble (wind, footsteps, handling noise) sits below it; a lot of hiss sits above. So CricCuts runs the audio
through a band-pass filter — a circuit-in-software that lets a chosen band through and attenuates
everything outside it. The steeper the filter's "skirts," the more cleanly the noise above and below is rejected.
The demo below shows the idea: drag the cutoffs and watch the green passband move, and notice how everything
outside it gets pushed down. This is a generic, illustrative band-pass — the exact band CricCuts uses for impact
detection is tuned separately from the ones it uses for the low-frequency build-up and the voice range.
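For the curious, a band-pass can be written in a handful of lines. This is a minimal sketch of a single second-order stage using the widely published Audio EQ Cookbook coefficients; the 3 kHz centre, the Q of 1 and the 44,100 Hz rate are placeholders chosen for the example, not the band CricCuts actually uses.

```python
import math


def bandpass_biquad(samples, fs, centre_hz, q=1.0):
    """Second-order band-pass (Audio EQ Cookbook form, 0 dB gain at the centre)."""
    w0 = 2.0 * math.pi * centre_hz / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0                  # b1 is zero for a band-pass
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def tone(freq_hz, fs, n):
    return [math.sin(2.0 * math.pi * freq_hz * i / fs) for i in range(n)]


fs = 44_100
in_band = bandpass_biquad(tone(3_000, fs, fs), fs, centre_hz=3_000)
rumble = bandpass_biquad(tone(100, fs, fs), fs, centre_hz=3_000)
# Skip the first half so the filter's start-up transient has died away.
print(round(rms(in_band[fs // 2:]), 3), round(rms(rumble[fs // 2:]), 3))
```

The in-band tone comes through almost untouched while the 100 Hz rumble is heavily attenuated. One stage like this has gentle skirts; running the signal through several stages in a row is a common way to steepen them.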
Live demo (illustrative): band-pass filter frequency response.
The x-axis is frequency (log scale, 50 Hz → 20 kHz); the y-axis is how much each
frequency survives (0 dB = untouched, lower = attenuated). Grey markers show typical sources. Keeping only
the band where impacts live is what stops wind below and hiss above from drowning the signal.
Why does a steeper filter "skirt" help?
A gentle filter lets a lot of nearby noise bleed through. A steeper roll-off acts more like a cliff at the band edges — exactly what keeps wind and hiss out of the signal you're about to search.
Finding the hit — energy, an adaptive bar, and peak-picking
After filtering we have a clean-ish signal. Now: where, in twenty minutes, are the actual impacts?
Three steps — measure loudness, set a smart threshold, pick the spikes.
Step 1 — a loudness curve
We slide a small window across the signal and measure how much in-band energy it contains at each moment. The
result is a compact loudness curve — a summary of how loud things are over time, computed
efficiently in a single pass.
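A sliding-window energy measure like that could be sketched as follows. The window and hop sizes are arbitrary example values, and mean squared amplitude is one reasonable stand-in for "loudness"; the running sum is what makes it a single pass.

```python
def loudness_curve(samples: list[float], window: int, hop: int) -> list[float]:
    """Mean squared amplitude in a sliding window, updated incrementally."""
    if not 0 < hop <= window:
        raise ValueError("expected 0 < hop <= window")
    if len(samples) < window:
        return []
    energy = sum(x * x for x in samples[:window])
    curve = [energy / window]
    for start in range(hop, len(samples) - window + 1, hop):
        # Slide by `hop`: drop the samples that left, add the ones that entered.
        for x in samples[start - hop:start]:
            energy -= x * x
        for x in samples[start + window - hop:start + window]:
            energy += x * x
        curve.append(max(energy, 0.0) / window)
    return curve


quiet, crack = [0.01] * 8, [0.9, -0.8, 0.5, -0.3]
curve = loudness_curve(quiet + crack + quiet, window=4, hop=2)
print(curve.index(max(curve)))   # 4, the window that sits exactly on the crack
```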
Step 2 — an adaptive threshold
A fixed loudness bar fails instantly: phones auto-adjust their gain, sessions get louder as shots cluster, and
every venue is different. Instead CricCuts tracks a rolling baseline of recent loudness and sets the
bar a little above it. Using a robust baseline (a typical, middle value rather than a simple average) means a couple
of loud spikes don't drag the bar around — so it tracks the genuine background, not the shots themselves.
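As an illustration, a rolling median makes a simple robust baseline. In this sketch the history length and the multiplicative margin are invented values, and using a median at all is an assumption about what "a typical, middle value" means here.

```python
from statistics import median


def adaptive_threshold(curve: list[float], history: int, margin: float) -> list[float]:
    """Bar at each frame = margin x median of the preceding `history` frames."""
    bar = []
    for i in range(len(curve)):
        recent = curve[max(0, i - history):i] or [curve[i]]
        bar.append(margin * median(recent))
    return bar


curve = [1.0, 1.1, 0.9, 1.0, 9.0, 1.0, 1.1, 8.5, 1.0]
bar = adaptive_threshold(curve, history=5, margin=3.0)
hits = [i for i, (v, b) in enumerate(zip(curve, bar)) if v > b]
print(hits)   # [4, 7]
```

Notice that the spike at frame 4 sits inside the history window when frame 7 is judged, yet the bar barely moves, because one outlier cannot shift a median.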
Step 3 — peak-picking with non-maximum suppression
Every local maximum above the bar is a candidate. But one impact "rings" for a moment and can cross the bar
several times. So we apply non-maximum suppression (NMS): accept the strongest peak, then suppress
weaker ones too close to it in time, so a single hit registers once rather than five times.
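The steps above can be sketched as a greedy pass over the candidates, strongest first. The `min_gap` of 4 frames is an arbitrary example value, not a CricCuts setting.

```python
def pick_peaks(curve: list[float], bar: list[float], min_gap: int) -> list[int]:
    """Greedy non-maximum suppression over local maxima that clear the bar."""
    candidates = [
        i for i in range(1, len(curve) - 1)
        if curve[i] > bar[i] and curve[i - 1] < curve[i] >= curve[i + 1]
    ]
    kept: list[int] = []
    # Strongest first; a candidate survives only if no kept peak is within min_gap.
    for i in sorted(candidates, key=lambda k: curve[k], reverse=True):
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(i)
    return sorted(kept)


curve = [0, 1, 9, 3, 6, 2, 1, 0, 7, 1, 0]
bar = [2.0] * len(curve)
peaks = pick_peaks(curve, bar, min_gap=4)
print(peaks)   # [2, 8]
```

The bump at frame 4 clears the bar, but it sits two frames from the stronger peak at frame 2, so it is treated as ringing from the same hit.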
Play with it. The demo below is an illustrative loudness curve with real impacts buried in noise. Drag the
threshold and the NMS window and watch recall (real shots found) trade off against
false positives. This tension is the heart of the whole detector.
Live demo (illustrative): adaptive threshold and peak-picking. Readouts: real shots found, false positives, recall.
Green dots = true impacts caught · hollow = missed · red = noise picked up by mistake. Lower the
bar and you catch everything plus junk; raise it and you get clean picks but miss soft shots. CricCuts
deliberately leans toward recall — the app sorts the junk to the bottom rather than risk losing a real
shot.
Why use a robust baseline (a typical middle value) rather than a plain average for the bar?
A handful of loud impacts would pull an average upward — raising the bar so the next shot is missed. A robust baseline is unmoved by those outliers, so the threshold tracks the genuine background level.
A loud spike isn't proof of a shot — a dropped bat, a gate clang, or the next net's hit all spike
too. What separates a real shot is the story around it in time.
Think of a genuine shot as a little three-act play: a build-up before (footsteps, a run-up, the
ambient lift), the impact itself, and a reaction after ("shot!", "ohh", a laugh).
Each act lives in a different part of the sound and a different slice of time around the moment of impact. CricCuts
measures a small set of cheap context cues per candidate and uses them as evidence.
| Cue | What it captures |
| --- | --- |
| Build-up | The low-frequency lift before a shot: footsteps, a run-up, the ambient rise. A real build-up is consistent and rising; random rumble flickers. |
| Reaction | A short burst of voice just after the shot. A genuine "shot!" peaks sharply; continuous commentary stays flat. |
| Surrounding chatter | How much sustained talking surrounds the impact, used to keep chatter-heavy clips out of the way. |
| Lead-in speech | Whether the clip would open mid-word. A high value = poor viewing, so it's discouraged. |
🗣️
The speech cues (reaction, surrounding chatter, lead-in speech) all read voice-range energy, and energy
alone cannot tell a shouted word from a bat-crack (both are loud in the voice range). So CricCuts also runs a
small neural voice-activity model that answers the one question energy can't: "is this spike actually
someone talking, or a hit?" We give it a full teardown in Module 9.
🔑
Why this beats a big neural net here: each cue
is a few cheap reads of the waveform — microseconds of work, fully explainable. For dozens of candidate impacts the
whole context pass finishes in well under a second on a mid-range phone, and you can always reason about why a
clip scored the way it did.
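To show how cheap such a cue can be, here is one possible "peakedness" measure for the reaction cue. The formula (one minus mean over max) and the window length are assumptions made for this sketch; the text only says that a real reaction peaks sharply while commentary stays flat.

```python
def reaction_cue(voice_energy: list[float], impact: int, span: int) -> float:
    """Peakedness of voice-band energy just after the impact, in [0, 1]."""
    after = voice_energy[impact + 1:impact + 1 + span]
    if not after or max(after) <= 0.0:
        return 0.0
    # Flat chatter has mean close to max (score near 0); a lone burst has mean << max.
    return 1.0 - (sum(after) / len(after)) / max(after)


burst = [0.1, 0.1, 0.1, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1]   # one sharp shout after frame 0
drone = [1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0]   # steady commentary
print(round(reaction_cue(burst, 0, 8), 2), round(reaction_cue(drone, 0, 8), 2))
```

The burst scores high and the drone scores low, and the reason is visible in two lines of arithmetic, which is the explainability the callout above is talking about.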
Why does a "reaction" cue help separate real shots from background noise?
A sharp burst of voice just after impact is a strong, cheap signal that something worth keeping just happened — quite different from the flat drone of constant chatter.
We now have, per candidate, a few pieces of evidence — an impact strength, a build-up cue, a reaction
cue (each between 0 and 1). How do we fuse them into one confidence? The way you combine them encodes a whole
philosophy.
A forgiving average vs a strict combine
A plain average is forgiving: a brilliant build-up can paper over a missing reaction. A
strict combine (think multiplying the ingredients rather than adding them) is unforgiving — if
any ingredient is near zero, the whole score is near zero. For "are all the signs of a real shot present?",
strictness is exactly what we want: a moment with no impact isn't a shot, full stop.
CricCuts leans on a strict combine for its core confidence, with sensible fallbacks for the cases where one cue is
genuinely, legitimately missing (a self-feed net with no bowler run-up, say). Drive it yourself: set the three
ingredients and watch how much harsher the strict combine is than a plain average.
Live demo (illustrative): strict combine vs forgiving average. Readouts: strict combine, plain average.
Try this: set impact 0.9, reaction 0.9, build-up 0.0. The plain average stays at a comfortable 0.6, but
the strict combine craters toward 0, because one missing ingredient should make the whole
"is-this-a-shot?" answer doubtful. That strictness is the point; CricCuts then handles the genuinely-missing-cue
cases separately so real shots aren't punished unfairly.
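The same comparison in code, as a minimal sketch. A geometric mean stands in for "multiplying the ingredients", and the `missing` argument is a hypothetical way to model a legitimately absent cue; neither is claimed to be the formula CricCuts ships.

```python
from math import prod


def plain_average(cues: dict[str, float]) -> float:
    return sum(cues.values()) / len(cues)


def strict_combine(cues: dict[str, float], missing: frozenset[str] = frozenset()) -> float:
    """Geometric mean of the cues, skipping any flagged as legitimately absent."""
    used = [v for name, v in cues.items() if name not in missing]
    return prod(used) ** (1.0 / len(used)) if used else 0.0


cues = {"impact": 0.9, "reaction": 0.9, "build_up": 0.0}
print(plain_average(cues))                                     # about 0.6
print(strict_combine(cues))                                    # 0.0
print(strict_combine(cues, missing=frozenset({"build_up"})))   # about 0.9
```

The last line is the fallback idea: when a cue is known to be absent for a good reason, it is left out of the product instead of being counted as a zero.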
Impact 0.9, reaction 0.9, build-up 0.0. Why does the strict combine come out near 0?
A strict combine multiplies the ingredients, so any near-zero value drags the result to zero. That's deliberate — and it's exactly why CricCuts has dedicated handling for the cases where a cue is legitimately absent.
No fixed setting is right for every session. So CricCuts asks you to label a handful of carefully
chosen clips, then propagates that judgement across the whole session — without retraining any model. This is
active learning: the machine picks the most informative questions; you answer a few.
Which clips does it ask about?
Labelling random clips wastes your taps. CricCuts deliberately picks ones that resolve the most
uncertainty — for example the most confident pick (a sanity check), a borderline one right on the fence (where
your verdict settles the most ambiguity), one from a dense cluster, and an odd-one-out. A few well-chosen taps teach
it far more than a dozen random ones.
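A rough sketch of that selection, under stated assumptions: each moment has a confidence and a small feature vector, the "fence" sits at 0.5, and density is approximated by mean distance to every other moment. All of those choices, and the name `pick_questions`, are illustrative only.

```python
from math import dist


def pick_questions(conf: list[float], feats: list[tuple[float, ...]], fence: float = 0.5) -> dict[str, int]:
    """Choose which moments to ask the user about (indices into the lists)."""
    n = range(len(conf))

    def mean_distance(i: int) -> float:
        return sum(dist(feats[i], feats[j]) for j in n if j != i) / (len(conf) - 1)

    return {
        "most_confident": max(n, key=lambda i: conf[i]),
        "borderline": min(n, key=lambda i: abs(conf[i] - fence)),
        "dense_cluster": min(n, key=mean_distance),
        "odd_one_out": max(n, key=mean_distance),
    }


conf = [0.95, 0.52, 0.80, 0.30, 0.10]
feats = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
questions = pick_questions(conf, feats)
print(questions)
```

With at least two moments this returns four indices; the same moment can win more than one category, in which case fewer taps are needed.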
How your verdict spreads
Each moment can be described by a few of its characteristics, which place it as a point in an abstract
"similarity space" — clips that sound alike sit close together. When you mark a clip, your verdict ripples
outward to its neighbours by closeness: confirm a great shot and acoustically similar ones get a lift; flag a false
alarm and look-alikes get pushed down. One tap can rescue every similar borderline shot, or quash a whole class of
background noise, across the session.
Below is a flattened picture of that idea. Each dot is a detected moment; nearer dots are more alike. Pick
Boost then click a dot to lift its neighbours, or Suppress then a dot to push its neighbours down. The
effect fades with distance — far-away, dissimilar moments are barely touched.
Live demo (illustrative): curation in similarity space. Pick a mode, then click any dot.
Brightness = current confidence. One label radiates a boost (or penalty) to nearby, similar
moments — so a single tap can lift every look-alike shot, or quash a whole class of background noise, at once.
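The ripple can be sketched with a distance-weighted nudge. The Gaussian fall-off, the radius and the size of the boost are assumptions for this example; the text only commits to "the effect fades with distance".

```python
from math import dist, exp


def spread_verdict(conf, feats, labelled, boost, radius=1.0):
    """Nudge every moment's confidence by a Gaussian-weighted share of `boost`."""
    updated = []
    for c, f in zip(conf, feats):
        weight = exp(-(dist(f, feats[labelled]) / radius) ** 2)   # 1 at the label, ~0 far away
        updated.append(min(1.0, max(0.0, c + boost * weight)))
    return updated


conf = [0.5, 0.5, 0.5]
feats = [(0.0, 0.0), (0.5, 0.0), (6.0, 0.0)]
lifted = spread_verdict(conf, feats, labelled=0, boost=+0.3)    # confirm a great shot
pushed = spread_verdict(conf, feats, labelled=0, boost=-0.3)    # flag a false alarm
print([round(c, 3) for c in lifted], [round(c, 3) for c in pushed])
```

The labelled clip gets the full effect, its near neighbour most of it, and the distant moment is left essentially where it was. No model weights change anywhere, which is why this needs no retraining.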
You mark one clip as a great shot. What happens to a far-away, acoustically different moment?
Boosts and penalties scale with closeness in similarity space. A distant, dissimilar moment is essentially untouched; only look-alikes near the labelled point feel the effect.
Audio alone is fast and usually enough. But when you want extra certainty, CricCuts can look
at the frames and confirm a body actually moved in your batting area at the moment of impact. This pass is opt-in,
because it's heavier — and because which visual approach works depends on the camera angle.
Distance changes everything
You tell the app where the camera is, and that picks the strategy. From the bowler's end or side-on, the whole
body is in frame, so a pose model can help. From the batter's end, the camera is close and only
feet, pads and the bat's downswing show — so pose is off the table and CricCuts leans on motion instead.
| Camera | What's in frame | Video signal used |
| --- | --- | --- |
| Bowler's end / Side-on | full body, far | Motion + presence; optional deep pose (needs the body visible) |
| Batter's end | close: feet, pads, bat downswing | Motion + presence only (pose is off: the upper body is out of frame) |
Motion plus a presence check
You draw a box around the batter once. For each impact, CricCuts samples a few frames around the moment, measures
how much the pixels inside the box change (a shot is a burst of change), and separately checks that
the framing still matches your reference — that the camera is genuinely still on the batter. Motion
says "something moved"; presence says "and it was still the batter." Both must agree, which is what stops a whole-frame
camera pan from faking a shot.
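A toy version of "both must agree" might look like this. Frames are tiny grayscale grids, motion is the mean pixel change inside the box, and presence is approximated by checking that the area outside the box still matches a reference frame. That presence test and both thresholds are assumptions made for the sketch, not a description of the shipped check.

```python
Frame = list[list[int]]            # grayscale rows, 0..255
Box = tuple[int, int, int, int]    # top, left, bottom, right (exclusive)


def mean_abs_diff(a: Frame, b: Frame, box: Box, inside: bool) -> float:
    """Average pixel change, measured either inside or outside the box."""
    top, left, bottom, right = box
    total = count = 0
    for y, (row_a, row_b) in enumerate(zip(a, b)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if (top <= y < bottom and left <= x < right) == inside:
                total += abs(pa - pb)
                count += 1
    return total / count if count else 0.0


def confirms_shot(before, after, reference, box, motion_min=20.0, drift_max=10.0) -> bool:
    motion = mean_abs_diff(before, after, box, inside=True)       # "something moved"
    drift = mean_abs_diff(reference, after, box, inside=False)    # has the framing changed?
    return motion >= motion_min and drift <= drift_max            # both must agree


box = (1, 1, 3, 3)
still = [[50] * 4 for _ in range(4)]
swing = [[200 if 1 <= y < 3 and 1 <= x < 3 else 50 for x in range(4)] for y in range(4)]
pan = [[200] * 4 for _ in range(4)]
print(confirms_shot(still, swing, still, box))   # True: change is confined to the box
print(confirms_shot(still, pan, still, box))     # False: the whole frame changed
```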
The deep path: pose landmarks
If you opt into deep verification at a far angle, a pose model finds the body's key joints in each frame, and
CricCuts derives swing-like cues from the wrists and shoulders. We meet that model, MediaPipe Pose,
properly in Module 9.
Why combine motion with a presence check instead of using motion alone?
Motion alone confirms something moved — including the whole frame during a pan. The presence check ensures the motion happened while the framing still matched the batter, so only real in-box movement counts.
Audio confidence, optional video confidence, spoken keywords — each moment collects its evidence,
and one final number decides whether it gets auto-ticked onto your reel.
The fusion is built on a few deliberate principles, without giving away the dials:
Audio leads. The acoustic evidence is the backbone; a strong, clean audio shot can stand on its own.
Video can pull a score down, not just up. Because the video signal measures "did the batter actually
move," a loud audio spike with a motionless batter is demoted below the auto-select line — a likely false positive
caught.
But video can't erase a strong audio event. One occluded or missed frame can soften a great
audio shot, never delete it. Evidence degrades gracefully; it doesn't veto catastrophically.
Keywords nudge. If a word from your commentary list ("four!", "out!", "shot!") lands near a moment, it
gets a small bump — that's the speech model's job, next module.
🔑
Auto-select, not auto-delete. Fusion only
decides the default tick. Every moment still exists; you can flip any of them. The engine enriches; you
decide.
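The principles above can be sketched in a few lines. Every number here is invented for illustration, since the real dials are deliberately not published; the point is only the shape: audio is the base, video moves the score by a bounded amount, keywords add a small bump, and the result sets a default tick rather than deleting anything.

```python
from typing import Optional


def fuse(audio: float, video: Optional[float] = None, keyword_hit: bool = False) -> float:
    """Audio-led fusion; all weights below are illustrative placeholders."""
    score = audio
    if video is not None:
        # Video shifts the score by at most +/-0.25: it can demote, never erase.
        score += 0.5 * (video - 0.5)
    if keyword_hit:
        score += 0.05                      # a small nudge, not a verdict
    return min(1.0, max(0.0, score))


AUTO_SELECT = 0.6                          # hypothetical default-tick line
for label, s in [
    ("audio only", fuse(0.7)),
    ("motionless batter", fuse(0.7, video=0.0)),
    ("strong audio, bad frame", fuse(0.95, video=0.0)),
]:
    print(label, round(s, 2), "ticked" if s >= AUTO_SELECT else "kept, unticked")
```

The middle case is the demoted false positive; the last case shows a strong audio shot surviving a bad video reading.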
A loud spike, but the batter is motionless on video. What happens?
Video can demote. A motionless batter drops the fused score under the auto-select line, so the likely false positive isn't pre-ticked — yet the enrich-not-filter rule means the moment is never destroyed.
The models up close — pose, speech & voice activity
Most of CricCuts is classical signal processing — deliberately, because it's fast, explainable and
needs no training data. But three genuine, open-source machine-learning models ride along: a body model, a
words model, and a voice model. Here's how each is built, what it can do (including things CricCuts
doesn't use), and where it breaks.
① MediaPipe Pose Landmarker (the body model)
Google's open-source MediaPipe Pose Landmarker — a small neural network that finds the body's
key landmarks (eyes, shoulders, elbows, wrists, hips, knees, ankles…) in an image.
How it works
Two-stage design. A lightweight detector first finds the person's bounding box; a
landmark network then locates the joints inside it. Splitting the job keeps each network small and fast.
Convolutional backbone trained on a large, diverse human-pose dataset. For each landmark it outputs a
position plus a visibility score — how confident it is the joint is actually in frame.
On-device. It runs locally, costing a few seconds per event on a phone — which is why CricCuts keeps it
opt-in rather than always-on.
What it can do — including things CricCuts doesn't use
A full 3-D skeleton (a depth estimate per joint), multi-person tracking, segmentation masks, and real-time
tracking on mid-range phones.
It's the backbone for fitness rep-counters, gesture interfaces, AR avatars and physiotherapy tools — anything
that needs a cheap skeleton. CricCuts taps only a sliver: a few wrist/shoulder cues and a tidy bounding box.
Limitations (why it's optional, not core)
Needs the body in frame. At the batter's end only feet/pads show, so pose can't help there.
Small, distant or occluded figures (the norm in nets) give noisy landmarks; low-visibility joints are
discarded to avoid garbage.
No cricket knowledge. It knows "wrist," not "cover drive." All cricket meaning is added by CricCuts'
own logic on top.
② Vosk speech recognition (the words model)
Vosk — a compact, offline Automatic Speech Recognition (ASR) toolkit. CricCuts uses a
small English model to power on-demand commentary-keyword tagging: transcribing the words around a
shot so it can tag the clip "boundary," "wicket," and so on.
How it works
Kaldi lineage. Vosk wraps the well-known Kaldi speech toolkit. It's a classic ASR stack: an acoustic
model (audio → speech-sound probabilities), a pronunciation lexicon, and a language model (which word sequences are
plausible), decoded together to produce text.
Compact & offline. The "small" models trade a little accuracy for size and speed, so the whole thing
runs on a phone with no network.
Keyword-focused. Rather than transcribe everything, CricCuts points Vosk at the active keyword list.
Listening for a short, known set of words is faster and more accurate than open dictation.
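Once a recogniser has produced words with timestamps, turning them into tags is simple bookkeeping. The sketch below assumes a list of (word, time) pairs and a hypothetical keyword-to-tag mapping; it does not call Vosk itself.

```python
TAGS = {"four": "boundary", "six": "boundary", "out": "wicket"}   # hypothetical mapping


def tags_near(words: list[tuple[str, float]], impact_s: float, window_s: float = 3.0) -> set[str]:
    """Tags for keywords spoken within `window_s` seconds of the impact."""
    return {
        TAGS[w.lower()]
        for w, t in words
        if w.lower() in TAGS and abs(t - impact_s) <= window_s
    }


# (word, start time in seconds), the shape a word-timestamped transcript might take.
words = [("lovely", 11.2), ("four", 12.9), ("out", 40.1)]
print(tags_near(words, impact_s=12.0))   # {'boundary'}
```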
What it can do — including unused capabilities
Full continuous dictation, word-level timestamps, alternative hypotheses, streaming partial results, and dozens
of languages via swappable models.
CricCuts uses only word-spotting plus timestamps for tags today; the broader dictation power sits in reserve (a
full, searchable match transcript is a natural future feature).
Limitations
Noisy outdoor audio — wind, crowd, distance — degrades recognition; the keyword focus helps a lot.
Accent- and vocabulary-bound to its training; a small model also caps the accuracy you can expect.
Compute cost — transcribing a whole match on-device is slow, which is exactly why tagging is opt-in and
scoped to the clips you pick, not run during the first pass.
③ Silero VAD (the voice model), via ONNX Runtime
Silero VAD — a tiny, open-source neural Voice-Activity-Detection model, run through
ONNX Runtime (Microsoft's cross-platform engine for running models in the open ONNX format). Where
Vosk asks the expensive question "which words?", VAD asks the cheap one: "is someone speaking right
now — yes or no?" Because that question is so much smaller, the model is small and fast enough to run over
the whole recording in the very first pass, producing a complete speech timeline.
🔑
The one problem only this model can solve.
Every energy-based speech cue (Module 4) just measures loudness in the voice range — and a shouted "shot!" and a
bat-on-ball crack are both loud there. Energy genuinely cannot tell them apart. A model
trained on the structure of the human voice (its pitch and harmonics) can: at a real crack its speech
probability is near zero; during talking it's near one.
How it works
A tiny recurrent network. It takes a short audio chunk and emits one number — speech probability — while
carrying a memory of recent context forward. So each decision uses what came just before, not only the
instant: it "knows" it's mid-sentence.
Runs through ONNX Runtime on the CPU. Shipping in the open ONNX format means the same model file runs
across platforms through one well-optimised engine.
Interesting ideas it shows off
VAD vs ASR. Recognising words means searching a huge space of word sequences — expensive. Detecting
voice is a tiny yes/no — cheap enough to be always-on. Picking the smallest model that answers your actual
question is the whole game.
Smoothing jitter into segments. A raw per-frame probability is jittery, so it's smoothed — using a little
hysteresis (a "dead band," the same trick a thermostat uses) — into clean speech segments, so one sentence's natural
pauses don't shatter it into fragments.
Language-agnostic. It detects voice, not English — so it works on any commentary, any accent,
any language, with no extra model.
Fail-safe. Any hiccup just yields an empty timeline, and the pipeline quietly falls back to its
energy-only behaviour. The model can only help, never break analysis.
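The thermostat trick from the list above can be sketched as a two-threshold pass over per-frame probabilities. The 0.6 and 0.4 thresholds are example values, and the input is just a list of numbers; this is not Silero's own post-processing code.

```python
def speech_segments(probs: list[float], on: float = 0.6, off: float = 0.4) -> list[tuple[int, int]]:
    """Turn per-frame speech probabilities into (start, end) frame ranges."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= on:          # needs a clear "yes" to open
            start = i
        elif start is not None and p < off:    # and a clear "no" to close
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(probs)))
    return segments


probs = [0.1, 0.7, 0.9, 0.5, 0.45, 0.8, 0.9, 0.2, 0.1, 0.05]
print(speech_segments(probs))                    # [(1, 7)]
print(speech_segments(probs, on=0.6, off=0.6))   # [(1, 3), (5, 7)]
```

With a dead band between the two thresholds, the dip to 0.45 mid-sentence is ridden out and the sentence stays one segment. Collapse the band to a single threshold and the same sentence shatters in two.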
Play with it below. The scene has a stretch of conversation, a bat-crack, and an
exclamation. The loud talk and the crack both trip the energy detector — but watch the VAD curve
tell them apart. Drag the threshold to see where it decides "speech."
Live demo (illustrative): speech vs a bat-crack, what the VAD hears. Readouts: loud-talk spike, bat-crack spike.
Top track = impact energy (what the peak-picker sees) — it spikes on the loud talking
and the crack. Bottom track = the VAD speech probability; shaded bands are detected speech. The
crack lands in a gap (probability ≈ 0) → kept as a likely hit; the talk spike sits inside a speech run →
recognised as talking, not a shot.
Limitations
It hears voice, not cricket. Loud singing, a PA announcement or a crowd chant can read as "speech" — it's
a voice detector, not a commentary understander.
Very short shouts can fall below the minimum segment length and be dropped as blips.
Runtime weight. The runtime adds some native code per device architecture, so the app ships only the
slice each phone needs.
🧠
The deliberate philosophy: heavy models are
used surgically — pose only when you ask, word-recognition only on the clips you pick. The one model cheap
enough to run always-on (VAD) does so precisely because it answers the smallest possible question. The rest of
the always-on path is plain, transparent signal processing. That's what keeps the default analysis fast, fully
offline, on a phone that might be five years old.
A loud shout and a bat-crack are both loud in the voice range. How does the VAD model tell them apart when energy can't?
Loudness and band don't separate them — both are loud and in-range. The model recognises the shape of voiced speech, which a percussive transient simply doesn't have. That's the one question energy heuristics can't answer.
Why point Vosk at a keyword list instead of full dictation?
Vosk is perfectly capable of full dictation — but focusing it on the active keyword list narrows the search massively, making tagging both faster and more robust to net-session noise. The broader capability is held in reserve.
"Noise reduction" in CricCuts isn't one filter — it's defence in depth. A net session is one
of the most hostile audio environments imaginable: multiple nets, wind, traffic, constant chatter. Several independent
layers each fight a different part of it.
| Layer | Removes |
| --- | --- |
| Band-pass filter | wind, rumble and hiss, physically attenuated before detection (Module 2). |
| Adaptive baseline threshold | slow level drift and the ambient floor: steady noise never trips the bar, only spikes above it do (Module 3). |
| NMS peak-picking | an impact's "ringing" duplicates, so one hit registers once (Module 3). |
| Reaction / peakedness cues | continuous commentary: flat talking scores low where a sharp burst scores high (Module 4). |
| Neural VAD model | vocal-only spikes & pure conversation: a trained voice detector catches what energy alone can't (Module 9 ③). |
| Lead-in discouragement | clips that would open mid-word (Module 4). |
| Curation similarity | a whole class of recurring noise: mark one, every look-alike is suppressed (Module 6). |
| Presence check (video) | camera pans / framing loss: motion only counts while the batter is still framed (Module 7). |
🛡️
No single layer is perfect — together they're
robust. Wind beats the threshold? The band-pass already killed it. Chatter sneaks through the band-pass? The
reaction cue and VAD catch it. The next net spikes? Curation similarity suppresses it. Each layer covers another's
blind spot.
Constant crowd chatter survives the band-pass (it's in the voice range). What stops it being scored as shots?
Voice-range chatter does pass the filter — that's why the reaction/peakedness cue exists (continuous talking is flat, not bursty) and why the VAD model is there to recognise sustained speech for what it is.
A scored list of impacts isn't a highlights reel. The last mile turns moments into watchable clips,
removes duplicates, fits your chosen length, and exports a branded MP4 — and, as a bonus, saves a measurable amount of
carbon.
From an instant to a clip
An impact is a point in time; a clip is a window. CricCuts pads each impact bias-aware — a little
more time after the hit (to catch the ball flight and the shout) than before — and keeps that feel even when
a clip is stretched to a minimum length, so the impact stays where it belongs rather than drifting to the centre.
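A minimal sketch of bias-aware padding, assuming one second before and two after the hit. Those pad values, and the proportional stretch used to reach a minimum length, are illustrative choices rather than the app's settings.

```python
def pad_clip(impact_s: float, before_s: float = 1.0, after_s: float = 2.0,
             min_len_s: float = 0.0) -> tuple[float, float]:
    """Clip window around an impact, keeping the before:after ratio when stretched."""
    length = before_s + after_s
    scale = max(1.0, min_len_s / length)          # stretch both sides proportionally
    start = max(0.0, impact_s - before_s * scale)
    return start, impact_s + after_s * scale


print(pad_clip(10.0))                   # (9.0, 12.0)
print(pad_clip(10.0, min_len_s=6.0))    # (8.0, 14.0): impact still one third of the way in
```

A real implementation would also clamp the end of the window to the length of the recording, which this sketch leaves out.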
Clusters — one moment, one row
Overlapping nets and impact-ringing can produce several events for one physical shot. CricCuts groups nearby
events into a cluster, picks the strongest as the anchor, and shows one merged row — so the timeline
reflects shots, not duplicates.
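Grouping nearby events can be sketched as a single sweep over the events in time order. The one-second gap is an example value, and events are reduced to (time, score) pairs for brevity.

```python
def cluster_events(events: list[tuple[float, float]], max_gap_s: float = 1.0) -> list[dict]:
    """Group (time_s, score) events whose neighbours are within max_gap_s."""
    clusters: list[list[tuple[float, float]]] = []
    for ev in sorted(events):
        if clusters and ev[0] - clusters[-1][-1][0] <= max_gap_s:
            clusters[-1].append(ev)
        else:
            clusters.append([ev])
    # The strongest member anchors each merged row.
    return [
        {"anchor_s": max(c, key=lambda e: e[1])[0], "members": len(c)}
        for c in clusters
    ]


events = [(12.0, 0.6), (12.4, 0.9), (12.9, 0.5), (30.0, 0.8)]
rows = cluster_events(events)
print(rows)   # two rows: one anchored at 12.4 with 3 members, one at 30.0
```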
Fitting your target length
Ask for a 60-second reel and CricCuts packs the best clusters until the budget is spent, then trims to hit your
target exactly. You choose the feel:
Equal — every clip shares one ceiling; the longest lose the most.
Intelligent (default) — trims the biggest clips first and protects short shots, so a crisp 2-second shot
is never sliced to fit a sprawling one.
The same logic powers the preview and the final render — what you preview is what you export.
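As an illustration of the Equal mode only, here is one way to find a shared ceiling so the trimmed clips add up to the target. The function name and the example lengths are hypothetical, and the Intelligent mode is not reproduced because its exact rules are not spelled out here.

```python
def shared_ceiling(lengths: list[float], target: float) -> float:
    """Largest cap such that sum(min(length, cap)) does not exceed target."""
    if sum(lengths) <= target:
        return max(lengths, default=0.0)          # nothing needs trimming
    remaining, left = target, len(lengths)
    for length in sorted(lengths):
        if length * left >= remaining:            # this clip and all longer ones hit the cap
            return remaining / left
        remaining -= length                       # short clip fits untouched
        left -= 1
    return 0.0


lengths = [2.0, 5.0, 9.0, 12.0]
cap = shared_ceiling(lengths, target=20.0)
trimmed = [min(x, cap) for x in lengths]
print(cap, trimmed)   # 6.5 [2.0, 5.0, 6.5, 6.5]
```

The 12-second clip loses 5.5 seconds and the 9-second clip loses 2.5, which is the "longest lose the most" behaviour described for Equal.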
Export — branded, upright, intact
The reel is encoded with Google's hardware-accelerated Media3 Transformer (no heavyweight
transcoder), stitching the clips, stamping the CricCuts watermark (and optional QR / handle), and rotating portrait
phone video so it comes out upright with the branding in the right corner. Capture date and location metadata are
carried onto the output so your highlights stay organised in the gallery.
The carbon bonus
Because every frame is processed on your phone, you never upload a multi-hundred-megabyte raw session to
a cloud editor. CricCuts estimates the emissions you avoid — the network transfer and data-centre compute you didn't
spend, minus the tiny on-device cost — and shows the grams of CO₂ saved on every export. A small, quantified reminder
that edge AI is lighter on the planet, printed right on the share caption.
"Intelligent" trim mode protects short shots by…
Intelligent trim takes time from the largest clips first, and a floor stops any clip being chopped below a watchable minimum — so short, crisp shots survive intact. "Equal" is the make-them-all-the-same alternative.
🎓 Course complete
You've followed a cricket highlight from air-pressure wobbles to a shareable reel: sampling, band-pass filtering,
loudness and adaptive thresholding, peak-picking, temporal context, strict scoring, active-learning curation, pose
& motion video confirmation, multi-signal fusion, layered noise reduction, and length-aware export — every bit of
it running offline on the phone in your pocket.
The throughline: prefer the simplest signal that does the job, keep every decision explainable, and never
throw the user's data away. That's how a phone watches cricket and cuts the highlights itself.
🏏
Want to see it on your own footage? Grab the app and run a net session through it: free, on your phone, no cloud.
Everything in this course happens in the few seconds after you tap analyse.