No Meter Running
I moved face recognition in my family photo archive to local models because the cloud path was starting to make every experiment feel expensive.
Local models are leverage when you can run them again and again without watching the meter.
The family photo archive started as a migration project: get years of photos and videos out of a hosted gallery and into a self-managed system without losing albums, originals, captions, dates, and the small bits of context that make old photos feel alive. Once that worked, the next obvious feature was people.
Not public face search. Not a social network tagging system. Just the family version of the question everyone eventually asks of a photo library: show me the beach photos with both daughters; find the Christmas album where everyone was around the table; help me build a book without manually opening fifteen years of folders.
The first version used AWS Rekognition because it was already close at hand and it proved the idea quickly. There is nothing wrong with that path. The problem was the meter. During development I spent more than $80 in token and API usage on one model path while I was still tuning. That was not catastrophic, but it changed how I worked. Every rerun felt like a small decision.
Local models changed the mood of the project. I could run the same album again, adjust a threshold, inspect what changed, and run it again. That loop is where the feature got good.
The Point
A lot of people try local models for agentic development, compare them to frontier cloud models, and come away disappointed. I get it. If the job is open-ended reasoning, writing, planning, coding, and tool use, the gap can be obvious.
But that is not the only way to use local models. In this project, the model was not the whole product. It had a simple job: find faces, turn each face into a numerical fingerprint, compare that fingerprint against known people, and return a structured answer.
The rest of the system did the careful work around it: measuring results, tuning thresholds, tracking where every decision came from, and deciding when a match was strong enough to publish automatically. That is the lesson I want to keep. Local models can be underwhelming as general replacements and very useful as specialized workers.
The Models
I tried a few local face models before picking the production path. The table below is intentionally plain-English: who made it and how I used it.
| Local model | Maker / link | Settings I used |
|---|---|---|
| InsightFace buffalo_l | DeepInsight / InsightFace | Two detector passes: 640px at 0.5 confidence, then 960px at 0.4 confidence. Recognition input is 112x112 and outputs a 512-number face fingerprint. |
| OpenCV YuNet | OpenCV Zoo, based on YuNet/libfacedetection work | face_detection_yunet_2023mar.onnx, 640x640 input, score threshold 0.6, NMS threshold 0.3. |
| OpenCV SFace | OpenCV Zoo | face_recognition_sface_2021dec.onnx, 112x112 face crop input, 128-number face fingerprint output. |
| MediaPipe BlazeFace short range | Google AI Edge / MediaPipe | blaze_face_short_range.tflite, detection only, default-style confidence filtering. |
The Simple Architecture
The working system is easier to explain without the implementation names:
1. Use the smaller web image for speed unless the album needs extra detail.
2. Run InsightFace twice: once precise, once with higher recall for small or harder faces.
3. Turn each face into a list of numbers that can be compared.
4. Compare the new fingerprint against approved examples for each person.
5. Strong matches publish automatically. Weak matches stay out of search.
6. The family sees normal photo pages, not model output.
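The comparison step can be sketched in a few lines of plain Python. This is a minimal illustration, not the gallery's actual code: it assumes cosine similarity as the comparison metric and scores each person by their single best approved example, neither of which is stated above. The names `MatchResult`, `cosine`, and `match_face` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchResult:
    person: Optional[str]  # best-matching person, or None if nobody is approved yet
    score: float           # similarity to the best person
    margin: float          # gap between the best and second-best person


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two fingerprints (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_face(fingerprint: list, approved: dict) -> MatchResult:
    """Compare one fingerprint against approved examples for each person.

    `approved` maps a person's name to a list of example fingerprints.
    """
    # Score each person by their closest approved example, best person first.
    scores = sorted(
        (
            (max(cosine(fingerprint, ex) for ex in examples), name)
            for name, examples in approved.items()
            if examples
        ),
        reverse=True,
    )
    if not scores:
        return MatchResult(None, 0.0, 0.0)
    best_score, best_name = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return MatchResult(best_name, best_score, best_score - runner_up)
```

Returning the margin alongside the score matters later: a high score is less convincing when a second person scores almost as well.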
Why The Harness Mattered
The best decision was not picking a model. It was keeping the model behind a contract. A model can find faces and suggest matches, but the gallery decides what counts as a real person tag.
That made the work reversible. I could swap providers, rerun an album, compare results, and keep the public gallery stable. It also made automatic matching much safer, because the model never got to publish a guess just because it had a score.
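One way to picture that contract is an interface the model side implements and a gallery object that owns the publishing decision. The sketch below is an assumption about shape only: `Suggestion`, `FaceProvider`, `Gallery`, and the `min_score` rule are hypothetical names and not taken from the real system.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Suggestion:
    person: str
    score: float
    provider: str  # which model produced this, kept for provenance


class FaceProvider(Protocol):
    """Anything that can look at an image and suggest people."""

    name: str

    def suggest(self, image_path: str) -> list: ...


class Gallery:
    """Owns the decision about what becomes a real person tag."""

    def __init__(self, min_score: float):
        self.min_score = min_score  # hypothetical acceptance rule
        self.tags = {}              # image path -> accepted suggestions

    def ingest(self, image_path: str, provider: FaceProvider) -> None:
        # A suggestion is only a proposal; the gallery applies its own rule.
        accepted = [
            s for s in provider.suggest(image_path) if s.score >= self.min_score
        ]
        # Overwrite earlier results so a rerun or provider swap is reversible.
        self.tags[image_path] = accepted
```

Because the gallery only depends on the `suggest` method, swapping a cloud provider for a local one is a change on one side of the contract.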
Making It Automatic
At first I assumed a human would need to approve most matches. That would have worked for a tiny album, but not for tens of thousands of photos. The better question was: which matches are strong enough that I should not have to look at them?
To answer that, I used albums that already had reviewed labels. The local pipeline matched those albums without looking at the answers, then compared its guesses against the known labels. I swept the match score, the gap between the best and second-best person, the face detection confidence, and the minimum face size.
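A sweep like this can be sketched as a grid search over the four knobs, scored against the reviewed labels. The code below is a simplified illustration under stated assumptions: the `Candidate` fields, the grid-search strategy, and the rule "pick the zero-error profile with the most approvals" are hypothetical and may differ from how the real calibration worked.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass
class Candidate:
    predicted: str         # person the pipeline guessed
    truth: Optional[str]   # reviewed label; None means an unknown person
    score: float           # match score
    margin: float          # gap between best and second-best person
    det_conf: float        # face detection confidence
    face_px: int           # shorter side of the face box, in pixels


def evaluate(cands, min_score, min_margin, min_det, min_px):
    """Count what one threshold profile would auto-approve."""
    approved = [
        c for c in cands
        if c.score >= min_score and c.margin >= min_margin
        and c.det_conf >= min_det and c.face_px >= min_px
    ]
    correct = sum(c.predicted == c.truth for c in approved)
    unknown = sum(c.truth is None for c in approved)
    return {
        "approved": len(approved),
        "correct": correct,
        "false": len(approved) - correct - unknown,
        "unknown": unknown,
    }


def sweep(cands, scores, margins, dets, sizes):
    """Return (profile, counts) for the zero-error profile with most approvals."""
    best = None
    for profile in product(scores, margins, dets, sizes):
        r = evaluate(cands, *profile)
        if r["approved"] == 0 or r["false"] or r["unknown"]:
            continue  # skip profiles that approve nothing or make any mistake
        if best is None or r["approved"] > best[1]["approved"]:
            best = (profile, r)
    return best
```

The important property is that the pipeline's guesses are produced first and only then compared with `truth`, so the thresholds are chosen against answers the model never saw.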
The strict profile, selected from prepared calibration assets, hit 100% auto-approval precision on the validation set: 117 correct automatic approvals, 0 false automatic approvals, and 0 unknown-person automatic approvals. A looser audit profile found more matches at slightly lower precision: 131 automatic approvals, of which 128 were correct and 3 were false (about 97.7% precision), still with 0 unknown-person automatic approvals.
That gave me two lanes. Strong matches go straight into the archive. Audit matches can be useful right away, but they stay easy to review. Weak matches stay out.
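The lane decision itself is small enough to show. In this sketch the threshold values and the names `Lane`, `STRICT_PROFILE`, `AUDIT_PROFILE`, and `route` are placeholders for illustration; the real profiles came out of the calibration sweep and also considered detection confidence and face size.

```python
from enum import Enum


class Lane(Enum):
    AUTO = "auto"    # published straight into the archive
    AUDIT = "audit"  # usable right away, but flagged for easy review
    HOLD = "hold"    # kept out of search


# Hypothetical (min_score, min_margin) pairs, not the calibrated values.
STRICT_PROFILE = (0.60, 0.15)
AUDIT_PROFILE = (0.50, 0.08)


def route(score: float, margin: float,
          strict=STRICT_PROFILE, audit=AUDIT_PROFILE) -> Lane:
    """Pick a lane for one match; the strict profile is checked first."""
    if score >= strict[0] and margin >= strict[1]:
        return Lane.AUTO
    if score >= audit[0] and margin >= audit[1]:
        return Lane.AUDIT
    return Lane.HOLD
```

For example, `route(0.7, 0.2)` lands in the automatic lane, while `route(0.7, 0.05)` is held back because the runner-up was too close.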
Tuning The Detector
The first local runs missed too many small faces. The obvious fix was to lower the detection threshold and hope for the best. That found more faces, but it also created more junk.
The better fix was a two-stage pass. First, run a normal detector at det_size=640 and det_thresh=0.5. Then run a higher-recall detector at det_size=960 and det_thresh=0.4. Keep the normal results. Rescue extra faces from the second pass only if they clear confidence, face-size, and overlap checks.
That made the model more sensitive without making it sloppy. If an odd result shows up later, the sidecar file records whether the face came from the normal pass or the rescue pass.
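The merge logic for the two passes might look roughly like the following. It is a sketch under assumptions: the detectors are treated as already-run inputs, and the rescue limits (`min_conf`, `min_side`, `max_iou`) are illustrative values rather than the tuned ones. The `source` field mirrors the provenance note described above.

```python
from dataclasses import dataclass


@dataclass
class Face:
    box: tuple               # (x1, y1, x2, y2) in pixels
    conf: float              # detector confidence
    source: str = "normal"   # "normal" or "rescue", kept for provenance


def iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_passes(normal, recall, min_conf=0.4, min_side=24, max_iou=0.3):
    """Keep every normal-pass face; rescue extras from the high-recall pass."""
    kept = [Face(f.box, f.conf, "normal") for f in normal]
    for f in recall:
        side = min(f.box[2] - f.box[0], f.box[3] - f.box[1])
        if f.conf < min_conf or side < min_side:
            continue  # fails the confidence or face-size check
        if any(iou(f.box, k.box) > max_iou for k in kept):
            continue  # overlaps a face that is already kept
        kept.append(Face(f.box, f.conf, "rescue"))
    return kept
```

Checking overlap against everything already kept, including earlier rescues, prevents the second pass from adding the same small face twice.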
How Fast It Was
My production logs currently measure full runs rather than every internal sub-step. So the honest number is end-to-end: download or read the media, decode it, find faces, make fingerprints, compare them, write sidecars, refresh search, and verify the publish.
| Measurement | What it includes | Average |
|---|---|---|
| Production, per media item | Full local pipeline on year-scale runs | 1.12 seconds per photo/video entry |
| Production, per detected face | Same full pipeline, divided by detected faces | 0.84 seconds per face |
| Typical year-scale run | Average of 2007, 2008, 2009, and 2013 slices | 23.6 minutes for about 1,263 entries and 1,679 faces |
| Early bakeoff: InsightFace | 80-entry sample, one local candidate run | 58.1 seconds total, about 0.73 seconds per entry |
| Early bakeoff: YuNet + SFace | 80-entry sample, one local candidate run | 32.0 seconds total, about 0.40 seconds per entry |
| Early bakeoff: MediaPipe detector | 80-entry sample, detection only | 1.9 seconds total, about 0.02 seconds per entry |
The exact seconds matter less than the shape of the loop. A year-sized run taking about twenty-five minutes means I can tune, rerun, and inspect without a cloud bill shaping every decision.
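As a quick consistency check, the per-item and per-face averages in the table follow directly from the typical year-scale run. The helper name `per_unit_seconds` is just for illustration.

```python
def per_unit_seconds(total_minutes: float, units: int) -> float:
    """Average wall-clock seconds per unit for a full run."""
    return total_minutes * 60 / units


# Typical year-scale run from the table above.
run_minutes, entries, faces = 23.6, 1263, 1679

print(round(per_unit_seconds(run_minutes, entries), 2))  # 1.12 seconds per entry
print(round(per_unit_seconds(run_minutes, faces), 2))    # 0.84 seconds per face
```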
What This Ran On
These numbers came from my local MacBook Pro: Apple M3 Max, 16 CPU cores with 12 performance cores and 4 efficiency cores, a 40-core GPU, and 64 GB of memory.
I did not capture CPU or GPU utilization in these run logs. The production InsightFace path was configured through ONNX Runtime's CPUExecutionProvider, so the timings above should be read as CPU-path timings on this Mac, not as a GPU benchmark.
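For reference, a CPU-only InsightFace setup with the two detector settings mentioned earlier could be configured roughly like this. Treat it as a sketch: the square `det_size` tuples, the dictionary layout, and the `build_detector` helper are assumptions, and the `FaceAnalysis` arguments should be checked against the installed InsightFace version.

```python
# Detector settings from this post; the dictionary layout is illustrative.
DETECTOR_PASSES = {
    "normal": {"det_size": (640, 640), "det_thresh": 0.5},
    "rescue": {"det_size": (960, 960), "det_thresh": 0.4},
}

# Restrict ONNX Runtime to the CPU path.
PROVIDERS = ["CPUExecutionProvider"]


def build_detector(pass_name: str):
    """Create a buffalo_l FaceAnalysis app for one detector pass."""
    # Imported lazily so this module loads even without insightface installed.
    from insightface.app import FaceAnalysis

    cfg = DETECTOR_PASSES[pass_name]
    app = FaceAnalysis(name="buffalo_l", providers=PROVIDERS)
    # ctx_id=-1 selects CPU in InsightFace's prepare() call.
    app.prepare(ctx_id=-1, det_thresh=cfg["det_thresh"], det_size=cfg["det_size"])
    return app
```

Calling `build_detector("normal")` and `build_detector("rescue")` would give two apps whose detections can then be merged; the model files are downloaded on first use.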
What Changed
The local path did not win because it was more glamorous. It won because it made the development loop cheap and repeatable. The model could be tested. The thresholds could be tuned. The automatic approvals could be checked against known answers. The public gallery could stay simple.
That is the bigger lesson for me. Local models do not need to replace frontier models at everything. They need a job they can do, a harness that measures them, and a product that knows when to trust them. When those pieces line up, local stops feeling like a compromise and starts feeling like leverage.