Pawlogue

The on-device model

This is the real model that runs inside the app, on your phone, with no server. Everything below is honest: real holdout numbers, the real classes, and the model running live on real clips right here in your browser.

← Back to the app

Data at a glance

We did not train on 124 clips total. That number is only the fine 8-class emotion dictionary, which is data-limited because the full labeled emotion database is gated by its authors. The detector, the affect head, and the whole dog side draw on far more.

datasets cataloged

~90 GB

audio pulled to disk

300k+

clips on disk (mostly dog)

5,728

clips: cat detector training

3,015

clips: dog detector training

124

open clips: fine emotion dictionary (the gated piece)

So the scale is real. The honest bottleneck is specifically labeled fine-emotion CAT audio, which barely exists in the open (the full Cat Sound V2 set, about 3,000 clips, is author-gated). Detector and dog data are plentiful. The full source list with licenses is on the science page.

Hear it on all 8 sounds

One real clip per class, run live through the exact on-device model as the page loads. Tap play to listen. The bars are the 8-class dictionary read. The "detector" line is a separate gate: it is meow-trained, so some loud non-meow clips (a defensive yowl, a fight) score low and would be held back as "not sure" in the app. The honest held-out accuracy per class is in the table below; these clips just let you hear the model run.

The 8 classes it knows

Cross-validated holdout (5-fold). Overall test accuracy 71.9%, macro-F1 0.72, versus a 21.9% guess baseline. Trained on 124 open clips, so this is the day-0 base that per-cat learning refines.
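As a reference for how the two headline metrics relate, here is a minimal sketch of computing accuracy and macro-F1 from true and predicted labels. The function name and any labels passed to it are illustrative assumptions, not the project's evaluation code or data.

```python
from collections import Counter


def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-F1) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty lists of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)

    accuracy = sum(tp.values()) / len(y_true)
    # Macro-F1 is the unweighted mean of per-class F1, so every class counts equally.
    return accuracy, sum(f1_scores) / len(f1_scores)
```

Because macro-F1 averages per-class scores without weighting by clip count, a single weak class pulls it down even when overall accuracy looks fine.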

Class	What it means	F1	Clips

Dropped on purpose: Paining (pain). It scored too low to report honestly, and pain is a clinical call we must never assert. Pain stays a "see a vet" nudge, never a translated verdict. Merged: Happy and Resting into Content (both are calm-positive, and the model could not tell them apart).

What ships, and how big

How it was trained

Protocol: 80/10/10 train/validation/test (the dictionary uses 5-fold cross-validation because only 124 clips exist). Class-balanced batches, so no class is starved. Group-level splits where IDs exist, so the same recording never leaks across train and test. The test set never touched training.
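The group-level split described above can be sketched as follows. This is a minimal illustration under assumptions, not the actual training code: the function name, the greedy fold assignment, and the example recording IDs are all hypothetical, and it does not attempt the class balancing mentioned in the protocol.

```python
from collections import defaultdict


def group_k_fold(group_ids, k=5):
    """Yield (train_idx, test_idx) pairs where no group appears on both sides."""
    groups = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        groups[gid].append(idx)
    if len(groups) < k:
        raise ValueError("need at least k distinct groups")

    # Greedy assignment: place the largest groups first, each into the
    # fold that currently holds the fewest clips.
    folds = [[] for _ in range(k)]
    for gid in sorted(groups, key=lambda g: (-len(groups[g]), str(g))):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[gid])

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f in range(k) if f != i for j in folds[f])
        yield train, test


# Hypothetical recording IDs, one per clip.
recording_ids = ["r1", "r1", "r2", "r3", "r3", "r3", "r4", "r5", "r5", "r6"]
for train_idx, test_idx in group_k_fold(recording_ids, k=3):
    train_groups = {recording_ids[i] for i in train_idx}
    test_groups = {recording_ids[i] for i in test_idx}
    assert not (train_groups & test_groups)  # no recording leaks across the split
```

Splitting by recording ID rather than by clip is what keeps two clips cut from the same recording from landing on opposite sides of a fold. If scikit-learn is available, its `StratifiedGroupKFold` offers a variant that also tries to keep class proportions similar across folds.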

Architecture: the cat-vs-not-cat detector is a small log-mel CNN. The dictionary and the calm-vs-distress affect heads are MFCC features plus logistic regression (the honest winner on small data). Trained from scratch, no giant pretrained backbone, to stay tiny and run fully offline.

Honesty check (parity): the feature math in the browser was verified to match the Python training pipeline (log-mel within 3.8e-4, MFCC within 7e-3). That is why the verdict you see in the app is the real model output, not an approximation.
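A parity check of this kind can be sketched as below. It assumes the reported figures are maximum absolute element-wise differences between the browser features and the Python features for the same clip; that interpretation, the function names, and the nested-list input format are assumptions, while the two tolerances are the ones quoted above.

```python
LOG_MEL_TOL = 3.8e-4  # tolerance quoted above for log-mel features
MFCC_TOL = 7e-3       # tolerance quoted above for MFCC features


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2-D feature matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature matrices must have the same shape")
    return max(
        (abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)),
        default=0.0,
    )


def features_match(browser_feats, python_feats, tol):
    """True if the two feature matrices agree to within tol everywhere."""
    return max_abs_diff(browser_feats, python_feats) <= tol
```

Running such a check over a fixed set of clips for both feature types would catch drift between the two implementations, for example a changed window function or mel filterbank.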

Detector and affect

Cat vs not-cat detector: ~89 to 90% test

Affect (calm vs distress): 70.5% test

Per-cat personalization (Tier 2): 80% within-cat, validated

The detector is the gate: if it does not hear a clear cat, the app says "not sure" instead of guessing. Non-cat noise (a vacuum, a door) is rejected, not translated.
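The gating logic can be sketched like this. The function name, the 0.5 threshold, and the placeholder class name `other_class` are hypothetical; only the "not sure" fallback and the Content class come from this page.

```python
def translate(detector_score, class_probs, gate_threshold=0.5):
    """Return the top dictionary class, or "not sure" if the detector gate fails.

    detector_score: cat-vs-not-cat probability in [0, 1]
    class_probs:    mapping of dictionary class name to probability
    gate_threshold: hypothetical cutoff, not the app's actual value
    """
    if detector_score < gate_threshold:
        # The detector did not hear a clear cat, so the dictionary is never consulted.
        return "not sure"
    return max(class_probs, key=class_probs.get)


probs = {"Content": 0.9, "other_class": 0.1}
print(translate(0.2, probs))  # gate fails
print(translate(0.8, probs))  # gate passes, top class is returned
```

Checking the detector first means a vacuum or a door slam never reaches the 8-class dictionary, which would otherwise be forced to pick one of its classes.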

Honest limits

The full Cat Sound V2 database (about 3,000 clips) is gated by its authors and is not openly downloadable, so 124 clips is today's ceiling. With the full set, every class would likely pass 0.8 F1 and we could re-admit Paining and split Happy from Resting.

This base model is universal (works day one on any cat). The real magic is Tier-2: it learns YOUR cat's own meows, which pushes accuracy to ~80% for that cat. The base is the cold start, the personalization is the moat.

Confidence is always shown as a band (clean / most-likely / possibly / hard-to-tell), never a fake percentage, because a percentage on this data would not be honest.
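One way to turn a model probability into a band is a simple cutoff table. This is a sketch only: the band names come from the text above, but the cutoffs and the function name are hypothetical and are not the app's actual values.

```python
# (minimum top-class probability, band name); cutoffs are illustrative assumptions.
BANDS = [
    (0.85, "clean"),
    (0.65, "most-likely"),
    (0.45, "possibly"),
]


def confidence_band(top_prob):
    """Map the top-class probability to a display band instead of a percentage."""
    if not 0.0 <= top_prob <= 1.0:
        raise ValueError("top_prob must be a probability in [0, 1]")
    for cutoff, name in BANDS:
        if top_prob >= cutoff:
            return name
    return "hard-to-tell"
```

Coarse bands hide differences that a model trained on 124 clips cannot reliably resolve, which a two-digit percentage would overstate.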
