Auscult

Accuracy benchmark · v1

Auscult vs the human ear.

We asked 7 ASE-certified mechanics to diagnose 200 real engine recordings by ear. Then we ran the same clips through Auscult's on-device Core ML model. The mechanics got a median of 73% right. Auscult got 81%.

81%

Auscult top-1 accuracy across the 200-clip test set.

73%

Median top-1 accuracy across 7 ASE-certified mechanics listening blind.

+8pp

Auscult beats the human median — same clips, same taxonomy, same evaluator.

Methodology

Same clips. Same taxonomy. No retraining.

01

Dataset frozen before scoring.

200 recordings: 60% from our workshop partnership program (real faults recorded on iPhone at the bay), 30% from YouTube mechanic-channel clips with a labelled root cause, 10% from AudioSet + DCASE2024 distractors. Every clip is 16 kHz mono, 10-20 seconds, scrubbed of PII. Any clip that appeared in Auscult's training data was excluded from this set.
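As an illustration of the audio normalisation step, the sketch below resamples a clip to 16 kHz mono and flags anything outside the 10-20 second window. The paths, filenames and the warning behaviour are assumptions for the example, not the actual dataset-prep pipeline.

```python
# Illustrative only: normalise a raw recording to the 16 kHz mono format
# used in the test set. Paths and the duration check are assumptions.
import librosa
import soundfile as sf

def normalise_clip(src_path: str, dst_path: str, sr: int = 16_000) -> float:
    """Resample to 16 kHz mono and return the clip duration in seconds."""
    audio, _ = librosa.load(src_path, sr=sr, mono=True)  # resample + downmix
    duration = len(audio) / sr
    if not 10.0 <= duration <= 20.0:
        print(f"warning: {src_path} is {duration:.1f}s, outside the 10-20s window")
    sf.write(dst_path, audio, sr)
    return duration

normalise_clip("raw/bay_recording_17.wav", "clips/clip_017.wav")
```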

02

Auscult run on-device.

The latest champion model from the MLflow registry, running inference through the Core ML binding that ships in the iOS app. Top-1 prediction + confidence recorded for every clip. No fine-tuning on this set.
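The benchmark numbers come from the app's own Core ML binding running on-device. For auditing off-device, something like the coremltools sketch below can replay the same exported model; the model filename, input name and output keys here are assumptions, not Auscult's actual interface.

```python
# Illustrative off-device replay of the champion model with coremltools
# (predict() requires macOS). All names below are assumptions.
import coremltools as ct
import numpy as np
import soundfile as sf

model = ct.models.MLModel("AuscultChampion.mlpackage")  # exported champion model

audio, sr = sf.read("clips/clip_042.wav", dtype="float32")  # 16 kHz mono test clip
out = model.predict({"audio": np.asarray(audio)})           # hypothetical input name

probs = out["labelProbabilities"]                           # hypothetical output key
top1 = max(probs, key=probs.get)
print(f"top-1: {top1}  confidence: {probs[top1]:.2f}")
```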

03

Seven mechanics, blind.

Each mechanic received the 200 clips in randomised order via a private web viewer and picked one fault family from a fixed 10-label taxonomy (9 fault classes + healthy). No retries, no backtracking, 24-hour window. €100 gift card + named attribution on the report.
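For reference, a per-rater randomised order can be generated deterministically so the exact sequences remain auditable afterwards. The seed scheme, clip IDs and rater IDs below are illustrative, not the portal's actual code.

```python
# Illustrative: deterministic per-rater shuffle of the frozen 200-clip set.
import random

CLIP_IDS = [f"clip_{i:03d}" for i in range(200)]   # frozen test set
RATERS = [f"P{i}" for i in range(1, 8)]            # P1-P7, as anonymised in the report

def order_for(rater_id: str) -> list[str]:
    rng = random.Random(f"auscult-v1-{rater_id}")  # seed fixed per rater
    order = CLIP_IDS.copy()
    rng.shuffle(order)
    return order

assignments = {rater: order_for(rater) for rater in RATERS}
```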

04

Metrics published.

Top-1 accuracy, top-3 accuracy, full 10×10 confusion matrix with per-class precision/recall/F1, confidence calibration (ECE), and pairwise Cohen's κ + Fleiss' κ across the mechanic panel. Everything in the report PDF.
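As a sketch of how those numbers can be recomputed from the released per-clip files: the file and column names below are assumptions about the release format, and top-3 accuracy is omitted because it needs the full per-class score vectors rather than just the top-1 prediction.

```python
# Illustrative recomputation of the published metrics from the released CSVs.
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report,
                             cohen_kappa_score, confusion_matrix)

preds = pd.read_csv("auscult_predictions.csv")  # clip_id, true_label, pred_label, confidence
raters = pd.read_csv("rater_scores.csv")        # clip_id, rater_id (P1-P7), pred_label

# Top-1 accuracy, per-class precision/recall/F1 and the confusion matrix.
print("top-1:", accuracy_score(preds.true_label, preds.pred_label))
print(classification_report(preds.true_label, preds.pred_label))
print(confusion_matrix(preds.true_label, preds.pred_label))

# Expected calibration error (ECE) with 10 equal-width confidence bins.
bins = np.linspace(0.0, 1.0, 11)
correct = (preds.pred_label == preds.true_label).to_numpy()
conf = preds.confidence.to_numpy()
ece = 0.0
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (conf > lo) & (conf <= hi)
    if in_bin.any():
        ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
print("ECE:", ece)

# Inter-rater agreement: pairwise Cohen's kappa and Fleiss' kappa.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
wide = raters.pivot(index="clip_id", columns="rater_id", values="pred_label")
for a in wide.columns:
    for b in wide.columns:
        if a < b:
            print(f"kappa({a}, {b}):", cohen_kappa_score(wide[a], wide[b]))
counts, _ = aggregate_raters(wide.to_numpy())
print("Fleiss kappa:", fleiss_kappa(counts))
```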

Dataset composition

200 clips, 10 classes, one distribution.

Composition was frozen at dataset v1 lock. Class ratios were chosen to match the complaint distribution our workshop partners actually see, not a flat 20-each split; a small verification sketch follows the table.
Fault family           Clips   Share
Wheel bearing             40   20.0%
Serpentine belt           30   15.0%
Engine knock              25   12.5%
Cylinder misfire          25   12.5%
CV joint                  20   10.0%
Suspension / strut        20   10.0%
Brake pad grinding        15    7.5%
Turbo whistle             15    7.5%
Healthy (distractor)      10    5.0%
Total                    200    100%
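A quick way to keep that lock honest is to hold the composition as data and re-derive the total and shares. The counts below are copied from the table above; the script itself is just an illustrative check.

```python
# The frozen v1 composition, with total and shares re-derived as a lock check.
composition = {
    "Wheel bearing": 40, "Serpentine belt": 30, "Engine knock": 25,
    "Cylinder misfire": 25, "CV joint": 20, "Suspension / strut": 20,
    "Brake pad grinding": 15, "Turbo whistle": 15, "Healthy (distractor)": 10,
}
total = sum(composition.values())
assert total == 200, "composition drifted from the v1 lock"
for label, n in composition.items():
    print(f"{label:22s} {n:3d}  {100 * n / total:5.1f}%")
```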

Reproducibility

Inspect every clip, every call.

The dataset, the protocol and the raw scoring are public. Anyone can re-run the mechanic panel against different raters, replay Auscult on their own hardware, or challenge any of the labels. We made this deliberately auditable so the headline number carries weight; a minimal replay sketch follows the list below.

  • 200-clip dataset with ground-truth labels (CC BY 4.0)
  • Protocol + rater portal source code
  • Per-clip Auscult prediction + confidence scores
  • Per-rater per-clip scoring (anonymised as P1–P7)
  • Analysis notebook producing every chart in the report
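As one concrete replay, the sketch below recomputes the two headline numbers from the per-clip prediction file and the per-rater scoring file. The file names and column names are assumptions about the public release format, not a guaranteed schema.

```python
# Illustrative: recompute the headline accuracies from the released artifacts.
import pandas as pd

preds = pd.read_csv("auscult_predictions.csv")  # clip_id, true_label, pred_label, confidence
raters = pd.read_csv("rater_scores.csv")        # clip_id, rater_id (P1-P7), pred_label

# Auscult top-1 accuracy over the 200 clips.
auscult_acc = (preds.pred_label == preds.true_label).mean()

# Per-mechanic accuracy, then the median across the panel.
merged = raters.merge(preds[["clip_id", "true_label"]], on="clip_id")
per_rater = (merged.pred_label == merged.true_label).groupby(merged.rater_id).mean()

print(f"Auscult top-1:   {auscult_acc:.1%}")
print(f"Mechanic median: {per_rater.median():.1%}")
print(f"Gap:             {(auscult_acc - per_rater.median()) * 100:+.0f}pp")
```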

Disclosure

Mechanics on the panel were recruited from Auscult workshop partners. To surface any selection bias, a second blind panel recruited from r/MechanicAdvice (7 raters, no prior relationship with Auscult) scored a subset of 50 clips. Results were within 2 percentage points of the partner panel. Methodology and raw scoring for both panels are in the report.