Auscult

Accuracy benchmark · v1

Auscult vs the human ear.

We asked 7 ASE-certified mechanics to diagnose 200 real engine recordings by ear. Then we ran the same clips through Auscult's on-device Core ML model. The mechanics got a median of 73% right. Auscult got 81%.

81%

Auscult top-1 accuracy across the 200-clip test set.

73%

Median top-1 accuracy across 7 ASE-certified mechanics listening blind.

+8pp

Auscult beats the human median — same clips, same taxonomy, same evaluator.

Methodology

Same clips. Same taxonomy. No retraining.

01

Dataset frozen before scoring.

200 recordings: 60% from our workshop partnership program (real faults recorded on iPhone at the bay), 30% from YouTube mechanic-channel clips with a labelled root cause, 10% from AudioSet + DCASE2024 distractors. Every clip is 16 kHz mono, 10-20 seconds, scrubbed of PII. Any clip that appeared in Auscult's training data was excluded from this set.
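As an illustration of the audio normalisation step, the sketch below resamples a clip to 16 kHz mono and flags anything outside the 10-20 second window. The paths, filenames and the warning behaviour are assumptions for the example, not the actual dataset-prep pipeline.

```python
# Illustrative only: normalise a raw recording to the 16 kHz mono format
# used in the test set. Paths and the duration check are assumptions.
import librosa
import soundfile as sf

def normalise_clip(src_path: str, dst_path: str, sr: int = 16_000) -> float:
    """Resample to 16 kHz mono and return the clip duration in seconds."""
    audio, _ = librosa.load(src_path, sr=sr, mono=True)  # resample + downmix
    duration = len(audio) / sr
    if not 10.0 <= duration <= 20.0:
        print(f"warning: {src_path} is {duration:.1f}s, outside the 10-20s window")
    sf.write(dst_path, audio, sr)
    return duration

normalise_clip("raw/bay_recording_17.wav", "clips/clip_017.wav")
```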

02

Auscult run on-device.

The latest champion model from the MLflow registry, running inference through the Core ML binding that ships in the iOS app. Top-1 prediction + confidence recorded for every clip. No fine-tuning on this set.
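The benchmark numbers come from the app's own Core ML binding running on-device. For auditing off-device, something like the coremltools sketch below can replay the same exported model; the model filename, input name and output keys here are assumptions, not Auscult's actual interface.

```python
# Illustrative off-device replay of the champion model with coremltools
# (predict() requires macOS). All names below are assumptions.
import coremltools as ct
import numpy as np
import soundfile as sf

model = ct.models.MLModel("AuscultChampion.mlpackage")  # exported champion model

audio, sr = sf.read("clips/clip_042.wav", dtype="float32")  # 16 kHz mono test clip
out = model.predict({"audio": np.asarray(audio)})           # hypothetical input name

probs = out["labelProbabilities"]                           # hypothetical output key
top1 = max(probs, key=probs.get)
print(f"top-1: {top1}  confidence: {probs[top1]:.2f}")
```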

03

Seven mechanics, blind.

Each mechanic received the 200 clips in randomised order via a private web viewer and picked one fault family from a fixed 10-label taxonomy (9 fault classes + healthy). No retries, no backtracking, 24-hour window. €100 gift card + named attribution on the report.
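For reference, a per-rater randomised order can be generated deterministically so the exact sequences remain auditable afterwards. The seed scheme, clip IDs and rater IDs below are illustrative, not the portal's actual code.

```python
# Illustrative: deterministic per-rater shuffle of the frozen 200-clip set.
import random

CLIP_IDS = [f"clip_{i:03d}" for i in range(200)]   # frozen test set
RATERS = [f"P{i}" for i in range(1, 8)]            # P1-P7, as anonymised in the report

def order_for(rater_id: str) -> list[str]:
    rng = random.Random(f"auscult-v1-{rater_id}")  # seed fixed per rater
    order = CLIP_IDS.copy()
    rng.shuffle(order)
    return order

assignments = {rater: order_for(rater) for rater in RATERS}
```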

04

Metrics published.

Top-1 accuracy, top-3 accuracy, full 10×10 confusion matrix with per-class precision/recall/F1, confidence calibration (ECE), and pairwise Cohen's κ + Fleiss' κ across the mechanic panel. Everything in the report PDF.
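As a sketch of how those numbers can be recomputed from the released per-clip files: the file and column names below are assumptions about the release format, and top-3 accuracy is omitted because it needs the full per-class score vectors rather than just the top-1 prediction.

```python
# Illustrative recomputation of the published metrics from the released CSVs.
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report,
                             cohen_kappa_score, confusion_matrix)

preds = pd.read_csv("auscult_predictions.csv")  # clip_id, true_label, pred_label, confidence
raters = pd.read_csv("rater_scores.csv")        # clip_id, rater_id (P1-P7), pred_label

# Top-1 accuracy, per-class precision/recall/F1 and the confusion matrix.
print("top-1:", accuracy_score(preds.true_label, preds.pred_label))
print(classification_report(preds.true_label, preds.pred_label))
print(confusion_matrix(preds.true_label, preds.pred_label))

# Expected calibration error (ECE) with 10 equal-width confidence bins.
bins = np.linspace(0.0, 1.0, 11)
correct = (preds.pred_label == preds.true_label).to_numpy()
conf = preds.confidence.to_numpy()
ece = 0.0
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (conf > lo) & (conf <= hi)
    if in_bin.any():
        ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
print("ECE:", ece)

# Inter-rater agreement: pairwise Cohen's kappa and Fleiss' kappa.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
wide = raters.pivot(index="clip_id", columns="rater_id", values="pred_label")
for a in wide.columns:
    for b in wide.columns:
        if a < b:
            print(f"kappa({a}, {b}):", cohen_kappa_score(wide[a], wide[b]))
counts, _ = aggregate_raters(wide.to_numpy())
print("Fleiss kappa:", fleiss_kappa(counts))
```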

Dataset composition

200 clips, 10 classes, one distribution.

Composition was frozen at dataset v1 lock. Class ratios were chosen to match the complaint distribution our workshop partners actually see, not a flat 20-each split; a small verification sketch follows the table.
Fault family           Clips   Share
Wheel bearing             40   20.0%
Serpentine belt           30   15.0%
Engine knock              25   12.5%
Cylinder misfire          25   12.5%
CV joint                  20   10.0%
Suspension / strut        20   10.0%
Brake pad grinding        15    7.5%
Turbo whistle             15    7.5%
Healthy (distractor)      10    5.0%
Total                    200    100%
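A quick way to keep that lock honest is to hold the composition as data and re-derive the total and shares. The counts below are copied from the table above; the script itself is just an illustrative check.

```python
# The frozen v1 composition, with total and shares re-derived as a lock check.
composition = {
    "Wheel bearing": 40, "Serpentine belt": 30, "Engine knock": 25,
    "Cylinder misfire": 25, "CV joint": 20, "Suspension / strut": 20,
    "Brake pad grinding": 15, "Turbo whistle": 15, "Healthy (distractor)": 10,
}
total = sum(composition.values())
assert total == 200, "composition drifted from the v1 lock"
for label, n in composition.items():
    print(f"{label:22s} {n:3d}  {100 * n / total:5.1f}%")
```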

Reproducibility

Inspect every clip, every call.

The dataset, the protocol and the raw scoring are public. Anyone can re-run the mechanic panel against different raters, replay Auscult on their own hardware, or challenge any of the labels. We made this deliberately auditable so the headline number carries weight; a minimal replay sketch follows the list below.

  • 200-clip dataset with ground-truth labels (CC BY 4.0)
  • Protocol + rater portal source code
  • Per-clip Auscult prediction + confidence scores
  • Per-rater per-clip scoring (anonymised as P1–P7)
  • Analysis notebook producing every chart in the report
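As one concrete replay, the sketch below recomputes the two headline numbers from the per-clip prediction file and the per-rater scoring file. The file names and column names are assumptions about the public release format, not a guaranteed schema.

```python
# Illustrative: recompute the headline accuracies from the released artifacts.
import pandas as pd

preds = pd.read_csv("auscult_predictions.csv")  # clip_id, true_label, pred_label, confidence
raters = pd.read_csv("rater_scores.csv")        # clip_id, rater_id (P1-P7), pred_label

# Auscult top-1 accuracy over the 200 clips.
auscult_acc = (preds.pred_label == preds.true_label).mean()

# Per-mechanic accuracy, then the median across the panel.
merged = raters.merge(preds[["clip_id", "true_label"]], on="clip_id")
per_rater = (merged.pred_label == merged.true_label).groupby(merged.rater_id).mean()

print(f"Auscult top-1:   {auscult_acc:.1%}")
print(f"Mechanic median: {per_rater.median():.1%}")
print(f"Gap:             {(auscult_acc - per_rater.median()) * 100:+.0f}pp")
```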

Disclosure

Mechanics on the panel were recruited from Auscult workshop partners. To surface any selection bias, a second blind panel recruited from r/MechanicAdvice (7 raters, no prior relationship with Auscult) scored a subset of 50 clips. Results were within 2 percentage points of the partner panel. Methodology and raw scoring for both panels are in the report.