Auscult

Accuracy benchmark · v1

Auscult vs. the human ear.

We asked 7 certified mechanics to diagnose 200 real faults from audio alone. We then ran the same clips through Auscult's on-device Core ML model. The mechanics' median score was 73%. Auscult got 81% right.

81%

Auscult top-1 accuracy across the 200-clip test set.

73%

Median score across 7 ASE-certified mechanics listening blind.

+8pp

Auscult beats the human median — same clips, same taxonomy, same evaluator.

Methodology

Same clips. Same taxonomy. No retraining.

01

Dataset frozen before scoring.

200 recordings — 60% from our workshop partnership program (real faults recorded on iPhone at the bay), 30% from YouTube mechanic-channel clips with labelled root cause, 10% from AudioSet + DCASE2024 distractors. Every clip 16 kHz mono, 10-20 seconds, scrubbed of PII. If a clip was in training, it's excluded from this set.
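The clip spec above (16 kHz mono, 10–20 seconds) can be expressed as a small validation step. This is a sketch, not the actual ingestion pipeline; the function names and thresholds below are illustrative.

```python
import numpy as np

TARGET_SR = 16_000          # benchmark clips are 16 kHz mono
MIN_S, MAX_S = 10.0, 20.0   # allowed clip duration, in seconds

def to_mono(samples: np.ndarray) -> np.ndarray:
    """Downmix (n_samples, n_channels) audio to mono; pass mono through."""
    if samples.ndim == 2:
        return samples.mean(axis=1)
    return samples

def validate_clip(samples: np.ndarray, sr: int) -> np.ndarray:
    """Return a mono clip, or raise if it violates the dataset spec."""
    if sr != TARGET_SR:
        raise ValueError(f"expected {TARGET_SR} Hz, got {sr}")
    mono = to_mono(samples)
    duration = len(mono) / sr
    if not (MIN_S <= duration <= MAX_S):
        raise ValueError(f"duration {duration:.1f}s outside [{MIN_S}, {MAX_S}]")
    return mono
```

A clip that passed training-set deduplication and this check would enter the frozen v1 set.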

02

Auscult run on-device.

Latest champion model from the MLflow registry, run through the Core ML binding that ships in the iOS app. Top-1 prediction + confidence recorded for every clip. No fine-tuning on this set.
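Recording "top-1 prediction + confidence" per clip amounts to ranking the model's per-class scores. A model-agnostic sketch (the label names and score values here are toy inputs, not Auscult's actual taxonomy order or outputs):

```python
def top_k(scores: dict[str, float], k: int = 1) -> list[tuple[str, float]]:
    """Return the k highest-confidence (label, score) pairs.

    `scores` maps each taxonomy label to the model's confidence for it;
    top_k(scores, 1)[0] is the (top-1 label, confidence) logged per clip.
    """
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

The same ranking feeds both the headline top-1 number and the top-3 metric in the report.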

03

Seven mechanics, blind.

Each mechanic receives the 200 clips in randomised order via a private web viewer. They pick one fault family from a fixed 10-label taxonomy (9 fault classes + healthy). No retries, no backtracking, 24-hour window. €100 gift card + named attribution on the report.
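A per-rater randomised order like the one described is easiest to audit when it is deterministic: seed the shuffle with the rater ID so every order can be replayed later. A sketch, assuming hypothetical rater IDs like "P1":

```python
import random

def rater_order(clip_ids: list[str], rater_id: str, seed: int = 2024) -> list[str]:
    """Deterministic per-rater shuffle: the same rater always sees the same
    order, and different raters see different orders."""
    rng = random.Random(f"{seed}:{rater_id}")  # seed mixes run seed + rater ID
    order = clip_ids.copy()
    rng.shuffle(order)
    return order
```

This keeps the "no backtracking" constraint checkable after the fact: the portal logs can be compared against the replayed order.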

04

Metrics published.

Top-1 accuracy, top-3 accuracy, full 10×10 confusion matrix with per-class precision/recall/F1, confidence calibration (ECE), and pairwise Cohen's κ + Fleiss' κ across the mechanic panel. Everything in the report PDF.
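Several of the published metrics are short enough to sketch directly. Below are minimal reference implementations of top-1/top-k accuracy, ECE, and pairwise Cohen's κ; these are illustrative sketches, not the report's analysis notebook (Fleiss' κ over the full panel is omitted for brevity).

```python
import numpy as np

def top1_accuracy(preds: list[list[str]], truth: list[str]) -> float:
    """preds[i] is a ranked label list for clip i; truth[i] its ground truth."""
    return float(np.mean([p[0] == t for p, t in zip(preds, truth)]))

def topk_accuracy(preds: list[list[str]], truth: list[str], k: int = 3) -> float:
    return float(np.mean([t in p[:k] for p, t in zip(preds, truth)]))

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: population-weighted mean |accuracy - confidence| over
    equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # first bin is closed on the left so confidence 0.0 is not dropped
        mask = (confidences >= lo if i == 0 else confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

def cohen_kappa(a, b) -> float:
    """Agreement between two raters, corrected for chance."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)                       # observed agreement
    labels = np.union1d(a, b)
    p_e = sum(np.mean(a == l) * np.mean(b == l) for l in labels)  # chance
    return float((p_o - p_e) / (1 - p_e)) if p_e < 1 else 1.0
```

Running `cohen_kappa` over all 21 mechanic pairs gives the pairwise agreement figures the report summarises.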

Dataset composition

200 clips, 10 classes, a single distribution.

Composition was frozen at dataset v1 lock. Class ratios chosen to match the complaint distribution our workshop partners actually see, not a flat 20-each split.
Fault family            Clips    Share
Wheel bearing              40    20.0%
Serpentine belt            30    15.0%
Engine knock               25    12.5%
Cylinder misfire           25    12.5%
CV joint                   20    10.0%
Suspension / strut         20    10.0%
Brake pad grinding         15     7.5%
Turbo whistle              15     7.5%
Healthy (distractor)       10     5.0%
Total                     200   100.0%
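Because the composition is frozen, it can be encoded as data and sanity-checked against the published totals. A minimal sketch using the counts from the table above:

```python
COMPOSITION = {  # dataset v1 class counts, frozen before scoring
    "Wheel bearing": 40,
    "Serpentine belt": 30,
    "Engine knock": 25,
    "Cylinder misfire": 25,
    "CV joint": 20,
    "Suspension / strut": 20,
    "Brake pad grinding": 15,
    "Turbo whistle": 15,
    "Healthy (distractor)": 10,
}

def shares(composition: dict[str, int]) -> dict[str, float]:
    """Per-class share of the set, in percent."""
    total = sum(composition.values())
    return {label: 100 * count / total for label, count in composition.items()}
```

Any relabelling or clip challenge that changes a count would fail this check against the locked v1 totals.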

Reproducibility

Inspect every clip, every prediction.

The dataset, the protocol and the raw scoring are public. Anyone can re-run the mechanic panel against different raters, replay Auscult on their own hardware, or challenge any of the labels. We made this deliberately auditable so the headline number carries weight.

  • 200-clip dataset with ground-truth labels (CC BY 4.0)
  • Protocol + rater portal source code
  • Per-clip Auscult prediction + confidence scores
  • Per-rater per-clip scoring (anonymised as P1–P7)
  • Analysis notebook producing every chart in the report
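With the per-clip predictions and ground-truth labels public, the headline 81% can be recomputed independently. A sketch, assuming a CSV export with hypothetical column names (`clip_id`, `true_label`, `auscult_top1`) — the published artifact's actual schema may differ:

```python
import csv
import io

def top1_from_csv(csv_text: str) -> float:
    """Recompute top-1 accuracy from a per-clip scoring CSV.

    Column names (clip_id, true_label, auscult_top1) are assumed here,
    not taken from the published artifact.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    hits = sum(r["auscult_top1"] == r["true_label"] for r in rows)
    return hits / len(rows)
```

The same few lines, pointed at the per-rater scoring files, reproduce each mechanic's score and the 73% median.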

Disclosure

Mechanics on the panel were recruited from Auscult workshop partners. To surface any selection bias, a second blind panel recruited from r/MechanicAdvice (7 raters, no prior relationship with Auscult) scored a subset of 50 clips. Results were within 2 percentage points of the partner panel. Methodology and raw scoring for both panels are in the report.