Accuracy benchmark · v1
Auscult versus the human ear.
We asked 7 certified mechanics to diagnose 200 real faults from audio clips. Then we ran the same clips through Auscult's on-device Core ML model. Mechanic median: 73%. Auscult: 81%.
81%
Auscult top-1 accuracy across the 200-clip test set.
73%
Median score across 7 ASE-certified mechanics listening blind.
+8pp
Auscult beats the human median: same clips, same taxonomy, same scoring.
Methodology
Same clips. Same taxonomy. No retraining.
Dataset frozen before scoring.
200 recordings — 60% from our workshop partnership program (real faults recorded on iPhone at the bay), 30% from YouTube mechanic-channel clips with labelled root cause, 10% from AudioSet + DCASE2024 distractors. Every clip 16 kHz mono, 10-20 seconds, scrubbed of PII. If a clip was in training, it's excluded from this set.
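The audio normalisation is easy to check locally. Below is a minimal Python sketch, assuming a local clips/ directory of WAV files (a hypothetical layout, not necessarily how the dataset ships), that flags any clip that is not 16 kHz mono and 10 to 20 seconds long.

```python
# Sketch: flag clips that are not 16 kHz mono, 10-20 s long.
# Assumes a local clips/ directory of WAV files (hypothetical layout).
from pathlib import Path

import soundfile as sf

for path in sorted(Path("clips").glob("*.wav")):
    info = sf.info(path)
    if info.samplerate != 16_000 or info.channels != 1 or not 10 <= info.duration <= 20:
        print(f"{path.name}: {info.samplerate} Hz, {info.channels} ch, {info.duration:.1f} s")
```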
Auscult run on-device.
Latest champion model from the MLflow registry, run via the Core ML binding that ships in the iOS app. Top-1 prediction + confidence recorded for every clip. No fine-tuning on this set.
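The published numbers come from that in-app binding; the sketch below only illustrates how someone could replay the same exported Core ML model off-device with coremltools. The model filename, the input/output feature names and the raw-waveform preprocessing are all assumptions, not the app's actual pipeline.

```python
# Sketch: replay an exported Core ML classifier over a benchmark clip off-device.
# Filenames, feature names and preprocessing are assumptions for illustration.
import coremltools as ct
import librosa

model = ct.models.MLModel("auscult_champion.mlmodel")  # hypothetical export

def predict(clip_path: str) -> tuple[str, float]:
    waveform, _ = librosa.load(clip_path, sr=16_000, mono=True)  # 16 kHz mono
    out = model.predict({"audio": waveform})       # assumed input feature name
    probs = out["classLabelProbs"]                 # assumed classifier output
    label = max(probs, key=probs.get)
    return label, probs[label]

label, confidence = predict("clips/clip_001.wav")  # hypothetical clip name
print(label, f"{confidence:.2f}")
```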
Seven mechanics, blind.
Each receives the 200 clips in randomised order via a private web viewer. They pick one fault family from a fixed 10-label taxonomy (9 fault classes + healthy). No retries, no backtracking, 24-hour window. €100 gift card + named attribution on the report.
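For what it's worth, the per-rater randomisation is trivial to reproduce deterministically; a minimal sketch, with hypothetical clip IDs and an illustrative seeding scheme:

```python
# Sketch: an independent, deterministic clip ordering per rater.
# Clip IDs and the seeding scheme are illustrative assumptions.
import random

clip_ids = [f"clip_{i:03d}" for i in range(1, 201)]

def order_for(rater: str) -> list[str]:
    order = clip_ids.copy()
    random.Random(f"auscult-v1-{rater}").shuffle(order)  # per-rater seed
    return order

orders = {f"P{i}": order_for(f"P{i}") for i in range(1, 8)}
```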
Metrics published.
Top-1 accuracy, top-3 accuracy, full 10×10 confusion matrix with per-class precision/recall/F1, confidence calibration (ECE), and pairwise Cohen's κ + Fleiss' κ across the mechanic panel. Everything in the report PDF.
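None of these require custom tooling. A minimal sketch of the less common ones (calibration error and inter-rater agreement), assuming the released CSVs follow the hypothetical file and column layout shown in the comments:

```python
# Sketch: calibration (ECE) and inter-rater agreement from the released files.
# File names and column names are assumptions about the artifact layout.
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

pred = pd.read_csv("auscult_predictions.csv")   # clip_id, ground_truth, top1_label, confidence
y_true = pred["ground_truth"].to_numpy()
y_pred = pred["top1_label"].to_numpy()
conf = pred["confidence"].to_numpy()

cm = confusion_matrix(y_true, y_pred)           # 10x10 confusion matrix

# Expected calibration error over 10 equal-width confidence bins.
edges = np.linspace(0.0, 1.0, 11)
ece = 0.0
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (conf > lo) & (conf <= hi)
    if in_bin.any():
        bin_acc = (y_pred[in_bin] == y_true[in_bin]).mean()
        ece += in_bin.mean() * abs(bin_acc - conf[in_bin].mean())

# Agreement across the mechanic panel: pairwise Cohen's kappa and Fleiss' kappa.
panel = (pd.read_csv("mechanic_scoring.csv")    # clip_id, rater, label
           .pivot(index="clip_id", columns="rater", values="label")
           .to_numpy())
pairwise = [cohen_kappa_score(panel[:, i], panel[:, j])
            for i in range(panel.shape[1]) for j in range(i + 1, panel.shape[1])]
fleiss = fleiss_kappa(aggregate_raters(panel)[0])

print(f"ECE {ece:.3f}  mean pairwise kappa {np.mean(pairwise):.2f}  Fleiss {fleiss:.2f}")
```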
Dataset composition
200 clips, 10 classes, a single distribution.
| Fault family | Clips | Share |
|---|---|---|
| Wheel bearing | 40 | 20.0% |
| Serpentine belt | 30 | 15.0% |
| Engine knock | 25 | 12.5% |
| Cylinder misfire | 25 | 12.5% |
| CV joint | 20 | 10.0% |
| Suspension / strut | 20 | 10.0% |
| Brake pad grinding | 15 | 7.5% |
| Turbo whistle | 15 | 7.5% |
| Healthy (distractor) | 10 | 5.0% |
| Total | 200 | 100% |
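For anyone pulling the released dataset, here is a quick sanity check of a local copy against this table; the manifest.csv filename, the label column and the label strings are assumptions about how the artifact is laid out:

```python
# Sketch: check a local copy of the dataset against the published class counts.
# The manifest filename, column and label strings are illustrative assumptions.
import pandas as pd

expected = {
    "wheel_bearing": 40, "serpentine_belt": 30, "engine_knock": 25,
    "cylinder_misfire": 25, "cv_joint": 20, "suspension_strut": 20,
    "brake_pad_grinding": 15, "turbo_whistle": 15, "healthy": 10,
}
counts = pd.read_csv("manifest.csv")["label"].value_counts().to_dict()
assert counts == expected, f"class counts differ: {counts}"
```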
Reproducibility
Inspect every clip, every prediction.
The dataset, the protocol and the raw scoring are public. Anyone can re-run the mechanic panel against different raters, replay Auscult on their own hardware, or challenge any of the labels. We made this deliberately auditable so the headline number carries weight.
- 200-clip dataset with ground-truth labels (CC BY 4.0)
- Protocol + rater portal source code
- Per-clip Auscult prediction + confidence scores
- Per-rater per-clip scoring (anonymised as P1–P7)
- Analysis notebook producing every chart in the report
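As a starting point, the two headline numbers fall out of the per-clip and per-rater files directly; a minimal sketch, again assuming hypothetical file and column names for the released CSVs:

```python
# Sketch: recompute the 81% / 73% headline numbers from the released files.
# File and column names are assumptions about the artifact layout.
import pandas as pd

pred = pd.read_csv("auscult_predictions.csv")    # clip_id, ground_truth, top1_label
panel = pd.read_csv("mechanic_scoring.csv")      # clip_id, rater, label, ground_truth

auscult_top1 = (pred["top1_label"] == pred["ground_truth"]).mean()

per_rater = (panel.assign(correct=panel["label"] == panel["ground_truth"])
                  .groupby("rater")["correct"].mean())

print(f"Auscult top-1: {auscult_top1:.1%}")
print(f"Mechanic median: {per_rater.median():.1%}")
print(per_rater.sort_values(ascending=False).round(3))
```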
Disclosure
Mechanics on the panel were recruited from Auscult workshop partners. To surface any selection bias, a second blind panel recruited from r/MechanicAdvice (7 raters, no prior relationship with Auscult) scored a subset of 50 clips. Results were within 2 percentage points of the partner panel. Methodology and raw scoring for both panels are in the report.