Data Quality Screening at Scale
Automated quality assessment across 1,000+ anthropometric surveys for UNICEF JME
Challenge
The UNICEF/WHO/World Bank Joint Malnutrition Estimates (JME) maintain a global database of anthropometric surveys used to track child nutrition outcomes worldwide. Before any survey enters the published estimates, it must pass quality review. But with over 1,000 surveys in the database and new rounds arriving regularly, manual review doesn’t scale.
The existing process relied on ad-hoc checks and expert judgement, making it difficult to maintain consistency across reviewers and across time. Surveys could have subtle data quality issues (digit preference in measurements, age heaping, implausible z-score distributions, or internally incoherent prevalence estimates) that are hard to catch without systematic screening. The JME team needed an automated first-pass quality assessment that could flag surveys for human review, with transparent evidence rather than a black-box score.
Approach
A3DI designed and built a multi-channel statistical screening pipeline that computes 136 quality indicators per survey across 10 dimensions, then scores each survey through 5 independent statistical channels, each answering a different question about data quality.
The indicator pipeline handles the full measurement quality surface: completeness, sex ratio deviations, age heaping, digit preference in height and weight measurements, WHO flag rates, z-score standard deviations, distribution shape (skewness, kurtosis), whole-number heaping, age-specific quality differences between boys and girls, and z-score tail behaviour.
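To make one of these indicators concrete, here is a minimal sketch of a digit-preference check: if field staff round heights or weights to favoured terminal digits (typically .0 and .5) rather than reading the scale exactly, the terminal-digit distribution departs from uniform, which a Pearson chi-square statistic picks up. The function name and thresholds are illustrative, not the pipeline's actual 136 indicators.

```python
import random
from collections import Counter

def digit_preference_chi2(measurements):
    """Pearson chi-square of terminal digits (0-9) against a uniform
    distribution. Large values suggest rounding to favoured digits
    (e.g. .0 and .5). Illustrative sketch only."""
    digits = [round(x * 10) % 10 for x in measurements]  # tenths digit
    expected = len(digits) / 10
    counts = Counter(digits)
    return sum((counts.get(d, 0) - expected) ** 2 / expected
               for d in range(10))

random.seed(0)
# Clean readings: terminal digits roughly uniform over 0-9
exact = [round(random.uniform(65, 110), 1) for _ in range(1000)]
# Heaped readings: every value rounded to the nearest 0.5
heaped = [round(random.uniform(65, 110) * 2) / 2 for _ in range(1000)]
```

On the heaped sample the statistic is orders of magnitude larger than on the clean one, which is the kind of transparent, per-indicator evidence a reviewer can inspect directly.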
Channels A–C assess process quality (how data was collected), while Channels D–E assess output quality (whether the resulting prevalence estimates are epidemiologically plausible). This separation matters: a survey can have clean measurements but implausible prevalence due to sampling problems, or vice versa.
The mixed-effects models (Channels C and E) use hierarchical nesting by WHO × World Bank region (17 groups) with random intercepts per country, so that single-survey countries borrow strength from their region while countries with long survey histories are evaluated against their own trajectory.
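The "borrowing strength" behaviour of a random-intercept model can be illustrated with the closed-form empirical-Bayes shrinkage formula: a country's estimate is a precision-weighted blend of its own survey mean and its regional mean, with the weight on the country's own data growing with its number of surveys. This is a toy illustration of the idea, not the fitted production model (variance components and nesting are assumed).

```python
def shrunken_country_mean(country_vals, region_mean, tau2, sigma2):
    """Empirical-Bayes partial pooling for a random intercept.
    tau2: between-country variance; sigma2: within-country (survey)
    variance. With one survey the estimate is pulled strongly toward
    the region; a long survey history dominates the regional prior."""
    n = len(country_vals)
    ybar = sum(country_vals) / n
    w = tau2 / (tau2 + sigma2 / n)  # weight on the country's own data
    return w * ybar + (1 - w) * region_mean
```

With equal variance components, a single survey observing 1.0 in a region centred at 0.0 is shrunk to 0.5, while twenty consistent surveys keep the estimate near 1.0.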
All fitted model parameters are persisted as Parquet files, which are language-agnostic and readable by Python, R, and Stata. New surveys can be scored in seconds against the existing reference distribution without refitting.
Result
The system processes 1,091 surveys and assigns tiered review priorities: roughly 4% are Tier 1 (flagged by 3+ channels, requiring full expert review), 9% are Tier 2 (2 channels), and 17% are Tier 3 (1 channel). The remaining 70% require no review.
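The tiering rule itself reduces to a simple mapping from the number of channels that flagged a survey to a review priority; a minimal sketch (function name and labels are illustrative):

```python
def review_tier(channels_flagging):
    """Map the count of flagging channels to a review tier:
    3+ channels -> Tier 1 (full expert review), 2 -> Tier 2,
    1 -> Tier 3, 0 -> no review required."""
    if channels_flagging >= 3:
        return "Tier 1"
    if channels_flagging == 2:
        return "Tier 2"
    if channels_flagging == 1:
        return "Tier 3"
    return "No review"
```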
Validation against 134 historically rejected surveys (excluded from model training and scored out-of-sample) showed that 85% of rejected surveys were flagged by at least one channel (2.8× the baseline rate), and 50% were classified as Tier 1, roughly 12× the baseline. The system captures real quality differences, not noise.
The pipeline was built in Python as a proof-of-concept and is being migrated to R for integration with the JME team's existing Stata-based production workflow. The final deliverable is a single R script callable from Stata (Rscript score_survey.R <csv_path>) that scores a new survey in seconds against pre-fitted models, with no Python dependency required on the production server.
Key Takeaway
At the scale of global monitoring, manual data quality review doesn’t scale. But automated screening only earns trust when it produces structured evidence for human reviewers, not binary accept/reject decisions. A multi-channel architecture in which each method answers a different question gives reviewers the diagnostic decomposition they need to act on the flags.