Data Quality Screening at Scale
Automated quality assessment across 1,000+ anthropometric surveys for UNICEF JME
Challenge
The UNICEF/WHO/World Bank Joint Malnutrition Estimates (JME) maintain a global database of anthropometric surveys used to track child nutrition outcomes worldwide. Before any survey enters the published estimates, it must pass quality review. But with over 1,000 surveys in the database and new rounds arriving regularly, manual review doesn’t scale.
The existing process relied on ad-hoc checks and expert judgement, making it difficult to maintain consistency across reviewers and across time. Surveys could have subtle data quality issues (digit preference in measurements, age heaping, implausible z-score distributions, or internally incoherent prevalence estimates) that are hard to catch without systematic screening. The JME team needed an automated first-pass quality assessment that could flag surveys for human review, with transparent evidence rather than a black-box score.
Approach
A3DI designed and built a multi-channel statistical screening pipeline that computes 136 quality indicators per survey across 10 dimensions, then scores each survey through 5 independent statistical channels, each answering a different question about data quality.
The indicator pipeline handles the full measurement quality surface: completeness, sex ratio deviations, age heaping, digit preference in height and weight measurements, WHO flag rates, z-score standard deviations, distribution shape (skewness, kurtosis), whole-number heaping, age-specific quality differences between boys and girls, and z-score tail behaviour.
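To make one of these indicators concrete, here is a minimal sketch of a digit-preference check: if field staff round heights or weights to favoured terminal digits (typically .0 and .5) rather than reading the scale exactly, the terminal-digit distribution departs from uniform, which a Pearson chi-square statistic picks up. The function name and thresholds are illustrative, not the pipeline's actual 136 indicators.

```python
import random
from collections import Counter

def digit_preference_chi2(measurements):
    """Pearson chi-square of terminal digits (0-9) against a uniform
    distribution. Large values suggest rounding to favoured digits
    (e.g. .0 and .5). Illustrative sketch only."""
    digits = [round(x * 10) % 10 for x in measurements]  # tenths digit
    expected = len(digits) / 10
    counts = Counter(digits)
    return sum((counts.get(d, 0) - expected) ** 2 / expected
               for d in range(10))

random.seed(0)
# Clean readings: terminal digits roughly uniform over 0-9
exact = [round(random.uniform(65, 110), 1) for _ in range(1000)]
# Heaped readings: every value rounded to the nearest 0.5
heaped = [round(random.uniform(65, 110) * 2) / 2 for _ in range(1000)]
```

On the heaped sample the statistic is orders of magnitude larger than on the clean one, which is the kind of transparent, per-indicator evidence a reviewer can inspect directly.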
Channels A–C assess process quality (how data was collected), while Channels D–E assess output quality (whether the resulting prevalence estimates are epidemiologically plausible). This separation matters: a survey can have clean measurements but implausible prevalence due to sampling problems, or vice versa.
The mixed-effects models (Channels C and E) use hierarchical nesting by WHO × World Bank region (17 groups) with random intercepts per country, so that single-survey countries borrow strength from their region while countries with long survey histories are evaluated against their own trajectory.
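The "borrowing strength" behaviour of a random-intercept model can be illustrated with the closed-form empirical-Bayes shrinkage formula: a country's estimate is a precision-weighted blend of its own survey mean and its regional mean, with the weight on the country's own data growing with its number of surveys. This is a toy illustration of the idea, not the fitted production model (variance components and nesting are assumed).

```python
def shrunken_country_mean(country_vals, region_mean, tau2, sigma2):
    """Empirical-Bayes partial pooling for a random intercept.
    tau2: between-country variance; sigma2: within-country (survey)
    variance. With one survey the estimate is pulled strongly toward
    the region; a long survey history dominates the regional prior."""
    n = len(country_vals)
    ybar = sum(country_vals) / n
    w = tau2 / (tau2 + sigma2 / n)  # weight on the country's own data
    return w * ybar + (1 - w) * region_mean
```

With equal variance components, a single survey observing 1.0 in a region centred at 0.0 is shrunk to 0.5, while twenty consistent surveys keep the estimate near 1.0.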
All fitted model parameters are persisted as Parquet files, which are language-agnostic and readable by Python, R, and Stata. New surveys can be scored in seconds against the existing reference distribution without refitting.
Result
The system processes 1,091 surveys and assigns tiered review priorities: roughly 4% are Tier 1 (flagged by 3+ channels, requiring full expert review), 9% are Tier 2 (2 channels), and 17% are Tier 3 (1 channel). The remaining 70% require no review.
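The tiering rule itself reduces to a simple mapping from the number of channels that flagged a survey to a review priority; a minimal sketch (function name and labels are illustrative):

```python
def review_tier(channels_flagging):
    """Map the count of flagging channels to a review tier:
    3+ channels -> Tier 1 (full expert review), 2 -> Tier 2,
    1 -> Tier 3, 0 -> no review required."""
    if channels_flagging >= 3:
        return "Tier 1"
    if channels_flagging == 2:
        return "Tier 2"
    if channels_flagging == 1:
        return "Tier 3"
    return "No review"
```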
Validation against 134 historically rejected surveys (excluded from model training and scored out-of-sample) showed that 85% of rejected surveys were flagged by at least one channel (2.8× the baseline rate), and 50% were classified as Tier 1, roughly 12× the baseline. The system captures real quality differences, not noise.
The pipeline was built in Python as a proof-of-concept and is being migrated to R for integration with the JME team's existing Stata-based production workflow. The final deliverable is a single R script callable from Stata (Rscript score_survey.R <csv_path>) that scores a new survey in seconds against pre-fitted models, with no Python dependency required on the production server.
Key Takeaway
At the scale of global monitoring, manual data quality review doesn’t scale. But automated screening only earns trust when it produces structured evidence for human reviewers, not binary accept/reject decisions. A multi-channel architecture in which each method answers a different question gives reviewers the diagnostic decomposition they need to act on the flags.