A founder I was working with had built a care gap identification model for a Medicaid population. The ambition was right: use a mix of claims and clinical data to surface members most likely to have unaddressed care gaps across chronic conditions. The data pipeline was pulling in everything available: medical claims, pharmacy claims, lab results, ADT feeds, social determinants screening responses, prior authorization histories, and whatever clinical data the health plan could get from its provider network.
The model was underperforming. Not catastrophically, but consistently below the threshold the health plan needed to justify operationalizing it. The team had been iterating on model architecture for two months, running feature importance analyses, trying different training windows, adjusting for population imbalance. The model kept plateauing.
On a call where we walked through the data pipeline in detail, something became clear. The clinical data coming from the provider network was inconsistent in ways that were actively hurting model performance. Lab results were present for roughly 35% of the attributed population, with significant variation in which tests were recorded, how frequently, and with what degree of completeness. Social determinants data had been collected through a screening program that covered less than 20% of members, and the screening methodology had changed twice in the prior 18 months.
The model was treating absence of data as a signal. In some cases it was. In most cases it was just absence, driven by care access patterns, documentation variability, and screening program gaps that had nothing to do with the member’s actual health status.
The team removed the inconsistent clinical data sources and rebuilt the model on claims alone, supplemented only by the specific lab tests whose coverage exceeded 80% of the population. Model performance improved meaningfully within two weeks. Not because claims data is better than clinical data in some abstract sense. Because consistent, high-coverage data outperforms broad, inconsistent data for a population-level prediction task every time.
Small datasets, correctly chosen, outperform massive messy ones. That is not a workaround. It is a principle worth building your data strategy around.
Why More Data Is the Default Assumption
The instinct to collect more data before building an AI product is understandable. More data means more signal. More signal means better models. More features mean more explanatory power. The logic sounds right.
In mature, well-governed data environments with consistent collection methodology and high population coverage, it is roughly right. In healthcare, that is almost never the environment you are actually working in.
Healthcare data is not abundant in the way that consumer or transactional data is abundant. It is voluminous but inconsistent. Claims data is structurally consistent but limited to what was billed. Clinical data is richer in theory but governed by documentation culture, EHR configuration, care access patterns, and provider network participation in ways that make coverage and completeness deeply uneven across a population. Unstructured data is clinically rich but requires significant processing to be usable, and that processing introduces its own inconsistencies.
When you pull all of it together and feed it into a model without assessing coverage, consistency, and collection methodology for each source, you are not enriching your model. You are introducing noise at a scale that is hard to detect until the model starts behaving unexpectedly in production.
The minimal data principle is the corrective. It is not an argument for using less data because less is philosophically better. It is an argument for using the right data, defined precisely, assessed rigorously, and chosen because it is consistently available, reliably collected, and directly relevant to the clinical or operational question the model is trying to answer.
The Framework: Five Steps to Identifying Your Critical Data Endpoints
Step 1: Define the clinical question with precision before touching the data.
This sounds obvious. It is consistently skipped. A care gap identification model is not a clinical question. “Which members in this Medicaid population are most likely to have an unaddressed HbA1c gap in the next 90 days, given their claims history and current medication fills” is a clinical question. The more precisely you define the question, the more clearly the required data endpoints reveal themselves, and the more clearly the irrelevant ones do too.
For each AI use case, write the clinical or operational question in one sentence. If you cannot do that, the use case is not ready for a data strategy conversation.
Step 2: Map the minimum data required to answer that question reliably.
Starting from the precisely defined question, work backward to the data endpoints that are genuinely necessary for the model to produce a useful output. Not the data that would be nice to have. Not the data that might add marginal explanatory power. The data without which the question cannot be answered at all.
For a readmission risk model, that might be prior admission history, primary diagnosis, discharge disposition, and medication reconciliation status at discharge. For a care gap identification model built on claims, it might be procedure codes, diagnosis codes, pharmacy fills, and enrollment continuity. The list should be short enough that you can write it on one page.
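One way to keep the list honest is to write it down as a literal spec, where every endpoint must carry its own rationale. A minimal sketch, with illustrative endpoint names rather than a canonical schema:

```python
# Minimum data spec for a claims-based care gap model.
# Endpoint names and rationales are illustrative, not a standard schema.
MINIMUM_ENDPOINTS = {
    "procedure_codes":   "identifies completed services relevant to the gap",
    "diagnosis_codes":   "establishes the chronic condition cohort",
    "pharmacy_fills":    "proxies medication adherence",
    "enrollment_months": "confirms continuous eligibility over the window",
}

# The burden of proof is inverted: anything NOT in this dict needs a
# written case for inclusion, not the other way around.
for endpoint, rationale in MINIMUM_ENDPOINTS.items():
    print(f"{endpoint}: {rationale}")
```

If the spec will not fit in a structure this small, the use case is probably still under-defined.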
Step 3: Assess coverage and consistency for each required endpoint.
For every data endpoint on your minimum required list, answer three questions:
- What percentage of the target population has this data available?
- How consistently is it collected across the facilities, payers, or providers in your data set?
- Has the collection methodology been stable over the training window you plan to use?
Any endpoint whose coverage falls below a threshold relevant to your use case, whose collection methodology varies significantly across your population, or whose collection changed substantially during your training window is a candidate for exclusion or special handling, regardless of its theoretical value to the model.
This is the step that the opening scenario’s team had skipped. The clinical data sources were theoretically valuable. Their coverage and consistency profile made them practically harmful.
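The three questions translate directly into a mechanical check. A minimal sketch, using the opening scenario's coverage numbers against an assumed population of 100,000 (the field names and the 80% floor are illustrative assumptions, not fixed rules):

```python
from dataclasses import dataclass

@dataclass
class EndpointProfile:
    name: str
    members_with_data: int           # members with at least one usable record
    population: int                  # attributed population size
    consistent_across_sources: bool  # same collection method across facilities/payers
    methodology_changes: int         # collection-method changes in the training window

def assess(p: EndpointProfile, min_coverage: float = 0.80) -> tuple[bool, str]:
    """Answer the three questions for one endpoint: coverage, consistency, stability."""
    coverage = p.members_with_data / p.population
    if coverage < min_coverage:
        return False, f"coverage {coverage:.0%} below {min_coverage:.0%}"
    if not p.consistent_across_sources:
        return False, "collection method varies across sources"
    if p.methodology_changes > 0:
        return False, f"{p.methodology_changes} methodology change(s) in training window"
    return True, f"coverage {coverage:.0%}, consistent and stable"

# Profiles mirror the opening scenario (illustrative numbers).
for p in [
    EndpointProfile("pharmacy_fills", 92_000, 100_000, True, 0),
    EndpointProfile("lab_results",    35_000, 100_000, False, 0),
    EndpointProfile("sdoh_screening", 19_000, 100_000, True, 2),
]:
    keep, reason = assess(p)
    print(f"{p.name}: {'keep' if keep else 'exclude'} ({reason})")
```

The point of writing it this way is that the exclusion reason is recorded alongside the decision, which matters later for governance and regulatory explainability.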
Step 4: Cut aggressively, then validate.
Remove the data endpoints that failed the coverage and consistency assessment. Rebuild or retrain the model on the reduced, validated dataset. Measure performance against the version that included the noisy sources.
This step requires intellectual honesty that is harder than it sounds. Teams that spent months acquiring a data source have a strong psychological incentive to keep it in the model even when the evidence suggests it is hurting performance. The discipline of cutting aggressively and validating the result objectively is what separates teams that build reliable models from teams that spend six months tuning a model that was never going to work because the data underneath it was never ready.
Step 5: Define the minimum coverage threshold for production deployment.
Before you move from development to production, define the minimum data coverage level at which the model is allowed to run. If your care gap model requires pharmacy fill data for reliable performance, and a new health system’s pharmacy claims feed has a 60-day lag during the first month of implementation, the model should not run on that population until the lag is resolved.
This threshold is not a technical parameter. It is a clinical governance decision. It should be made explicitly, documented, and enforced operationally. Models that run on data that falls below their validated coverage threshold produce outputs that cannot be trusted, and clinical staff who receive untrustworthy outputs once tend not to trust the model again even after the data quality issue is resolved.
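The enforcement mechanism can be as simple as a guard that refuses to score a population whose live feeds fall below the validated floors. A sketch, with assumed endpoint names and thresholds:

```python
class CoverageBelowThreshold(Exception):
    """Raised when a required feed is below its validated coverage floor."""

# Set during validation; a documented governance artifact, not a tuning knob.
VALIDATED_THRESHOLDS = {"pharmacy_fills": 0.80, "diagnosis_codes": 0.95}

def enforce_coverage(live_coverage: dict[str, float]) -> None:
    """Block a model run if any required endpoint is under its threshold."""
    failures = [
        f"{name}: {live_coverage.get(name, 0.0):.0%} < {floor:.0%}"
        for name, floor in VALIDATED_THRESHOLDS.items()
        if live_coverage.get(name, 0.0) < floor
    ]
    if failures:
        raise CoverageBelowThreshold("; ".join(failures))

# A new feed with a claims lag: scoring is blocked, not silently degraded.
try:
    enforce_coverage({"pharmacy_fills": 0.55, "diagnosis_codes": 0.97})
except CoverageBelowThreshold as exc:
    print(f"model run blocked: {exc}")
```

The design choice worth noting is that the guard fails loudly rather than letting the model degrade quietly, which is exactly the trust failure the paragraph above describes.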
Where the Regulatory Layer Fits
There is a version of the over-collection problem that is driven not by ambition but by regulatory caution. Teams building AI products in healthcare sometimes collect more data than they need because they are uncertain about what a future regulatory requirement might demand, and they want to have it available just in case.
This logic has some validity. The regulatory environment around clinical AI is evolving, and requirements around model explainability, bias documentation, and training data provenance are likely to become more prescriptive over time. But collecting data beyond what is needed for the current use case, under a logic of regulatory optionality, creates its own problems: expanded PHI exposure, increased data governance burden, larger attack surface for security incidents, and a data environment that is harder to audit and explain than one that was scoped deliberately.
The minimal data principle is not in tension with regulatory compliance. It supports it. A model built on a precisely defined, well-documented set of data endpoints with clear coverage and consistency assessments is easier to explain to a regulator than one built on everything available. Explainability, provenance, and bias documentation all become more tractable when the data scope is narrow and deliberate.
Red flag: If your data acquisition strategy is being driven by “we might need it later” rather than “we need it for this specific use case,” you are building governance debt, not optionality.
What This Looks Like in Practice: A Quick Reference
For founders scoping a clinical AI product:
- Write the clinical question in one sentence before touching data infrastructure.
- List the minimum data endpoints required to answer it. If the list exceeds one page, the use case is under-defined.
- Get coverage and consistency numbers for each endpoint in your target health system’s production environment, not a curated dataset.
- Remove anything below your coverage threshold before training. Validate that removal improved or maintained performance.
- Define the minimum coverage threshold for production deployment and build an enforcement mechanism for it.
For enterprise data and AI teams inside health systems:
- Audit your existing AI and analytics use cases for data sources that are below acceptable coverage thresholds. Most mature deployments are running on at least one.
- Establish a data quality standard for each model in production, including coverage thresholds, consistency requirements, and a monitoring process that flags degradation.
- Resist the organizational pressure to add data sources to a model that is underperforming. Diagnose the coverage and consistency of existing sources first.
- Treat data minimization as a governance asset, not a compromise. Smaller, cleaner, well-documented data scopes are easier to govern, easier to audit, and easier to defend.
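The monitoring process in the second bullet above can start as a simple drift check: compare live coverage against the level each endpoint was validated at, and flag anything that has slipped past tolerance. A sketch, with illustrative numbers and a hypothetical 5% tolerance:

```python
def coverage_drift(baseline: dict[str, float], current: dict[str, float],
                   tolerance: float = 0.05) -> list[str]:
    """Flag endpoints whose live coverage has dropped more than `tolerance`
    below the level the model was validated at."""
    return [
        f"{name}: validated at {base:.0%}, now {current.get(name, 0.0):.0%}"
        for name, base in baseline.items()
        if base - current.get(name, 0.0) > tolerance
    ]

alerts = coverage_drift(
    baseline={"pharmacy_fills": 0.92, "lab_results": 0.85},
    current={"pharmacy_fills": 0.91, "lab_results": 0.71},
)
# pharmacy_fills moved within tolerance; lab_results degraded past it.
for alert in alerts:
    print(alert)
```

Wiring a check like this into whatever pipeline scheduler you already run is usually enough to catch the silent feed degradations that otherwise surface as unexplained model drift.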
Closing
The assumption that more data produces better AI is a reasonable prior in most industries. In healthcare, it is a trap.
Healthcare data is voluminous, inconsistent, unevenly distributed, and shaped by access patterns, documentation culture, and operational variability that have nothing to do with the clinical reality the model is trying to capture. Adding more of it to a model without assessing its coverage and consistency does not enrich the signal. It amplifies the noise.
The minimal data principle is not a constraint imposed by resource limitations. It is a discipline imposed by the realities of the environment. The teams that build reliable clinical AI products are the ones that define precisely what they need, assess rigorously whether they have it, cut aggressively what fails that assessment, and resist the temptation to acquire more before validating what they already have.
Small datasets, correctly chosen, outperform massive messy ones. In healthcare AI, that is not a consolation for limited data access. It is the strategy.

