Redefining Normal: Learning Patient-Specific Baselines in Clinical Medicine

0.1 The average man

For much of Western history, the term normal belonged to mathematics, referring to right angles or properties of distributions. The idea that the human body might conform to mathematical proportions remained largely metaphorical. This changed in the nineteenth century with Adolphe Quetelet, who applied the Gaussian error curve to human traits and introduced l’homme moyen — the “average man.” This was not merely a statistical abstraction, but an ideal type: the average was treated as the standard to which individuals should conform, and variation across the population was interpreted as deviation from that norm. In this way, what is average came to define what is expected of a healthy person.

This logic remains embedded in modern clinical practice. Reference ranges, diagnostic thresholds, and risk scores are largely derived from aggregated population data. Patients are evaluated by comparison to these distributions, with normality defined by proximity to a population mean or percentile. These definitions underpin clinical decision-making, shaping diagnoses, treatments, referrals, insurance determinations, and functional assessments such as work eligibility.

A major critique of this logic emerged from Georges Canguilhem, who argued that normality is not a statistical property of populations but of individual organisms. For Canguilhem, health is not simply the absence of disease, but the capacity to adapt to perturbations and establish new functional norms. Under this view, an individual may be statistically typical yet biologically pathological, or statistically atypical yet fully healthy. Population averages, therefore, provide limited insight into the condition of any given individual.

Recent advances in artificial intelligence enable a different approach. The increasing availability of clinical data, combined with advances in machine learning, makes it possible to construct more individualized representations of normality. However, these methods depend on the data and assumptions used to train them, and may therefore inherit the biases and structural limitations of their sources.

This dissertation revisits the tension articulated by Quetelet and Canguilhem in the context of modern AI. Clinical medicine has largely operationalized normality at the population level, while contemporary approaches seek to account for individual variation, with their own limitations. This thesis engages both perspectives within specific clinical contexts and demonstrates how redefining normal at the individual level can improve clinical decision-making.

0.2 Continuous to categorical thresholds

Clinical measurements are inherently continuous: body temperature, blood pressure, and anthropometrics vary along a spectrum. Yet clinical decisions are categorical: treat or wait, refer or reassure, approve or deny. Medicine bridges this gap with thresholds, which convert continuous variation into discrete actions.

In practice, these thresholds reflect two often conflated concepts. One defines normality in terms of risk: cutoffs beyond which adverse outcomes become more likely. The other defines normality in terms of statistical typicality: values that fall within the range observed in a “healthy” population. As these concepts were formalized, researchers recognized that disease varies across groups and used these differences to stratify risk, while also observing that measurements of health vary within groups and adopting stratified thresholds of “normal.” In both cases, the choice of how to stratify carries consequences: it can capture meaningful biological variation, but it can also normalize lower levels of health.

The origins of this problem can be traced to the nineteenth century. In the 1840s, John Hutchinson introduced the spirometer and demonstrated systematic relationships between lung capacity and characteristics such as height, age, occupation, and race. Within a decade, Samuel Cartwright, a slaveholding plantation physician, used the same instrument on enslaved Black individuals and interpreted lower measured volumes not as observation but as evidence of inherent physiological inferiority, using the finding to justify slavery. The shift from documenting variation to encoding it as innate difference was rapid, and not limited to spirometry. Similar patterns emerged in kidney function, cardiovascular risk, and pain management, as observed differences between groups were incorporated into clinical decision rules that systematically altered access to care.

These practices have come under increasing scrutiny, particularly where group-level variables such as race are used as inputs. Race is a social category, and its role in clinical equations depends on what unmeasured factors it proxies. In some cases, replacing race with direct measures of physiology improves performance. The CKD-EPI equation for estimating kidney function, for example, removed its race coefficient and incorporated cystatin C, a biomarker more directly related to kidney function, improving detection of impairment in populations previously affected by delayed diagnosis. In other cases, substitution introduces new limitations. The AHA PREVENT equations for cardiovascular risk replaced race with a neighborhood-level deprivation index. While this improves calibration at the population level, it can misclassify groups whose risk is not well captured by socioeconomic geography, such as South Asians in the United States.

The underlying problem is the same in both cases: group-level variables stand in for factors that are not directly measured, and what is typical for a population remains an imperfect guide to what is normal for a given patient. Risk prediction equations can be validated against measurable outcomes. Reference equations, by contrast, lack an independent gold standard. There is, for example, no external measure of what a given person's lung function should be.
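The mechanics of a typicality threshold can be made concrete in a few lines. The reference equation below uses hypothetical coefficients (not a published standard such as GLI-2012), purely to illustrate how a continuous measurement is converted into a categorical flag via the lower limit of normal.

```python
def predicted_fev1(age_years: float, height_cm: float) -> float:
    """Illustrative linear reference equation for FEV1 in liters.
    Coefficients are hypothetical, chosen only to show the mechanics."""
    return -2.0 + 0.04 * height_cm - 0.02 * age_years

def z_score(measured: float, predicted: float, sd: float = 0.5) -> float:
    # Standardize the measurement against the reference prediction.
    return (measured - predicted) / sd

# A value is conventionally flagged when it falls below the lower limit
# of normal (LLN), often the 5th percentile of the reference population.
LLN_Z = -1.645  # 5th percentile of a standard normal distribution

pred = predicted_fev1(age_years=60, height_cm=170)
z = z_score(2.8, pred)
print(f"predicted={pred:.2f} L, z={z:.2f}, below LLN: {z < LLN_Z}")
```

The categorical output ("below LLN" or not) is entirely determined by how the reference population, and hence the predicted value, is defined.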

In Chapter 1 of this thesis, I introduce a framework for decomposing group-level differences in clinical reference equations into measurable individual characteristics, and demonstrate its utility in spirometry, the domain in which this logic first took shape. Using large observational datasets, I show that a substantial share of the variation traditionally attributed to race can be accounted for by individual-level features, and that models incorporating these features outperform both race-stratified and race-neutral alternatives. The thresholds that convert continuous measurements into clinical decisions depend on how normal is defined; this work aims to move that definition closer to the individual.

0.3 Perils and Promises of Personalization

Population definitions of normal reflect Quetelet's philosophy, but typicality, even when stratified into groups of ostensibly similar individuals, remains an imperfect proxy for individual health. The alternative, that normality is a property of the individual rather than the population, has gained traction far beyond the philosophical tradition in which Canguilhem first articulated it.

Comparing an individual to themselves requires repeated observation. The proliferation of electronic health records, consumer devices, and direct-to-consumer health services has made such observation routine. Within the clinic, blood panels, blood pressure, and body weight now accumulate across years, forming individual trajectories. Outside the clinic, wearables track heart rate and sleep, glucose monitors yield continuous metabolic data, and services offering biomarker panels and whole-body imaging have made periodic screening accessible to healthy individuals. For a growing number of people, a personal history sufficient to define an individual baseline already exists.

This infrastructure supports a fundamentally different clinical posture: define what is normal for an individual using measurements taken during periods of stability, then detect deviations from that baseline as they emerge. Health, under this framing, is not the absence of disease as judged by population standards, but the maintenance of one’s own equilibrium. The orientation shifts from reactive to preventive, from treating disease after it manifests to monitoring for the transitions that precede it.

The appeal of this logic is clear, but narrowing the reference window to the individual introduces a complementary problem. The same sensitivity that detects early deviation from baseline also detects noise, without the specificity that comparison to a population provides. The more frequently a patient is measured, the greater the probability that normal physiological fluctuation will be mistaken for meaningful change. This is, at its core, a multiple testing problem: a routine blood panel may include 20 or more analytes, each compared to its own reference interval. Even if each test has only a 5 percent chance of producing a false alarm, the probability that at least one result will appear abnormal in a perfectly healthy person exceeds 60 percent. Add more measurements, repeat them over time, and the rate compounds. What seems like a precise window into individual health becomes a generator of noise. In radiology, incidental findings on imaging are so common that they gave rise to the term incidentaloma. Kohane and colleagues extended the concept to genomics as the incidentalome: the accumulation of findings generated by high-throughput screening, most of which are clinically irrelevant. The logic applies equally to longitudinal biomarker monitoring, consumer self-tracking, and direct-to-consumer screening. Optimizing for sensitivity alone, without population-level anchoring, trades one failure mode for another: missed change for false alarm.
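The compounding false-alarm rate follows directly from assuming independent tests; a few lines make the arithmetic explicit (the 20-analyte panel and 5 percent per-test rate are the figures used above).

```python
def family_wise_alarm_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent tests flags a
    perfectly healthy patient, given a per-test false-alarm rate alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

print(f"{family_wise_alarm_rate(1):.1%}")   # 5% for a single test
print(f"{family_wise_alarm_rate(20):.1%}")  # ~64% for a 20-analyte panel

# A Bonferroni-style correction restores the family-wise rate, at the
# cost of per-test sensitivity to genuine change:
print(f"{family_wise_alarm_rate(20, alpha=0.05 / 20):.1%}")  # ~4.9%
```

The correction illustrates the tradeoff named above: tightening per-test thresholds controls false alarms but blunts exactly the sensitivity that motivated individual monitoring in the first place.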

Blood-based laboratory tests illustrate both the appeal and the difficulty of this tradeoff. Routine blood panels are among the most clinically consequential measurements in modern medicine, and in an era where chronic metabolic disease accounts for a growing share of mortality, their role in early detection is difficult to overstate. Yet their interpretation remains largely population-based: a result is flagged as normal, high, or low by comparison to a reference interval derived from the central 95 percent of a healthy population. For many analytes, however, within-individual variation is far narrower than between-individual variation. A patient’s own values may fluctuate within a tight range over years, while the population interval, constructed to contain the variation across all individuals, spans a much wider band. The consequence is that a clinically meaningful shift from a patient’s personal baseline can occur entirely within the bounds of what the population considers normal.
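The gap between within- and between-individual variation can be illustrated numerically. The variance components below are assumed for illustration, not drawn from a real analyte; their ratio is related to what laboratory medicine calls the index of individuality.

```python
# Illustrative variance components (assumed, not from a real analyte):
# personal setpoints vary across the population with SD 1.0, while an
# individual's values fluctuate around their setpoint with SD 0.2.
between_sd = 1.0
within_sd = 0.2
pop_mean = 0.0

# Population reference interval: central 95% of total variation.
total_sd = (between_sd**2 + within_sd**2) ** 0.5
pop_lo, pop_hi = pop_mean - 1.96 * total_sd, pop_mean + 1.96 * total_sd

# A patient whose setpoint sits at the population mean shifts by three
# personal standard deviations, a large change for that individual.
shifted = pop_mean + 3 * within_sd

print(f"population interval: ({pop_lo:.2f}, {pop_hi:.2f})")
print(f"shifted value: {shifted:.2f}")
print(f"flagged by the population interval: {not (pop_lo <= shifted <= pop_hi)}")
```

Under these assumed numbers, the three-personal-SD shift lands well inside the population interval: a clinically meaningful individual change is invisible to population-based flagging.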

The resolution is not to choose between population and individual, but to integrate both. Population models provide context, prior expectation, and specificity; individual trajectories provide sensitivity. In Chapter 2, I examine the consequences of unconstrained personalization in laboratory medicine, quantifying the extent to which purely individual-derived intervals overcorrect and inflate false positives, and introduce an approach that anchors individual trajectories to population-level expectations. Neither source of information alone achieves the clinical sensitivity and specificity that their combination provides.
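One simple way to anchor an individual baseline to a population expectation is empirical-Bayes shrinkage. The sketch below assumes Gaussian variation with known variance components; it is a generic illustration of the idea, not the estimator developed in Chapter 2.

```python
def shrunken_baseline(personal_obs, pop_mean, between_var, within_var):
    """Empirical-Bayes estimate of an individual's setpoint: a weighted
    average of the personal mean and the population mean."""
    n = len(personal_obs)
    personal_mean = sum(personal_obs) / n
    # Weight on the personal mean grows with the number of observations
    # and with between-person variance, and shrinks toward the population
    # mean when within-person noise dominates.
    w = between_var / (between_var + within_var / n)
    return w * personal_mean + (1 - w) * pop_mean

# Few observations -> the estimate stays pulled toward the population mean.
print(shrunken_baseline([1.2, 1.4], pop_mean=0.0, between_var=1.0, within_var=0.5))
# Many observations -> the estimate approaches the personal mean.
print(shrunken_baseline([1.3] * 30, pop_mean=0.0, between_var=1.0, within_var=0.5))
```

The weight makes the tradeoff explicit: with sparse personal history the population supplies most of the expectation, and as observations accumulate the individual trajectory takes over.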

0.4 Hidden Dimensions

The previous sections established that population-based definitions of normality are both statistically insufficient and ethically fraught. The question is whether computation can do better. Over the past two decades, machine learning has been increasingly applied to clinical prediction, and its trajectory suggests a progressive capacity to model individual variation. But this capacity comes with a structural problem: the same machinery that enables personalization can encode and perpetuate the biases of the systems that generated its training data. Personalization and bias are not separate concerns. They arise from the same mechanism.

The earliest clinical prediction models operated within the same framework as the reference equations discussed above. Deep learning changed what could be modeled. Recurrent neural networks and, more recently, transformer architectures applied to electronic health record data made it possible to represent individual patient trajectories as sequences — to learn temporal patterns from longitudinal records rather than collapsing them into single-timepoint summaries. For the first time, a model could estimate what is expected for a specific patient given their own prior measurements, diagnoses, and treatments, rather than given their membership in a demographic group. This is the technical landscape within which the contributions of Chapters 2 and 3 operate: models that learn individual baselines from sequential clinical data.

Foundation models trained on clinical data extended this further. Systems such as CLMBR and Med-BERT use self-supervised pretraining on large longitudinal clinical corpora to learn representations conditioned on thousands of implicit covariates — not just age, sex, and height, but the full trajectory of a patient’s interactions with the healthcare system. In principle, these models approximate something close to the Canguilhem intuition: expected values derived from the dynamics of an individual system rather than from a population average. In practice, however, the data on which these models are trained is not a neutral record of biology. It is a record of clinical decisions, institutional practices, and systemic patterns of access and treatment.

This is where the paradox becomes concrete. Obermeyer et al. (2019) demonstrated the problem in a system that predates deep learning but illustrates the principle clearly. A widely deployed commercial algorithm used healthcare costs as a proxy for health need when allocating patients to care management programs. Because structural barriers — insurance coverage, geographic access, historical disinvestment — systematically reduce healthcare spending for Black patients relative to their disease burden, the algorithm learned to underestimate their needs. The model was not malfunctioning; it had faithfully learned the patterns in its training data. But those patterns encoded structural racism, and the algorithm reproduced it as a prediction of individual need. The distinction the model could not make — between “expected given biology” and “expected given differential care” — is precisely the distinction that individualized norms are supposed to capture.

The problem deepens with more expressive models. Gichoya et al. (2022) showed that deep learning models can predict patient race from chest X-rays with greater than 90% accuracy — from images that human radiologists cannot reliably use to determine race. Similar studies have shown that models can recover sex and age from retinal images, and socioeconomic status from patterns in EHR event sequences. These results mean that even when demographic variables are withheld from model inputs, high-dimensional models can reconstruct them from the data itself. The implications are significant. In the systems discussed in Section 0.3 — spirometry equations, eGFR formulas — demographic covariates were at least explicit and therefore debatable. Braun (2014) could critique the race correction in spirometry because it was visible in the equation. When a foundation model or deep neural network implicitly recovers group membership from raw clinical or imaging data, the same conditioning occurs but without transparency. The model may produce predictions that vary systematically across demographic groups, through pathways that are neither specified by the designer nor accessible to the clinician.

Multimodal models add a further layer of complexity. Vision-language models such as GPT-4V and Gemini Pro Vision can integrate medical images with clinical text, offering the possibility of contextual interpretation that neither modality alone supports. But their diagnostic behavior has been shown to be sensitive to prompt construction in ways that interact with patient demographics — raising questions about whether these models can be reliably applied across populations without careful auditing of how prompt variation affects outputs. This is the landscape within which Chapter 3 operates.

The trajectory from classical risk scores to foundation models represents a genuine expansion of what can be learned about individual patients. But each step in that progression also expands the surface area for bias. Classical models encoded bias through explicit covariates that could be inspected and debated. Deep learning models encode it through learned representations that are harder to audit. Foundation models, trained on the full record of healthcare delivery, absorb not just biological signal but the structural conditions under which that signal was generated. The most expressive models — those most capable of learning individual norms — are also the most capable of learning the inequities of the systems they were trained on. This makes interpretability and auditing not peripheral concerns but essential ones: without tools to understand what a model has learned and how it varies across groups, personalization risks becoming a more sophisticated form of the same population-level logic it was meant to replace.
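One concrete auditing primitive is a probe: a simple classifier trained to test whether a protected attribute can be recovered from a model's learned representations. The sketch below uses synthetic two-dimensional embeddings and a nearest-centroid probe; all names and numbers are illustrative, not drawn from any real system.

```python
import random

def probe_accuracy(embeddings, groups, train_frac=0.7, seed=0):
    """Nearest-centroid probe: held-out accuracy well above chance
    indicates the representation encodes the attribute even though
    it was never an explicit input. A minimal audit sketch."""
    rng = random.Random(seed)
    idx = list(range(len(embeddings)))
    rng.shuffle(idx)
    split = int(train_frac * len(idx))
    train, test = idx[:split], idx[split:]

    # Per-group centroid over the training split.
    centroids = {}
    for g in set(groups):
        members = [embeddings[i] for i in train if groups[i] == g]
        dim = len(members[0])
        centroids[g] = [sum(v[d] for v in members) / len(members) for d in range(dim)]

    def nearest(v):
        return min(centroids, key=lambda g: sum((a - b) ** 2 for a, b in zip(v, centroids[g])))

    hits = sum(nearest(embeddings[i]) == groups[i] for i in test)
    return hits / len(test)

# Synthetic embeddings in which dimension 0 leaks group membership.
rng = random.Random(1)
groups = [rng.choice(["a", "b"]) for _ in range(400)]
emb = [[(1.0 if g == "a" else -1.0) + rng.gauss(0, 0.5), rng.gauss(0, 1)] for g in groups]
print(f"held-out probe accuracy: {probe_accuracy(emb, groups):.2f}")  # well above 0.5
```

A probe of this kind does not explain how the attribute is encoded, but it gives a falsifiable check on whether a representation carries group information, which is the minimal precondition for the kind of auditing argued for above.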

0.5 Thesis Outline