Signal-noise
A middle path
Based on a presentation to and discussion with the PSI Data Science group.
“Noise” is a ubiquitous idea within analytics and especially within biomedical analytics - unwanted, extraneous or corrupted data points that obscure the real pattern you want to model. That noise can come from measurement error, sensor imprecision, mislabeling, ambiguous clinical endpoints, qualitative scores, outliers, cross-site variability, (yadda yadda yadda) or - as is the case within biology - simple natural variation. Many, if not most, biological signals just seem to be naturally “fuzzy”, going up and down slightly for seemingly no reason. If you don’t handle this noise well, models can overfit to the noise, or collapse when applied to real-world data.
Machine learning has spent the last decade chasing bigger architectures, bigger datasets, and bigger compute budgets while largely ignoring this issue. We pretend that all data points are equally informative, or that there’s “good” data and “bad” data and we simply have to decide which is which and purge the bad, so that only irreproachable data is left.
There’s a bit of back literature here that I’m going to elide, but the last several years have seen various attempts at probabilistic approaches to training on noisy data, where the model infers which points are likely corrupted and reduces their influence. Instead of pretending noise doesn’t exist, these approaches integrate uncertainty directly into the training loop. And this has had clear benefits: a study on electronic health records (EHRs) showed that the messy chaos of clinical practice makes patient data noisy, and that noise-robust models significantly outperform standard ones on real hospital data.
Alzaraiee and Niswonger (2025) present an interesting take on this (see: https://www.sciencedirect.com/science/article/pii/S1364815224001944). They propose a probabilistic training approach combining classical ML with Markov Chain Monte Carlo (MCMC) simulation, aimed at detecting and “under-weighting” likely noisy data points during training. In effect, rather than treating all training examples equally, this method tries to infer which data are likely “noise / corrupted / low-confidence,” and reduce their influence on the model’s parameter estimation. 
An aside: this is not a question of classifying data as noise / not-noise. That just puts us back in the bad data / good data paradigm. What this and related methods try to do is detect which data points are “noisy” (not pure “noise”), quantify the extent of noisiness and weight the data accordingly. In contrast to preprocessing-only noise removal, this integrates noise estimation into the training loop, takes out a human-in-the-loop step and makes the noise handling more consistent and quantifiable.
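To make the "weight, don't delete" idea concrete, here is a minimal sketch of a weighted least-squares line fit in NumPy. This is my own illustration, not the authors' code: the synthetic data, the function name `weighted_line_fit` and the hand-set weight of 0.1 are all assumptions, standing in for whatever weights a noise-aware method would actually infer.

```python
import numpy as np

def weighted_line_fit(x, y, weights):
    """Fit y ~ slope * x + intercept by weighted least squares.

    Each weight scales how much its point contributes to the squared error.
    """
    X = np.column_stack([x, np.ones_like(x)])
    sw = np.sqrt(weights)  # scaling rows by sqrt(w) gives weighted least squares
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef  # (slope, intercept)

# Synthetic data (assumption): true slope 2, intercept 1, mild Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)
y[-5:] += 15.0  # corrupt the last five points

uniform = np.ones_like(x)  # every point treated as equally informative
soft = np.ones_like(x)
soft[-5:] = 0.1            # hypothetical "noisiness" weights, set by hand here

slope_uniform, _ = weighted_line_fit(x, y, uniform)
slope_soft, _ = weighted_line_fit(x, y, soft)
print(f"uniform weights: slope = {slope_uniform:.2f}")
print(f"soft weights:    slope = {slope_soft:.2f}")
```

The corrupted points are never thrown away; they just pull on the fit less. The hard part, of course, is getting those weights from the data rather than setting them by hand.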
For those of you with a Bayesian tendency, the MCMC approach will appeal. The approach, in a sense, is modelling the data, using a Markov process to “tour” the data distribution. This distribution is a set of plausible splits of the data (noise / not-noise), but because it’s an ensemble, points can be assigned a degree of noisiness rather than a binary good/bad.
Thankfully, the authors actually validate the approach: first on some synthetic datasets with added noise and then on real-world public water supply data which “may contain authentic anomalous data and unknown noise caused by sensor errors, human data input errors, and errors in the incorrect association between water withdrawals and populations served”.
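As a rough intuition pump for how an ensemble of splits turns into a degree of noisiness, here is a toy Metropolis sampler over inlier/outlier masks. To be clear, this is a sketch under my own assumptions and not the algorithm from the paper or its repository: the model is a straight line refit on whichever points the current mask keeps, and the names (`mask_score`, `sample_inlier_probs`) and parameters (`sigma_in`, `sigma_out`, `p_in`) are hypothetical.

```python
import numpy as np

def log_normal_pdf(r, sigma):
    """Log density of a zero-mean Gaussian with standard deviation sigma."""
    return -0.5 * (r / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def mask_score(x, y, mask, sigma_in, sigma_out, p_in):
    """Score a split: refit a line on the kept points, then rate every point.

    Kept points are scored under a narrow Gaussian, the rest under a wide one.
    """
    if mask.sum() < 2:
        return -np.inf  # cannot fit a line to fewer than two points
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    resid = y - (slope * x + intercept)
    loglik = np.where(mask, log_normal_pdf(resid, sigma_in),
                      log_normal_pdf(resid, sigma_out))
    logprior = np.where(mask, np.log(p_in), np.log(1 - p_in))
    return float(np.sum(loglik + logprior))

def sample_inlier_probs(x, y, n_steps=4000, burn_in=1000,
                        sigma_in=0.5, sigma_out=10.0, p_in=0.9, seed=0):
    """Metropolis walk over masks; returns how often each point was kept."""
    rng = np.random.default_rng(seed)
    mask = np.ones(x.size, dtype=bool)  # start by trusting every point
    score = mask_score(x, y, mask, sigma_in, sigma_out, p_in)
    counts = np.zeros(x.size)
    kept = 0
    for step in range(n_steps):
        i = rng.integers(x.size)         # symmetric proposal: flip one point
        proposal = mask.copy()
        proposal[i] = not proposal[i]
        new_score = mask_score(x, y, proposal, sigma_in, sigma_out, p_in)
        if np.log(rng.random()) < new_score - score:
            mask, score = proposal, new_score
        if step >= burn_in:
            counts += mask
            kept += 1
    return counts / kept  # fraction of sampled splits in which each point is kept

# Synthetic data (assumption): a clean line plus five corrupted points.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)
corrupted = np.array([3, 11, 20, 27, 35])
y[corrupted] += 8.0

probs = sample_inlier_probs(x, y)
print("corrupted points:", np.round(probs[corrupted], 2))
print("mean over clean points:", round(float(np.delete(probs, corrupted).mean()), 2))
```

The returned fractions are the interesting part: a point that sits slightly off the line is kept in most sampled splits but not all, so it gets a weight somewhere between 0 and 1 instead of a verdict. Those fractions could then be used as sample weights when training the final model.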
Some thoughts:
Probabilistic noise-detection & re-weighting inescapably rests on assumptions about the distribution of noise and data. If those assumptions are violated (e.g. noise is structured, systematic bias rather than random errors), it’s unclear what would happen. This seems like an impossible point to fix.
Noise isn’t always noise; sometimes it’s rare, weird data. And methods like this have a trade-off: under-weighting noisy data reduces overfitting but might also reduce sensitivity to these rare but real patterns (e.g. rare disease variants, outlier responses, sub-phenotypes). This also seems impossible to handle … but perhaps the correct attitude to take is that an outlier is an outlier, no matter what causes it.
The computational complexity seems fine, but there’s no way this is handling omics data any time soon. There’s an inevitable overhead combining ML training with MCMC.
Some ideas:
This might not be capable of handling most ‘omics data, but it seems like it could handle (say) patient data that was in the 10K range. Or perhaps after preliminary data filtering and feature selection, datasets would be small enough to handle (e.g. epigenetics).
Perhaps it could be used in federated or privacy-aware learning contexts (e.g. data pooled from different centres), where data quality is heterogeneous and discarding data is not desirable. Noise-aware weighting could improve model robustness across centres.
Something interesting for sure. And the Python code is available: https://github.com/aymanalz/outlier_detector


