We systematically evaluate the training methodology and efficacy of two inpainting-based pretext tasks, context prediction and context restoration, for medical image segmentation using self-supervised learning (SSL). Multiple versions of self-supervised U-Net models were trained to segment MRI and CT datasets, each using a different combination of design choices and pretext tasks to determine the effect of these design choices on segmentation performance. The optimal design choices were used to train SSL models that were then compared with baseline supervised models for computing clinically relevant metrics in label-limited scenarios. We observed that SSL pretraining with context restoration using 32 × 32 patches and Poisson-disc sampling, transferring only the pretrained encoder weights, and fine-tuning immediately with an initial learning rate of 1 × 10⁻³ provided the most benefit over supervised learning for MRI and CT tissue segmentation accuracy (P < 0.001). For both datasets and most label-limited scenarios, scaling the size of the unlabeled pretraining data improved segmentation performance. SSL models pretrained with this amount of data outperformed baseline supervised models in the computation of clinically relevant metrics, especially when the performance of supervised learning was low. Our results demonstrate that SSL pretraining using inpainting-based pretext tasks can increase the robustness of models in label-limited scenarios and reduce worst-case errors that occur with supervised learning.
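As an illustration of the context-restoration corruption step described above, the following sketch draws patch centers with Poisson-disc-style rejection sampling and swaps pairs of 32 × 32 patches so a U-Net can be trained to restore the original image. The function names, minimum spacing, and swap count are assumptions for illustration, not the study's implementation.

```python
import numpy as np

def poisson_disc_centers(shape, patch=32, min_dist=48, n_points=20, rng=None):
    """Rejection-sample patch centers at least `min_dist` pixels apart."""
    rng = rng or np.random.default_rng()
    centers = []
    attempts = 0
    while len(centers) < n_points and attempts < 10_000:
        y = rng.integers(patch // 2, shape[0] - patch // 2)
        x = rng.integers(patch // 2, shape[1] - patch // 2)
        if all((y - cy) ** 2 + (x - cx) ** 2 >= min_dist ** 2 for cy, cx in centers):
            centers.append((y, x))
        attempts += 1
    return centers

def context_restoration_corrupt(image, patch=32, n_swaps=8, rng=None):
    """Swap random pairs of patches; the model learns to undo the swaps."""
    rng = rng or np.random.default_rng()
    corrupted = image.copy()
    centers = poisson_disc_centers(image.shape, patch, n_points=2 * n_swaps, rng=rng)
    h = patch // 2
    for (y1, x1), (y2, x2) in zip(centers[::2], centers[1::2]):
        a = corrupted[y1 - h:y1 + h, x1 - h:x1 + h].copy()
        corrupted[y1 - h:y1 + h, x1 - h:x1 + h] = corrupted[y2 - h:y2 + h, x2 - h:x2 + h]
        corrupted[y2 - h:y2 + h, x2 - h:x2 + h] = a
    return corrupted  # the pretraining target is the original `image`
```

During pretraining, the network receives the corrupted image and is optimized to reconstruct the original; the pretrained encoder weights are then transferred to the downstream segmentation model.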
Artificial intelligence (AI) has revolutionized multiple fields, including safety-critical domains such as healthcare. It has shown remarkable potential for building both diagnostic and predictive models in medicine using various types of healthcare data. Despite this potential, there are two major barriers to medical AI development and its subsequent adoption in healthcare systems: 1) training AI models that perform well with a limited amount of labeled data is challenging, and curating large labeled datasets is costly and in many cases infeasible; 2) even well-trained, state-of-the-art models, with impressive accuracies on their test sets and developed with rigorous validation and testing, may fail to generalize to new patients when deployed and may be brittle under distribution shifts. This reduces trust in model capabilities and limits their adoption into clinical practice.

In this thesis, I address the above barriers to medical AI development and deployment, namely robustness, data efficiency, and model trust, by presenting three different works that improve upon the current state of the art for various modalities.

In the first part of my thesis, I present observational supervision, a novel supervision paradigm wherein we use passively collected, auxiliary metadata to train AI models. I use observational supervision to tackle the major challenge of training robust, high-performing models with limited training data for clinical outcome prediction. Clinical outcome prediction models can improve medical care and aid in clinical decision making, but they are typically trained with limited data, resulting in models with narrow capabilities and reduced generalization. Audit logs are an often underutilized, passively collected data source in electronic health record (EHR) systems that capture the interactions of clinicians with the EHR and represent observational signals. Our proposed method of leveraging observational supervision for structured electronic health records, using audit logs in conjunction with clinical data, improves both the performance and robustness of AI models trained to predict clinical outcomes in two clinically important diseases (acute kidney injury and acute ischemic stroke), even with limited labeled training data.

In the second part of my thesis, I propose domain-specific augmentation strategies for self-supervised foundation models that enable large-scale, label-efficient training of AI models, tackling the major challenges of model robustness and label efficiency. The foundation model paradigm involves pretraining models on large quantities of data in a self-supervised manner and then adapting the pretrained model to different downstream tasks. Foundation models provide an opportunity for improving model robustness in a label-efficient fashion. Augmentations, or transformations of the input, are key to the success of foundation models; however, medical images are very different from natural images and need specialized augmentation strategies. Our proposed augmentation strategies for medical images result in a domain-specific foundation model that improves performance over data-hungry, fully supervised models for chest X-ray classification and generalizes to both unseen populations and out-of-distribution data with limited labels.

In the third part of my thesis, I present TRUST-LAPSE, an explainable, post-hoc, and actionable trust-scoring framework for continuous AI model monitoring, tackling the major challenge of model trust.
AI models, despite their success on test sets, require label-free, continuous model monitoring that can quantify trust in their predictions to ensure safe and reliable deployment. Techniques such as classical uncertainty estimation, confidence calibration, and Bayesian networks are currently employed for this purpose but suffer from several limitations.
Audit logs in electronic health record (EHR) systems capture interactions of providers with clinical data. We determine whether machine learning (ML) models trained using audit logs in conjunction with clinical data ("observational supervision") outperform ML models trained using clinical data alone in clinical outcome prediction tasks, and whether they are more robust to temporal distribution shifts in the data.
Using clinical and audit log data from Stanford Healthcare, we trained and evaluated various ML models, including logistic regression, support vector machine (SVM) classifiers, neural networks, random forests, and gradient-boosted machines (GBMs), on clinical EHR data, with and without audit logs, for two clinical outcome prediction tasks: major adverse kidney events within 120 days of ICU admission (MAKE-120) in acute kidney injury (AKI) patients and 30-day readmission in acute stroke patients. We further tested the best-performing models using patient data acquired during different time intervals to evaluate the impact of temporal distribution shifts on model performance.
Performance generally improved for all models when trained with clinical EHR data and audit log data compared with those trained with only clinical EHR data, with GBMs tending to have the best overall performance. GBMs trained with clinical EHR data and audit logs outperformed GBMs trained without audit logs on both clinical outcome prediction tasks: AUROC 0.88 (95% CI: 0.85-0.91) vs. 0.79 (95% CI: 0.77-0.81) for MAKE-120 prediction in AKI patients, and AUROC 0.74 (95% CI: 0.71-0.77) vs. 0.63 (95% CI: 0.62-0.64) for 30-day readmission prediction in acute stroke patients. The performance of GBM models trained using audit log and clinical data degraded less in later time intervals than that of models trained using only clinical data.
Observational supervision with audit logs improved the performance of ML models trained to predict important clinical outcomes in patients with AKI and acute stroke, and improved robustness to temporal distribution shifts.
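A minimal sketch of this training setup, assuming audit-log activity is summarized as per-patient count features concatenated with clinical features; the synthetic data, feature layout, and scikit-learn GBM below are illustrative stand-ins for the study's pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
X_clinical = rng.normal(size=(n, 20))     # labs, vitals, demographics (synthetic)
X_audit = rng.poisson(3.0, size=(n, 15))  # counts of EHR interaction types (synthetic)
y = rng.integers(0, 2, size=n)            # e.g., MAKE-120 or 30-day readmission label

# "Observational supervision": append audit-log features to clinical features.
X_with_logs = np.hstack([X_clinical, X_audit])
X_tr, X_te, y_tr, y_te = train_test_split(X_with_logs, y, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1]))
```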
Continuous monitoring of trained ML models to determine when their predictions should and should not be trusted is essential for their safe deployment. Such a framework ought to be high-performing, explainable, post hoc, and actionable. We propose TRUST-LAPSE, a "mistrust" scoring framework for continuous model monitoring. We assess the trustworthiness of each input sample's model prediction using a sequence of latent-space embeddings. Specifically, 1) our latent-space mistrust score estimates mistrust using distance metrics (Mahalanobis distance) and similarity metrics (cosine similarity) in the latent space, and 2) our sequential mistrust score determines deviations in correlations over the sequence of past input representations in a nonparametric, sliding-window-based algorithm for actionable continuous monitoring. We evaluate TRUST-LAPSE via two downstream tasks: 1) distributionally shifted input detection and 2) data drift detection. We evaluate across diverse domains, audio and vision, using public datasets, and further benchmark our approach on challenging, real-world electroencephalogram (EEG) datasets for seizure detection. Our latent-space mistrust scores achieve state-of-the-art results with AUROCs of 84.1 (vision), 73.9 (audio), and 77.1 (clinical EEGs), outperforming baselines by over 10 points. We expose critical failures in popular baselines that remain insensitive to input semantic content, rendering them unfit for real-world model monitoring. We show that our sequential mistrust scores achieve high drift detection rates; over 90% of the streams show < 20% error across all domains. Through extensive qualitative and quantitative evaluations, we show that our mistrust scores are more robust and provide explainability for easy adoption into practice.
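A hedged sketch of how a latent-space mistrust score of this kind can be computed, combining Mahalanobis distance to the training-embedding distribution with cosine similarity to its mean; the combination rule and names below are illustrative assumptions, not the reference TRUST-LAPSE implementation:

```python
import numpy as np

def fit_latent_stats(train_embeddings):
    """Summarize training embeddings by their mean and (pseudo-)inverse covariance."""
    mu = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability
    return mu, cov_inv

def latent_mistrust(z, mu, cov_inv, eps=1e-8):
    """Mistrust rises with Mahalanobis distance and falls with cosine similarity."""
    d = z - mu
    mahalanobis = float(np.sqrt(d @ cov_inv @ d))
    cosine = float(z @ mu / (np.linalg.norm(z) * np.linalg.norm(mu) + eps))
    return mahalanobis * (1.0 - cosine)

train_z = np.random.default_rng(0).normal(size=(500, 64))  # synthetic embeddings
mu, cov_inv = fit_latent_stats(train_z)
print(latent_mistrust(train_z[0], mu, cov_inv))  # low score for in-distribution input
```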
Epiretinal prostheses for treating blindness activate axon bundles, causing large, arc-shaped visual percepts that limit the quality of artificial vision. Improving the function of epiretinal prostheses therefore requires understanding and avoiding axon bundle activation. This study introduces a method to detect axon bundle activation on the basis of its electrical signature and uses the method to test whether epiretinal stimulation can directly elicit spikes in individual retinal ganglion cells without activating nearby axon bundles. Combined electrical stimulation and recording from isolated primate retina were performed using a custom multielectrode system (512 electrodes, 10-μm diameter, 60-μm pitch). Axon bundle signals were identified by their bidirectional propagation, speed, and increasing amplitude as a function of stimulation current. The threshold for bundle activation varied across electrodes and retinas, and was in the same range as the threshold for activating retinal ganglion cells near their somas. In the peripheral retina, 45% of electrodes that activated individual ganglion cells (17% of all electrodes) did so without activating bundles. This permitted selective activation of 21% of recorded ganglion cells (7% of expected ganglion cells) over the array. In one recording in the central retina, 75% of electrodes that activated individual ganglion cells (16% of all electrodes) did so without activating bundles. The ability to selectively activate a subset of retinal ganglion cells without axon bundles suggests a possible novel architecture for future epiretinal prostheses.
Large-scale multielectrode recording and stimulation were used to test how selectively retinal ganglion cells can be electrically activated without activating axon bundles. A novel method was developed to identify axon activation on the basis of its unique electrical signature and was used to find that a subset of ganglion cells can be activated at single-cell, single-spike resolution without producing bundle activity in peripheral and central retina. These findings have implications for the development of advanced retinal prostheses.
Objective: Retinal prostheses must be able to activate cells selectively in order to restore high-fidelity vision. However, inadvertent activation of distant retinal ganglion cells (RGCs) through electrical stimulation of axon bundles can produce irregular and poorly controlled percepts, limiting artificial vision. In this work, we aim to provide an algorithmic solution to the problem of detecting axon bundle activation with a bidirectional epiretinal prosthesis. Methods: The algorithm uses electrical recordings to determine the stimulation current amplitudes above which axon bundle activation occurs. Bundle activation is defined as the axonal stimulation of RGCs with unknown soma and receptive field locations, typically beyond the electrode array. The method exploits the spatiotemporal characteristics of electrically evoked spikes to overcome the challenge of detecting small axonal spikes. Results: The algorithm was validated using large-scale, single-electrode, short-pulse, ex vivo stimulation and recording experiments in macaque retina, by comparing algorithmically and manually identified bundle activation thresholds. For 88% of the electrodes analyzed, the threshold identified by the algorithm was within ±10% of the manually identified threshold, with a correlation coefficient of 0.95. Conclusion: This work presents a simple, accurate, and efficient algorithm to detect axon bundle activation in epiretinal prostheses. Significance: The algorithm could be used in a closed-loop manner by a future epiretinal prosthesis to reduce the poorly controlled visual percepts associated with bundle activation. Activation of distant cells via axonal stimulation is likely to occur in other types of retinal implants and in cortical implants, so the method may be broadly applicable.
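The core idea, that recruiting additional axons makes the evoked signal amplitude grow with stimulation current, can be sketched as a simple threshold search; the decision rule, noise floor, and synthetic values below are illustrative assumptions, not the validated algorithm:

```python
import numpy as np

def bundle_threshold(currents, evoked_amplitudes, noise_floor):
    """currents: sorted stimulation amplitudes (uA);
    evoked_amplitudes: recorded signal amplitude per current (uV).
    Returns the lowest current whose response exceeds the noise floor
    and keeps growing at all higher currents, else None."""
    for i, (c, a) in enumerate(zip(currents, evoked_amplitudes)):
        above = evoked_amplitudes[i:] > noise_floor
        growing = np.all(np.diff(evoked_amplitudes[i:]) >= 0)
        if a > noise_floor and above.all() and growing:
            return c  # lowest current producing sustained, growing bundle activity
    return None  # no bundle activation detected within the tested range

currents = np.array([0.5, 1.0, 1.5, 2.0, 2.5])        # uA, synthetic
amps = np.array([2.0, 2.5, 8.0, 15.0, 26.0])          # uV, synthetic
print(bundle_threshold(currents, amps, noise_floor=5.0))  # -> 1.5
```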
Labeled data is a critical resource for training and evaluating machine learning models. However, many real-life datasets are only partially labeled. We propose a semi-supervised machine learning training strategy to improve event detection performance on sequential data, such as video recordings, when only sparse labels are available, such as event start times without their corresponding end times. Our method uses noisy guesses of the events' end times to train event detection models. Depending on how conservative these guesses are, mislabeled samples may be introduced into the training set. We further propose a mathematical model for explaining and estimating the evolution of the classification performance for increasingly noisier end time estimates. We show that neural networks can improve their detection performance by leveraging more training data with less conservative approximations, despite the higher proportion of incorrect labels. We adapt sequential versions of CIFAR-10 and MNIST, and use the Berkeley MHAD and HMDB51 video datasets, to empirically evaluate our method, and find that our risk-tolerant strategy outperforms conservative estimates by 3.5 points of mean average precision for CIFAR, 30 points for MNIST, 3 points for MHAD, and 14 points for HMDB51. We then leverage the proposed training strategy to tackle a real-life application, processing continuous video recordings of epilepsy patients, and show that our method outperforms baseline labeling methods by 17 points of average precision and reaches classification performance similar to that of fully supervised models. We share part of the code for this article at the following repository: fpgdubost/CIFAR-10-Sparsely-Labeled-Sequential-Data.
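A minimal sketch of the sparse-label strategy, assuming frames are labeled positive from each known start time up to a guessed end time; the duration guesses below are arbitrary illustrations:

```python
import numpy as np

def labels_from_starts(n_frames, start_frames, guessed_duration):
    """Label frames positive from each event start up to a guessed end time.
    Longer guesses yield more positive training frames, but some may fall
    past the true event end and be mislabeled."""
    y = np.zeros(n_frames, dtype=np.int64)
    for s in start_frames:
        y[s:min(s + guessed_duration, n_frames)] = 1
    return y

# Conservative guess (few, mostly correct positives) vs. risk-tolerant guess
# (more positives, some of them likely past the true event end):
y_conservative = labels_from_starts(1000, [100, 400], guessed_duration=20)
y_risky = labels_from_starts(1000, [100, 400], guessed_duration=120)
print(y_conservative.sum(), y_risky.sum())  # 40 vs. 240 positive frames
```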
Continuous monitoring of trained ML models to determine when their predictions should and should not be trusted is essential for their safe deployment. Such a framework ought to be high-performing, explainable, post-hoc, and actionable. We propose TRUST-LAPSE, a "mistrust" scoring framework for continuous model monitoring. We assess the trustworthiness of each input sample's model prediction using a sequence of latent-space embeddings. Specifically, (a) our latent-space mistrust score estimates mistrust using distance metrics (Mahalanobis distance) and similarity metrics (cosine similarity) in the latent space, and (b) our sequential mistrust score determines deviations in correlations over the sequence of past input representations in a non-parametric, sliding-window-based algorithm for actionable continuous monitoring. We evaluate TRUST-LAPSE via two downstream tasks: (1) distributionally shifted input detection, and (2) data drift detection. We evaluate across diverse domains, audio and vision, using public datasets, and further benchmark our approach on challenging, real-world electroencephalogram (EEG) datasets for seizure detection. Our latent-space mistrust scores achieve state-of-the-art results with AUROCs of 84.1 (vision), 73.9 (audio), and 77.1 (clinical EEGs), outperforming baselines by over 10 points. We expose critical failures in popular baselines that remain insensitive to input semantic content, rendering them unfit for real-world model monitoring. We show that our sequential mistrust scores achieve high drift detection rates; over 90% of the streams show < 20% error across all domains. Through extensive qualitative and quantitative evaluations, we show that our mistrust scores are more robust and provide explainability for easy adoption into practice.
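Complementing the latent-space score sketched earlier, the sequential mistrust score can be illustrated with a sliding-window rank statistic over correlations of consecutive embeddings; the window size and deviation rule below are assumptions, not the paper's exact algorithm:

```python
import numpy as np

def sequential_mistrust(embeddings, window=20):
    """embeddings: (T, d) sequence of latent representations."""
    z = np.asarray(embeddings, dtype=float)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    corr = np.sum(z[1:] * z[:-1], axis=1)  # cosine similarity of consecutive pairs
    scores = np.zeros(len(corr))
    for t in range(window, len(corr)):
        ref = corr[t - window:t]
        # Nonparametric deviation: how extreme the current correlation is
        # relative to its rank within the reference window (0 = typical,
        # near 1 = outside the window's range on either side).
        rank = np.mean(ref < corr[t])
        scores[t] = 2 * abs(rank - 0.5)
    return scores

rng = np.random.default_rng(0)
# Synthetic stream with a mean shift (drift) at index 100:
stream = np.vstack([rng.normal(0, 1, (100, 16)), rng.normal(5, 1, (50, 16))])
print(sequential_mistrust(stream)[95:110].round(2))  # scores rise near the drift point
```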
Multipliers are core components of most digital signal processing algorithms; they lie in the critical delay path and largely determine the performance of any algorithm. Over the years, various approaches have been proposed to reduce the computational overhead of conventional multipliers, and Vedic mathematics has been one of them. In this paper, a novel multiplier unit is proposed that integrates the advantages of each of the sutras. The "Sampoornam" (or "Absolute Vedic") multiplier is designed with a specialized logic unit that decides which multiplier should be used for optimum results based on the type of input, improving efficiency. The proposed Sampoornam multiplier is used to design a 4-bit multiplier-accumulator (MAC) unit and is extended up to 64 bits using a Vedic scaling technique. Sampoornam is more time-efficient than present-day multipliers such as the (a*b) algorithm, Booth, and Wallace multipliers. The 4-bit MAC unit developed using Sampoornam shows a 25% reduction in time delay compared with a MAC developed using a Wallace multiplier, and a similar trend is observed as the number of bits increases.
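One of the sutras such a selection unit could dispatch to is "Urdhva Tiryagbhyam" (vertically and crosswise), in which each output position sums the cross-products of input digits whose indices add to that position, with carries propagated afterward. The sketch below is a software analogue of that datapath, with digit vectors standing in for hardware bit lines; it is an illustration, not the proposed design:

```python
def urdhva_multiply(a_digits, b_digits, base=2):
    """Multiply two little-endian digit vectors (base 2 models binary hardware)."""
    n, m = len(a_digits), len(b_digits)
    out = [0] * (n + m)
    for k in range(n + m - 1):
        # "Crosswise": sum products of digit pairs whose indices sum to k.
        out[k] += sum(a_digits[i] * b_digits[k - i]
                      for i in range(max(0, k - m + 1), min(k, n - 1) + 1))
    carry = 0
    for k in range(n + m):  # propagate carries ("vertically")
        total = out[k] + carry
        out[k], carry = total % base, total // base
    return out

# 4-bit example: 11 x 13 = 143 -> binary 10001111 (digits little-endian below)
print(urdhva_multiply([1, 1, 0, 1], [1, 0, 1, 1]))  # [1, 1, 1, 1, 0, 0, 0, 1]
```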
Image augmentations are quintessential for effective visual representation learning across self-supervised learning techniques. While augmentation strategies for natural images have been studied extensively, medical images are vastly different from their natural counterparts. Thus, it is unknown whether common augmentation strategies employed in Siamese representation learning generalize to medical images, and to what extent. To address this challenge, in this study we systematically assess the effect of various augmentations on the quality and robustness of the learned representations. We train and evaluate Siamese networks for abnormality detection on chest X-rays across three large datasets (MIMIC-CXR, CheXpert, and VinDr-CXR). We investigate the efficacy of the learned representations through experiments involving linear probing, fine-tuning, zero-shot transfer, and data efficiency. Finally, we identify a set of augmentations that yield robust representations that generalize well to both out-of-distribution data and diseases, while outperforming supervised baselines by up to 20% using just zero-shot transfer and linear probes. Our code is available at https://github.com/StanfordMIMI/siaug.
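As an illustration of composing such a pipeline, the sketch below builds two independently augmented views of a chest X-ray for Siamese pretraining using torchvision; the specific transforms and parameters are placeholders, and the augmentation set the study actually identifies as robust is documented in the linked repository:

```python
from torchvision import transforms

# Placeholder augmentation pipeline for grayscale chest X-rays.
cxr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# In Siamese representation learning, two independently augmented views of
# the same image are encoded and their representations are pulled together:
#   view1, view2 = cxr_augment(img), cxr_augment(img)
```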