Unsupervised anomaly detection (UAD) is a diverse research area explored across many application domains. Over time, numerous anomaly detection techniques, including clustering-, generative-, and variational inference-based methods, have been developed to address specific drawbacks and advance the state of the art. Deep learning and generative models have recently played a significant role in identifying unique challenges and devising advanced approaches. Auto-encoders (AEs) are one such powerful technique, combining generative and probabilistic variational modeling with deep architectures. An Auto-Encoder aims to learn the underlying data distribution in order to generate meaningful samples. This idea of data generation, together with the broader adoption of generative modeling, has spurred extensive research and many variations in Auto-Encoder design, particularly for unsupervised representation learning. This study systematically reviews 11 Auto-Encoder architectures, categorized into three groups, and compares their reconstruction ability, sample generation, latent-space visualization, and accuracy in classifying anomalous data on the Fashion-MNIST (FMNIST) and MNIST datasets. Additionally, we closely examine reproducibility under different training parameters: we conducted reproducibility experiments using similar model setups and hyperparameters and generated comparative results to identify the scope for improvement of each Auto-Encoder. We conclude by analyzing the experimental results, which identify the efficiency and trade-offs among Auto-Encoders and provide insight into their performance and applicability in unsupervised anomaly detection.
• Classify novel Auto-Encoder architectures and explore improvement opportunities.
• In-depth analysis of generative and non-generative Auto-Encoder models.
• Ranking of Auto-Encoder models based on F1-score and ROC analysis.
• Address the lack of reproducibility and issues with tunable parameters across different architectures.
• Detailed efficiency and trade-off analysis for advancing anomaly detection with Auto-Encoders.
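For readers unfamiliar with the core mechanism the reviewed architectures share, the sketch below shows the simplest form of reconstruction-based anomaly scoring: train an auto-encoder on normal data only and flag inputs whose reconstruction error is unusually high. The layer sizes, training loop, and threshold rule are illustrative assumptions, not the configurations benchmarked in this study.

```python
# Minimal reconstruction-error anomaly scoring with a vanilla auto-encoder.
# Illustrative only: architecture, training loop, and threshold are assumptions,
# not the setups compared in the review.
import torch
import torch.nn as nn

class VanillaAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_scores(model, x):
    """Per-sample mean squared reconstruction error; higher = more anomalous."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

# Train on "normal" images only, then flag test images whose reconstruction
# error exceeds a high percentile of the training errors.
model = VanillaAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x_train = torch.rand(256, 784)  # stand-in for flattened MNIST/FMNIST images
for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x_train), x_train)
    loss.backward()
    optimizer.step()

threshold = anomaly_scores(model, x_train).quantile(0.95)
is_anomaly = anomaly_scores(model, torch.rand(16, 784)) > threshold
```

Generative variants (e.g., variational auto-encoders) replace the plain reconstruction loss with a probabilistic objective, but the scoring pattern above is the common baseline against which they are compared.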
An unprecedented amount of SARS-CoV-2 sequencing has been performed; however, novel bioinformatic tools to cope with and process these large datasets are needed. Here, we have devised a bioinformatic pipeline that inputs SARS-CoV-2 genome sequencing in FASTA/FASTQ format and outputs a single Variant Calling Format file that can be processed to obtain variant annotations and perform downstream population genetic testing. As proof of concept, we have analyzed over 229,000 SARS-CoV-2 viral sequences up until November 30, 2020. We have identified over 39,000 variants worldwide, with increased polymorphisms spanning the ORF3a gene as well as the 3′ untranslated regions (UTRs), specifically in the conserved stem loop region of SARS-CoV-2, which is accumulating greater observed viral diversity relative to chance variation. Our analysis pipeline has also discovered the existence of SARS-CoV-2 hypermutation at low frequency (in less than 2% of genomes), likely arising through host immune responses and not due to sequencing errors. Among annotated nonsense variants with a population frequency over 1%, recurrent inactivation of the ORF8 gene was found; this inactivation is present in the newly identified B.1.1.7 SARS-CoV-2 lineage that originated in the United Kingdom. Almost all VOC-containing genomes possess one stop codon in the ORF8 gene (Q27*); however, 13% of these genomes also contain another stop codon (K68*), suggesting that ORF8 loss does not interfere with SARS-CoV-2 spread and may play a role in its increased virulence. We have developed this computational pipeline to assist researchers in the rapid analysis and characterization of SARS-CoV-2 variation.
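As an illustration of the kind of downstream processing the pipeline's VCF output enables, the sketch below scans a VCF for annotated stop-gain (nonsense) variants above a 1% population frequency. The INFO keys used (AF for allele frequency, ANN for functional annotation) are assumptions about the annotation step, not documented fields of this pipeline.

```python
# Sketch: scan a VCF for annotated stop-gain (nonsense) variants above 1% frequency.
# The INFO keys AF and ANN are assumptions, not guaranteed output of this pipeline.
def parse_info(info_field):
    out = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        out[key] = value
    return out

def common_nonsense_variants(vcf_path, min_freq=0.01):
    hits = []
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):          # skip header lines
                continue
            chrom, pos, _, ref, alt, _, _, info = line.rstrip("\n").split("\t")[:8]
            fields = parse_info(info)
            freq = float(fields.get("AF", 0) or 0)
            if "stop_gained" in fields.get("ANN", "") and freq >= min_freq:
                hits.append((chrom, int(pos), ref, alt, freq))
    return hits

# Example: a variant like the ORF8 Q27* stop codon would appear here if annotated.
# print(common_nonsense_variants("sars_cov_2_variants.vcf"))
```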
In observational studies, type-2 diabetes (T2D) is associated with an increased risk of coronary heart disease (CHD), yet interventional trials have shown no clear effect of glucose-lowering on CHD. Confounding may have therefore influenced these observational estimates. Here we use Mendelian randomization to obtain unconfounded estimates of the influence of T2D and fasting glucose (FG) on CHD risk. Using multiple genetic variants associated with T2D and FG, we find that risk of T2D increases CHD risk (odds ratio (OR)=1.11 (1.05-1.17), per unit increase in odds of T2D, P=8.8 × 10⁻⁵; using data from 34,840/114,981 T2D cases/controls and 63,746/130,681 CHD cases/controls). FG in non-diabetic individuals tends to increase CHD risk (OR=1.15 (1.00-1.32), per mmol/l, P=0.05; 133,010 non-diabetic individuals and 63,746/130,681 CHD cases/controls). These findings provide evidence supporting a causal relationship between T2D and CHD and suggest that long-term trials may be required to discern the effects of T2D therapies on CHD risk.
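For context, estimates of this kind are typically obtained with the inverse-variance-weighted (IVW) Mendelian randomization estimator, which averages per-variant Wald ratios (SNP-outcome effect divided by SNP-exposure effect) weighted by their precision. The sketch below shows that calculation on placeholder effect sizes; the numbers are not the study's summary statistics.

```python
# Inverse-variance-weighted (IVW) Mendelian randomization from summary statistics.
# Effect sizes below are placeholders for illustration only.
import math

def ivw_mr(beta_exposure, beta_outcome, se_outcome):
    ratios = [bo / be for bo, be in zip(beta_outcome, beta_exposure)]
    weights = [(be / se) ** 2 for be, se in zip(beta_exposure, se_outcome)]
    estimate = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return estimate, se

# Hypothetical per-variant log-odds effects on T2D (exposure) and CHD (outcome)
beta_t2d = [0.12, 0.09, 0.15, 0.20]
beta_chd = [0.015, 0.008, 0.018, 0.022]
se_chd = [0.004, 0.005, 0.006, 0.005]

log_or, se = ivw_mr(beta_t2d, beta_chd, se_chd)
print(f"OR per unit increase in log-odds of T2D: {math.exp(log_or):.2f} "
      f"(95% CI {math.exp(log_or - 1.96 * se):.2f}-{math.exp(log_or + 1.96 * se):.2f})")
```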
In the context of survival analysis, data-driven neural network-based methods have been developed to model complex covariate effects. While these methods may provide better predictive performance than regression-based approaches, not all can model time-varying interactions and complex baseline hazards. To address this, we propose Case-Base Neural Networks (CBNNs) as a new approach that combines the case-base sampling framework with flexible neural network architectures. Using a novel sampling scheme and data augmentation to naturally account for censoring, we construct a feed-forward neural network that includes time as an input. CBNNs predict the probability of an event occurring at a given moment to estimate the full hazard function. We compare the performance of CBNNs to regression and neural network-based survival methods in a simulation and three case studies using two time-dependent metrics. First, we examine performance on a simulation involving a complex baseline hazard and time-varying interactions to assess all methods, with CBNN outperforming competitors. Then, we apply all methods to three real data applications, with CBNNs outperforming the competing models in two studies and showing similar performance in the third. Our results highlight the benefit of combining case-base sampling with deep learning to provide a simple and flexible framework for data-driven modeling of single event survival outcomes that estimates time-varying effects and a complex baseline hazard by design. An R package is available at https://github.com/Jesse-Islam/cbnn.
• Case-Base Neural Networks (CBNNs) estimate the full hazard function.
• Naturally accounts for censoring and predicts smooth-in-time risk functions.
• Uses a simple objective function, unlike competing methods.
• Models time-varying effects by design, unlike competing methods.
• CBNNs outperform the competing models in a simulation and two studies.
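A condensed sketch of the case-base idea follows: pair the observed events (the case series) with person-moments sampled across follow-up (the base series), then train a classifier that takes both covariates and time as input. The sampling ratio, network size, and offset handling are simplified here relative to the cbnn package, and the data are synthetic.

```python
# Simplified case-base sampling for a neural survival model (not the cbnn package).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))                  # covariates
follow_up = rng.exponential(5.0, size=n)     # observed follow-up times
event = rng.integers(0, 2, size=n)           # 1 = event, 0 = censored

# Case series: (covariates, event time) for individuals who had the event.
cases = np.column_stack([x[event == 1], follow_up[event == 1]])

# Base series: person-moments sampled proportionally to each individual's
# follow-up time, with a uniform moment drawn within that follow-up.
m = 5 * len(cases)                           # ~5 base moments per case (tuning choice)
idx = rng.choice(n, size=m, p=follow_up / follow_up.sum())
base = np.column_stack([x[idx], rng.uniform(0, follow_up[idx])])

features = np.vstack([cases, base])
labels = np.concatenate([np.ones(len(cases)), np.zeros(len(base))])

# A feed-forward network that takes time as an input can represent a flexible
# baseline hazard and time-varying effects; its output, combined with a sampling
# offset, maps back to the hazard scale.
clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
clf.fit(features, labels)
hazard_like = clf.predict_proba(np.array([[0.1, -0.3, 2.0]]))[:, 1]
```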
Clustering techniques are used to group observations and discover interesting patterns within data. Model-based clustering is one such method that is often an attractive choice due to the specification of a generative model for the given data and the ability to calculate model-selection criteria, which are in turn used to select the number of clusters. However, when only distances between observations are available, model-based clustering can no longer be used, and heuristic algorithms without the aforementioned advantages are usually used instead. As a solution, Oh and Raftery (2007) suggest a Bayesian model-based clustering method (named BMCD) that only requires a dissimilarity matrix as input, while also accounting for the measurement error that may be present within the observed data. In this paper, we extend the BMCD framework by proposing several additional models, alternative model selection criteria, and strategies for reducing the computing time of the algorithm. These extensions ensure that the algorithm is effective even in high-dimensional spaces and provide a wide range of choices to the practitioner that can be used with a variety of data. Additionally, a publicly available software implementation of the algorithm is provided as a package in the R programming language.
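As a conceptual stand-in only (not the BMCD sampler, which is Bayesian and models measurement error directly), the sketch below shows one simple way to obtain a model-based clustering from a dissimilarity matrix: embed the distances with multidimensional scaling, fit Gaussian mixtures, and choose the number of clusters with an information criterion. It conveys the "generative model plus model-selection criterion" idea from the abstract, nothing more.

```python
# Conceptual stand-in for clustering from a dissimilarity matrix (NOT BMCD).
import numpy as np
from sklearn.manifold import MDS
from sklearn.mixture import GaussianMixture

def cluster_from_dissimilarity(D, max_clusters=6, n_dims=2, seed=0):
    # Metric multidimensional scaling on the precomputed dissimilarity matrix.
    coords = MDS(n_components=n_dims, dissimilarity="precomputed",
                 random_state=seed).fit_transform(D)
    # Fit Gaussian mixtures for a range of cluster counts and keep the best BIC.
    fits = [GaussianMixture(n_components=k, random_state=seed).fit(coords)
            for k in range(1, max_clusters + 1)]
    best = min(fits, key=lambda g: g.bic(coords))   # lower BIC is better
    return best.predict(coords), best.n_components
```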
Smoking is a risk factor for many chronic diseases. Multiple smoking status ascertainment algorithms have been developed for population-based electronic health databases such as administrative databases and electronic medical records (EMRs). Evidence syntheses of algorithm validation studies have often focused on chronic diseases rather than risk factors. We conducted a systematic review and meta-analysis of smoking status ascertainment algorithms to describe the characteristics and validity of these algorithms.
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines were followed. We searched articles published from 1990 to 2022 in EMBASE, MEDLINE, Scopus, and Web of Science with key terms such as validity, administrative data, electronic health records, smoking, and tobacco use. The extracted information, including article characteristics, algorithm characteristics, and validity measures, was descriptively analyzed. Sources of heterogeneity in validity measures were estimated using a meta-regression model. Risk of bias (ROB) in the reviewed articles was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 tool.
The initial search yielded 2086 articles; 57 were selected for review and 116 algorithms were identified. Almost three-quarters (71.6%) of the algorithms were based on EMR data. The algorithms were primarily constructed using diagnosis codes for smoking-related conditions, although prescription medication codes for smoking treatments were also adopted. About half of the algorithms were developed using machine-learning models. The pooled estimates of positive predictive value, sensitivity, and specificity were 0.843, 0.672, and 0.918, respectively. Algorithm sensitivity and specificity were highly variable, ranging from 3% to 100% and from 36% to 100%, respectively. Model-based algorithms had significantly greater sensitivity (p = 0.006) than rule-based algorithms. Algorithms for EMR data had higher sensitivity than algorithms for administrative data (p = 0.001). The ROB was low in most of the articles (76.3%) that underwent the assessment.
Multiple algorithms using different data sources and methods have been proposed to ascertain smoking status in electronic health data. Many algorithms had low sensitivity and positive predictive value, but the data source influenced their validity. Algorithms based on machine-learning models for multiple linked data sources have improved validity.
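The validity measures pooled in this review are computed from a two-by-two table of algorithm-ascertained versus reference-standard smoking status; the short sketch below spells out those definitions on made-up counts (not data from any reviewed study).

```python
# Validity measures for a smoking-status algorithm against a reference standard.
# Counts are invented for illustration only.
def validity_measures(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),   # true smokers the algorithm detects
        "specificity": tn / (tn + fp),   # true non-smokers the algorithm clears
        "ppv":         tp / (tp + fp),   # flagged smokers who truly smoke
        "npv":         tn / (tn + fn),
    }

print(validity_measures(tp=420, fp=80, fn=210, tn=1290))
```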
The genomics era has led to an increase in the dimensionality of data collected in the investigation of biological questions. In this context, dimension-reduction techniques can be used to summarise high-dimensional signals into low-dimensional ones, to further test for association with one or more covariates of interest. This paper revisits one such approach, previously known as principal component of heritability and renamed here as principal component of explained variance (PCEV). As its name suggests, the PCEV seeks a linear combination of outcomes in an optimal manner, by maximising the proportion of variance explained by one or several covariates of interest. By construction, this method optimises power; however, due to its computational complexity, it has unfortunately received little attention in the past. Here, we propose a general analytical PCEV framework that builds on the assets of the original method, namely conceptual simplicity and freedom from tuning parameters. Moreover, our framework extends the range of applications of the original procedure by providing a computationally simple strategy for high-dimensional outcomes, along with exact and asymptotic testing procedures that drastically reduce its computational cost. We investigate the merits of the PCEV using an extensive set of simulations. Furthermore, the use of the PCEV approach is illustrated using three examples taken from the fields of epigenetics and brain imaging.
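The PCEV criterion itself is compact: find the loading vector w that maximizes the proportion of outcome variance explained by the covariates, which reduces to the leading generalized eigenvector of the model and residual covariance components. The sketch below mirrors that idea on simulated data; it is not the package's exact estimator and the data are synthetic.

```python
# Sketch of the PCEV criterion: maximize
#   h2(w) = w' V_model w / (w' V_model w + w' V_residual w)
# via the leading generalized eigenvector of (V_model, V_residual).
import numpy as np
from scipy.linalg import eigh, lstsq

rng = np.random.default_rng(1)
n, p, q = 200, 10, 2
X = rng.normal(size=(n, q))                       # covariates of interest
Y = X @ rng.normal(size=(q, p)) * 0.5 + rng.normal(size=(n, p))   # outcomes

# Regress Y on X (with intercept) and split variance into model and residual parts.
design = np.column_stack([np.ones(n), X])
coef, *_ = lstsq(design, Y)
fitted = design @ coef
V_model = np.cov(fitted, rowvar=False)
V_resid = np.cov(Y - fitted, rowvar=False)

# Leading generalized eigenvector of V_model w = lambda * V_resid w.
eigvals, eigvecs = eigh(V_model, V_resid)
w = eigvecs[:, -1]                                # PCEV loadings
explained = eigvals[-1] / (1 + eigvals[-1])       # proportion of variance explained
pcev_component = Y @ w                            # the summarised outcome
```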
Recent technological advances in many domains, including both genomics and brain imaging, have led to an abundance of high-dimensional and correlated data being routinely collected. A widespread analytical goal in these fields is to investigate the relationships between, on the one hand, a group of genomic markers or anatomical brain measurements and, on the other hand, a set of clinical variables or phenotypes. To leverage the correlation within each set of measurements, and to improve the interpretability of a measure of the association, one can use dimension reduction techniques: one or both groups of variables can be summarised by a small set of latent features that capture the structure of interest and express association through an appropriately chosen statistic. However, the high dimensionality of contemporary datasets brings many computational and theoretical challenges, and most classical multivariate methods cannot be used directly.

This thesis comprises three manuscripts that investigate the issues related to measuring association in high-dimensional datasets. In the first manuscript, I explore the optimality properties of a dimension reduction method known as Principal Component of Explained Variance (PCEV). This method seeks a linear combination of the outcome variables that maximises the proportion of variance explained by a set of covariates of interest. I then explain how PCEV can be extended to a computationally simple and efficient estimation strategy for high-dimensional outcomes (p > n) that relies on a "block-independence" assumption.

In the second manuscript, I study the problem of inference with high-dimensional datasets: given two datasets Y and X, with one or both being high-dimensional, how can we perform a test of association in a computationally efficient way? Specifically, I look at the set of multivariate methods that can be described as a double Wishart problem; PCEV, Canonical Correlation Analysis (CCA), and Multivariate Analysis of Variance (MANOVA) are all examples of double Wishart problems. I show that valid high-dimensional p-values can be derived using an empirical estimator of the null distribution. This is achieved by performing a small number of permutations, and then fitting a location-scale family of the Tracy-Widom distribution of order 1 to the test statistics computed from the permuted data.

Finally, in the third manuscript, I apply the concepts developed in the two other manuscripts to a data analysis of targeted custom capture bisulfite methylation data. I show how PCEV can be used in conjunction with the ideas in the second manuscript to test for a region-level association between the methylation levels of CpG dinucleotides and levels of anti-citrullinated protein antibody (ACPA), an antigen thought to be a predictor of rheumatoid arthritis onset. In this study, the CpG dinucleotides are naturally grouped by design, and several of these groups contain a number of methylation measurements that is larger than the sample size.
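The calibration step from the second manuscript can be summarised as a small permutation experiment followed by a moment-matched location-scale fit of the Tracy-Widom (order 1) distribution. The sketch below assumes two placeholder callables, double_wishart_statistic (the largest-root statistic of the chosen double Wishart problem, e.g. PCEV, CCA, or MANOVA) and tw1_cdf (a Tracy-Widom CDF implementation); neither is defined here, and the moment constants are the known mean and variance of the Tracy-Widom(1) law.

```python
# Permutation-calibrated Tracy-Widom p-value: fit a location-scale TW1 family
# to a small number of permutation statistics, then evaluate the observed statistic.
# `double_wishart_statistic` and `tw1_cdf` are hypothetical placeholders.
import numpy as np

TW1_MEAN = -1.2065335746      # mean of the Tracy-Widom(1) distribution
TW1_SD = 1.607781034 ** 0.5   # its standard deviation

def permutation_tw_pvalue(Y, X, double_wishart_statistic, tw1_cdf, n_perm=50, seed=0):
    rng = np.random.default_rng(seed)
    observed = double_wishart_statistic(Y, X)
    null_stats = np.array([
        double_wishart_statistic(Y[rng.permutation(len(Y))], X)
        for _ in range(n_perm)
    ])
    # Moment-match the permuted statistics to a location-scale TW1 family.
    scale = null_stats.std(ddof=1) / TW1_SD
    loc = null_stats.mean() - scale * TW1_MEAN
    return 1.0 - tw1_cdf((observed - loc) / scale)
```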
Guaranteed lifetime withdrawal benefits (GLWBs) have been analyzed extensively in the literature because of their financial risk, but few articles have so far addressed the option granted to the policyholder regarding the timing of withdrawal and its impact on the profitability of the product for the insurer.

This article extends the analysis carried out by Huang et al., 2014 in their article on the optimal withdrawal timing for a product with guaranteed lifetime withdrawal benefits. First, we add an additional dimension to the analysis to account for the distribution of an insurer's losses according to the withdrawal age chosen by the policyholder. Next, we develop a novel analytical framework to determine numerically the extent to which an insurer should modify its fee schedule when it expects a policyholder to choose a withdrawal time that maximizes the value of their contract. We show that the fair fee level is a function of the insured's age at issue. This observation runs counter to current practice and fee structures in the Canadian industry, where insurers charge a uniform fee level regardless of the insured's age at issue.