The evaluation of unsupervised outlier detection algorithms is a constant challenge in data mining research. Little is known regarding the strengths and weaknesses of different standard outlier detection models, and the impact of parameter choices for these algorithms. The scarcity of appropriate benchmark datasets with ground truth annotation is a significant impediment to the evaluation of outlier methods. Even when labeled datasets are available, their suitability for the outlier detection task is typically unknown. Furthermore, the biases of commonly-used evaluation measures are not fully understood. It is thus difficult to ascertain the extent to which newly-proposed outlier detection methods improve over established methods. In this paper, we perform an extensive experimental study on the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose. Based on the overall performance of the outlier detection methods, we provide a characterization of the datasets themselves, and discuss their suitability as outlier detection benchmark sets. We also examine the most commonly-used measures for comparing the performance of different methods, and suggest adaptations that are more suitable for the evaluation of outlier detection results.
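To make the setting concrete, the following is a minimal sketch of one neighborhood-based outlier score and one common evaluation measure. It assumes the simplest variant, in which each point is scored by the distance to its k-th nearest neighbor, and it evaluates the ranking with ROC AUC. The function names `knn_outlier_scores` and `roc_auc` and the toy data are illustrative assumptions and are not taken from the paper.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor (itself excluded)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def roc_auc(scores, labels):
    """Fraction of (outlier, inlier) pairs in which the outlier scores higher; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: a tight 3x3 grid of inliers plus one distant labeled outlier.
points = [(x * 0.1, y * 0.1) for x in range(3) for y in range(3)] + [(5.0, 5.0)]
labels = [0] * 9 + [1]

scores = knn_outlier_scores(points, k=2)
auc = roc_auc(scores, labels)
```

The brute-force neighbor search is quadratic in the number of points, which is acceptable for a sketch but not for benchmark-sized datasets.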
This paper introduces a data structure for k-NN search, the Rank Cover Tree (RCT), whose pruning tests rely solely on the comparison of similarity values; other properties of the underlying space, such as the triangle inequality, are not employed. Objects are selected according to their ranks with respect to the query object, allowing much tighter control on the overall execution costs. A formal theoretical analysis shows that with very high probability, the RCT returns a correct query result in time that depends very competitively on a measure of the intrinsic dimensionality of the data set. The experimental results for the RCT show that non-metric pruning strategies for similarity search can be practical even when the representational dimension of the data is extremely high. They also show that the RCT is capable of meeting or exceeding the level of performance of state-of-the-art methods that make use of metric pruning or other selection tests involving numerical constraints on distance values.
Properties of data distributions can be assessed at both global and local scales. At a highly localized scale, a fundamental measure is the local intrinsic dimensionality (LID), which assesses growth rates of the cumulative distribution function within a restricted neighborhood and characterizes properties of the geometry of a local neighborhood. In this paper, we explore the connection of LID to other well-known measures for complexity assessment and comparison, namely, entropy and statistical distances or divergences. In an asymptotic context, we develop new analytical expressions for these quantities in terms of LID. This reveals the fundamental nature of LID as a building block for characterizing and comparing data distributions, opening the door to new methods for distributional analysis at a local scale.
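For reference, the growth-rate characterization mentioned above is usually written as follows in the LID literature. The abstract itself does not state the formula, so the notation here is an assumption: F denotes the cumulative distribution function of distances from a reference point, taken to be positive and continuously differentiable at distance r > 0.

```latex
\mathrm{ID}_F(r)
  \;=\; \lim_{\epsilon \to 0^{+}}
        \frac{\ln\bigl( F((1+\epsilon)\,r) / F(r) \bigr)}{\ln(1+\epsilon)}
  \;=\; \frac{r \, F'(r)}{F(r)},
\qquad
\mathrm{LID}_F \;=\; \lim_{r \to 0^{+}} \mathrm{ID}_F(r).
```

As a quick check, if F(r) is proportional to r^m near zero, as for a locally uniform distribution in m dimensions, then r F'(r) / F(r) = m.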
Machine learning systems are vulnerable to adversarial attack. By applying to the input object a small, carefully-designed perturbation, a classifier can be tricked into making an incorrect prediction. This phenomenon has drawn wide interest, with many attempts made to explain it. However, a complete understanding is yet to emerge. In this paper we adopt a slightly different perspective, still relevant to classification. We consider retrieval, where the output is a set of objects most similar to a user-supplied query object, corresponding to the set of k-nearest neighbors. We investigate the effect of adversarial perturbation on the ranking of objects with respect to a query. Through theoretical analysis, supported by experiments, we demonstrate that as the intrinsic dimensionality of the data domain rises, the amount of perturbation required to subvert neighborhood rankings diminishes, and the vulnerability to adversarial attack rises. We examine two modes of perturbation of the query: either 'closer' to the target point, or 'farther' from it. We also consider two perspectives: 'query-centric', examining the effect of perturbation on the query's own neighborhood ranking, and 'target-centric', considering the ranking of the query point in the target's neighborhood set. All four cases correspond to practical scenarios involving classification and retrieval.
Preterm and/or low birthweight (PT/LBW) is predictive of a range of adverse adult outcomes, including lower employment, educational attainment, and mental wellbeing, and higher welfare receipt. However, existing studies on PT/LBW and adult psychosocial risks are often limited by low statistical power. Studies also fail to examine potential child or adolescent pathways leading to later adult adversity. To address these limitations, we use a life course framework to examine how adolescent problem behaviors may moderate the association between PT/LBW and a multidimensional measure of life success at age 30.
We analyze 2044 respondents from a Brisbane, Australia cohort followed from birth in 1981–1984 through age 30. We examine moderation patterns using obstetric birth outcomes for weight and gestation, measures of problem behaviors from the Child Behavior Checklist (CBCL) at age 14, and measures of educational attainment and life success at age 30, using multivariable normal and ordered logistic regression.
Associations between PT/LBW and life success were found to be moderated by adolescent problem behaviors in six scales, including CBCL internalizing, externalizing, and total problems (all p < 0.01). In comparison, associations between LBW and educational attainment illustrate how a single-dimensional measure may yield null results.
For PT/LBW, adolescent problem behaviors increase risk of lower life success at age 30. Compared to analysis of singular outcomes, the incorporation of multidimensional measures of adult wellbeing, paired with identification of risk and protective factors for adult life success as children develop over the lifespan, may further advance existing research and interventions for PT/LBW children.
•Preterm and/or low birth weight (PT/LBW) predicts lower adult psychosocial outcomes.
•Potential underlying child/adolescent pathways are largely unstudied in research.
•Using a life course framework, we examine if adolescent behaviors may play a role.
•Adolescent behaviors moderate PT/LBW and overall life success at age 30.
•Results suggest that exploring such potential pathways may benefit future research.
This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl’s expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation, the method of moments, probability weighted moments, and regularly varying functions. An experimental evaluation is also provided, using both real and artificial data.
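As an illustration of the maximum likelihood approach, the sketch below implements the Hill-type form in which that estimator is commonly written: the negative reciprocal of the mean log-ratio between each neighbor distance and the largest one. The function name `lid_mle` and the synthetic distances are assumptions made for this example; the abstract does not give the formula, and the other listed estimators are not shown.

```python
import math

def lid_mle(distances):
    """Hill-type MLE of local ID from the k smallest positive distances to a query point."""
    r = sorted(distances)
    r_max = r[-1]  # distance to the k-th (farthest) neighbor in the sample
    mean_log_ratio = sum(math.log(d / r_max) for d in r) / len(r)
    # Undefined (division by zero) if all distances are equal.
    return -1.0 / mean_log_ratio

# Hypothetical check: idealized distances whose distribution grows like r**d_true,
# placing the i-th of k neighbors at the (i/k)-quantile.
d_true, k = 5, 1000
synthetic = [(i / k) ** (1.0 / d_true) for i in range(1, k + 1)]
estimate = lid_mle(synthetic)
```

On these idealized distances the estimate lands close to `d_true`; on real data the choice of k trades bias against variance.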
The Agouti-like peptides, including AgRP, ASIP and the teleost-specific A2 (ASIP2 and AgRP2) peptides, have potent and diverse functional roles in feeding, pigmentation and background adaptation mechanisms. There are contradictory theories about the evolution of the Agouti-like peptide family as well as its nomenclature. Here we performed comprehensive mining and annotation of vertebrate Agouti-like sequences. We identified A2 sequences from salmon, trout, seabass, cod, cichlid, tilapia, gilt-headed sea bream, Antarctic toothfish, rainbow smelt, common carp, channel catfish and, interestingly, also in lobe-finned fish. Moreover, we surprisingly found eight novel homologues from arthropods and three from fungi, some sharing the characteristic C-x(6)-C-C motif that is present in the Agouti-like sequences, as well as approximate sequence length (130 amino acids), positioning of the motif sequence, and exon-intron structures similar to those of the other Agouti-like peptides, providing further support for the common origin of these sequences. Phylogenetic analysis shows that the AgRP sequences cluster basally in the tree, suggesting that these sequences split from a cluster containing both the ASIP and the A2 sequences. We also used a novel approach to determine the statistical evidence for synteny, a sinusoidal Hough transform pattern recognition technique. Our analysis shows that the teleost AgRP2 resides in a chromosomal region that has synteny with Hsa 8, but we found no convincing synteny among the regions in which A2, AgRP and ASIP reside; such synteny would have supported the hypothesis that the Agouti-like peptides were formed by whole-genome tetraploidization events.
Here we suggest that the Agouti-like peptide genes were formed through classical subsequent gene duplications where the AgRP is the most distantly related to the three other members of that group, first splitting from a common ancestor to ASIP and A2, and then later the A2 split from ASIP followed by a split resulting in ASIP2 and AgRP2.