BACKGROUND: One of the common problems in labeling medical images is inter-observer variability: the same image can be labeled differently by different doctors. The main reasons are the human factor, differences in experience and qualifications, different radiology schools, poor image quality, and unclear instructions. The influence of some factors can be reduced by proper organization of the annotation process; however, the opinions of doctors frequently differ.
AIM: The study aimed to test whether a neural network with an additional module can learn the style and labeling features of different radiologists and whether such modeling can improve the final metrics of object detection on radiological images.
METHODS: For training artificial intelligence systems in radiology, cross-labeling, i.e., annotation of the same image by several doctors, is frequently used. The simplest approach is to use the labeling from each doctor as an independent example when training the model. Some methods apply rules or algorithms to combine annotations before training. Finally, Guan et al. use separate classification heads to model the labeling styles of different doctors. Unfortunately, this method is not suitable for more complex tasks, such as detecting objects in an image. For this analysis, a machine learning model designed to detect objects of different classes on mammographic scans was used. This model is a neural network based on the Deformable DETR architecture. A dataset consisting of 7,756 mammographic breast scans and 12,543 unique annotations from 19 doctors was used to train the neural network. For validation and testing, datasets consisting of 700 and 300 BI-RADS-labeled scans, respectively, were taken. In all datasets, the proportion of images with pathology was in the 15%-20% range. A unique index was assigned to each of the 19 doctors, and at each iteration of neural network training a special module looked up the vector corresponding to this index. The vector was expanded to the size of the feature map at each level of the feature pyramid and then concatenated to the maps as separate channels. Thus, the encoder and the decoder of the detector had access to information about which doctor labeled the scan. The vectors were updated by back-propagation. Three methods were chosen for comparison:
Basic model: Combining labels by different doctors using the voting method.
New stylistic module: For predictions on the test dataset, the single doctor's index that showed the best metrics on the validation dataset was used.
New stylistic module: The indexes of the five doctors with the best metrics on the validation dataset were used for predictions on the test dataset. Weighted Boxes Fusion was chosen to combine the predictions.
The area under the receiver operating characteristic curve (ROC-AUC) was used as the primary metric on the test dataset (BI-RADS 3, 4, and 5 categories were considered pathology). The sum of the maximum probabilities of detected malignant objects (malignant masses and calcifications) over the cranio-caudal and medio-lateral oblique projections was taken as the probability of malignancy for each method.
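The conditioning mechanism described in the methods can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the names (`condition_feature_map`, `EMBED_DIM`), the array sizes, and the use of NumPy in place of a deep learning framework are all assumptions.

```python
import numpy as np

# Sketch of the annotator-conditioning idea: each of the 19 doctors gets an
# embedding vector; at each level of the feature pyramid the vector is
# broadcast spatially and concatenated to the feature map as extra channels.
NUM_DOCTORS, EMBED_DIM = 19, 8

rng = np.random.default_rng(0)
doctor_embeddings = rng.normal(size=(NUM_DOCTORS, EMBED_DIM))  # learnable table

def condition_feature_map(feat, doctor_idx):
    """Append the doctor's embedding to `feat` (C, H, W) as EMBED_DIM channels."""
    c, h, w = feat.shape
    vec = doctor_embeddings[doctor_idx]                       # (EMBED_DIM,)
    planes = np.broadcast_to(vec[:, None, None], (EMBED_DIM, h, w))
    return np.concatenate([feat, planes], axis=0)             # (C + EMBED_DIM, H, W)

feat = rng.normal(size=(256, 32, 32))      # one hypothetical pyramid level
out = condition_feature_map(feat, doctor_idx=4)
print(out.shape)  # (264, 32, 32)
```

In the actual model the embedding table would be a learnable parameter updated by back-propagation together with the detector weights, so the network can absorb each doctor's labeling style into the corresponding vector.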
RESULTS: The following ROC-AUC metrics were obtained for the three methods: 0.82, 0.87, and 0.89.
CONCLUSIONS: Information about the labeling doctor allows the neural network to learn and model the labeling styles of different doctors more effectively. In addition, this method may provide an estimate of the uncertainty of the network's prediction: when the embeddings of different doctors lead to different predictions, the case may be difficult for an artificial intelligence system to process.
•A feasible multistage intelligent data annotation strategy (MIDA) was proposed.
•The MIDA provided accurate and efficient visual annotation of complex samples.
•Molecular networking improved the cluster advantage of diagnostic ions.
•A total of 279 components were identified from Fufang Yinhua Jiedu granules (FYJG).
•Novel structures of 12 indole alkaloids and 4 organic acids were speculated.
Fufang Yinhua Jiedu granules (FYJG) is a Traditional Chinese Medicine (TCM) compound formula preparation comprising ten herbal drugs, which has been widely used for the treatment of wind-heat type influenza and upper respiratory tract infections. However, the phytochemical constituents of FYJG have rarely been reported, and its constituent composition still needs to be elucidated. The complexity of the natural ingredients of TCMs and the diversity of preparations are the major obstacles to fully characterizing their constituents. In this study, an innovative and intelligent analysis strategy was built to comprehensively characterize the constituents of FYJG and assign source attribution to all components. Firstly, a simple and highly efficient ultra-high-performance liquid chromatography coupled to quadrupole time-of-flight mass spectrometry (UHPLC-QTOF MSE) method was established to analyze FYJG and the ten single herbs. High-accuracy MS/MS data were acquired under two collision energies using high-definition MSE in the negative and positive modes. Secondly, a multistage intelligent data annotation strategy, integrated with various online software and data processing platforms, was developed and used to rapidly screen out and identify the compounds of FYJG. An in-house chemical library of 2,949 compounds was created and operated in the UNIFI software to enable automatic peak annotation of the MSE data. Then, the acquired MS data were processed by MS-DIAL, and feature-based molecular networking (FBMN) was constructed on the Global Natural Product Social Molecular Networking (GNPS) platform to infer potential compositions of FYJG through rapid classification and visualization. The MZmine software was simultaneously used to recognize the source attribution of ingredients. On this basis, the unique chemical categories and characteristics of the herbaceous plant species were further utilized to verify the accuracy of the source attribution of the multiple components.
This comprehensive analysis successfully identified or tentatively characterized 279 compounds in FYJG, including flavonoids, phenolic acids, coumarins, saponins, alkaloids, lignans, and phenylethanoids. Notably, twelve indole alkaloids and four organic acids from Isatidis Folium were characterized in this formula for the first time. This study demonstrates the potential superiority of identifying compounds in complex TCM formulas using high-definition MSE together with computer software-assisted structural analysis tools, which can obtain high-quality MS/MS spectra, effectively distinguish isomers, and improve the coverage of trace components. This study elucidates the various components and sources of FYJG and provides a theoretical basis for its further clinical development and application.
The extracellular matrix (ECM) is a complex meshwork of proteins that forms the scaffold of all tissues in multicellular organisms. It plays critical roles in all aspects of life: from orchestrating cell migration during development, to supporting tissue repair. It also plays critical roles in the etiology or progression of diseases. To study this compartment, we defined the compendium of all genes encoding ECM and ECM-associated proteins for multiple organisms. We termed this compendium the "matrisome" and further classified matrisome components into different structural or functional categories. This nomenclature is now largely adopted by the research community to annotate -omics datasets and has contributed to advance both fundamental and translational ECM research. Here, we report the development of Matrisome AnalyzeR, a suite of tools including a web-based application (https://sites.google.com/uic.edu/matrisome/tools/matrisome-analyzer) and an R package (https://github.com/Matrisome/MatrisomeAnalyzeR). The web application can be used by anyone interested in annotating, classifying, and tabulating matrisome molecules in large datasets without requiring programming knowledge. The companion R package is available to more experienced users, interested in processing larger datasets or in additional data visualization options.
Funding agencies play a pivotal role in bolstering research endeavors by allocating financial resources for data collection and analysis. However, the lack of detailed information regarding the methods employed for data gathering and analysis can obstruct the replication and utilization of the results, ultimately affecting the study’s transparency and integrity. The task of manually annotating extensive datasets demands considerable labor and financial investment, especially when it entails engaging specialized individuals. In our crowd counting study, we employed the web-based annotation tool SuperAnnotate to streamline the human annotation process for a dataset comprising 3,000 images. By integrating automated annotation tools, we realized substantial time efficiencies, as demonstrated by the remarkable achievement of 858,958 annotations. This underscores the significant contribution of such technologies to the efficiency of the annotation process.
The Remote Sensing (RS) field has an increasing research interest in using deep learning (DL) models to recognize various kinds of RS data, leading to a great demand for training data annotation. Due to the high cost of expertise, employing non-experts to label data has become an important way to improve labeling efficiency. Commonly, a single data sample is labeled by multiple annotators, and the most voted label is accepted to ensure accuracy. In the RS context, however, this widely adopted strategy can lose effectiveness. RS data usually involves a considerable number of classes on account of the complexity of surface environments, which makes it prone to inter-class similarity that is difficult to distinguish. Annotators without expertise are likely to make mistakes on these indistinguishable classes, thus producing erroneous voted labels. Although the classification characteristics of RS data have been widely documented, non-expert annotators are unfamiliar with this expertise, and it is difficult to require them to master specialized labeling skills. To address these issues, this paper bases multi-annotator label selection on an investigation of each annotator's own ability to distinguish similar classes of images. A quality evaluation process is designed that weights labels from capable annotators higher than those from weak ones. Through a multi-round quality evaluation algorithm, correct labels can out-compete wrong ones even when disadvantaged in numbers. Experimental results demonstrate the advantage of the proposed method on RS datasets.
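The weighting idea can be illustrated with a minimal sketch. The function name, the quality scores, and the class names below are hypothetical; the paper's actual multi-round algorithm estimates annotator ability from the data rather than assuming it is known.

```python
from collections import defaultdict

# Quality-weighted voting: a correct label from one capable annotator can
# outvote the same wrong label from several weak annotators.
def weighted_vote(labels, quality):
    """labels: {annotator: label}; quality: {annotator: weight}."""
    scores = defaultdict(float)
    for annotator, label in labels.items():
        scores[label] += quality[annotator]
    return max(scores, key=scores.get)

# Two non-experts confuse similar surface classes; one capable annotator
# (weight 0.9) is correct, so the right label wins despite being outnumbered.
labels  = {"a1": "wetland", "a2": "wetland", "a3": "paddy field"}
quality = {"a1": 0.3, "a2": 0.3, "a3": 0.9}
print(weighted_vote(labels, quality))  # paddy field
```

With uniform weights this reduces to plain majority voting, which is exactly the failure mode the paper describes for indistinguishable RS classes.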
Between Subjectivity and Imposition
Miceli, Milagros; Schuessler, Martin; Yang, Tianling
Proceedings of the ACM on Human-Computer Interaction, 10/2020, Volume 4, Issue CSCW2
Journal Article, Peer reviewed
The interpretation of data is fundamental to machine learning. This paper investigates practices of image data annotation as performed in industrial contexts. We define data annotation as a sense-making practice, where annotators assign meaning to data through the use of labels. Previous human-centered investigations have largely focused on annotators' subjectivity as a major cause of biased labels. We propose a wider view on this issue: guided by constructivist grounded theory, we conducted several weeks of fieldwork at two annotation companies. We analyzed which structures, power relations, and naturalized impositions shape the interpretation of data. Our results show that the work of annotators is profoundly informed by the interests, values, and priorities of other actors above their station. Arbitrary classifications are vertically imposed on annotators, and through them, on data. This imposition is largely naturalized. Assigning meaning to data is often presented as a technical matter. This paper shows it is, in fact, an exercise of power with multiple implications for individuals and society.
The results of high-throughput experiments consist of numerous candidate genes, proteins, or other molecules potentially associated with diseases. A challenge for omics science is knowledge extraction from the results and the filtering of promising gene or protein candidates. In particular, a hot topic in clinical scenarios is highlighting the behavior of a few molecules related to a specific disease. In this context, different computational approaches, also referred to as gene prioritization methods, make it possible to identify the genes most related to a disease among a larger set of candidate genes. The identification requires the use of domain-specific knowledge that is often encoded into ontologies.
•Pseudo-labeling as a strategy to annotate unsupervised samples from only 1% of supervised samples.
•A semi-supervised pseudo-labeling approach to reduce human effort in data annotation.
•Learning features and labels from multiple pseudo-labeling iterations.
•Exploiting 2D non-linear projections and connectivity-based information for data annotation.
•Clustering-based metric to select the optimal learned model for generalization.
The absence of large annotated datasets to train deep neural networks (DNNs) is an issue since manual annotation is time-consuming, expensive, and error-prone. Semi-supervised learning techniques can address the problem by propagating pseudo labels from supervised to unsupervised samples. However, they still require training and validation sets with many supervised samples. This work proposes a methodology, namely Deep Feature Annotation (DeepFA), that dismisses the validation set and uses very few supervised samples (e.g., 1% of the dataset). DeepFA modifies the feature spaces of a DNN along meta-pseudo-labeling iterations in a 2D non-linear projection space, using the most confidently labeled samples of an optimum-path forest semi-supervised classifier. We present a comprehensive study of DeepFA and a new variant that detects the best DNN model for generalization during the pseudo-labeling iterations. We evaluate the components of DeepFA on eight datasets, finding the best DeepFA approach and showing that it outperforms self-pseudo-labeling.
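A minimal sketch of the generic confidence-based pseudo-labeling loop follows. This is not DeepFA itself, which uses an optimum-path forest classifier on 2D projections; the function name, the nearest-neighbour confidence rule, and the tiny toy dataset are assumptions chosen for illustration.

```python
import numpy as np

# Labels propagate from the few supervised samples to their most confident
# (closest) unlabeled neighbours, which then join the labeled pool for the
# next round -- the core iteration that pseudo-labeling methods share.
def pseudo_label(X, labels, rounds=3):
    """X: (n, d) features; labels: int array with -1 for unlabeled samples."""
    labels = labels.copy()
    for _ in range(rounds):
        labeled = np.where(labels >= 0)[0]
        unlabeled = np.where(labels < 0)[0]
        if len(unlabeled) == 0:
            break
        # distance of each unlabeled sample to every labeled one
        d = np.linalg.norm(X[unlabeled, None] - X[labeled], axis=2)
        nearest = d.argmin(axis=1)
        conf = d.min(axis=1)
        # only the most confident (closest) samples receive a pseudo label
        keep = conf <= np.median(conf)
        labels[unlabeled[keep]] = labels[labeled[nearest[keep]]]
    return labels

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, -1, -1, 1, -1, -1])   # one supervised sample per cluster
print(pseudo_label(X, y))
```

DeepFA replaces the naive nearest-neighbour rule with semi-supervised optimum-path forest estimation in the projection space, and additionally retrains the DNN features between rounds.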
•Feature space projections to increase semi-supervised learning.
•Data annotation using feature space projections outperforms automatic methods.
•Combining automatic and user-driven labeling methods improves annotation and classification results.
•Confidence measures reduce human labeling effort as compared to fully-manual labeling.
Data annotation using visual inspection (supervision) of each training sample can be laborious. Interactive solutions alleviate this by helping experts propagate labels from a few supervised samples to unlabeled ones based solely on the visual analysis of their feature space projection (with no further sample supervision). We present a semi-automatic data annotation approach based on a suitable feature space projection and semi-supervised label estimation. We validate our method on the popular MNIST dataset and on images of human intestinal parasites with and without fecal impurities, a large and diverse dataset that makes classification very hard. We evaluate two approaches for semi-supervised learning, from the latent and projection spaces, to choose the one that best reduces user annotation effort and also increases classification accuracy on unseen data. Our results demonstrate the added value of visual analytics tools that combine the complementary abilities of humans and machines for more effective machine learning.