Urinary tract infection (UTI) is a common emergency department (ED) diagnosis with high reported diagnostic error rates. Because a urine culture, part of the gold standard for diagnosis of UTI, is usually not available for 24-48 hours after an ED visit, diagnosis and treatment decisions are based on symptoms, physical findings, and other laboratory results, potentially leading to overutilization, antibiotic resistance, and delayed treatment. Previous research has demonstrated inadequate diagnostic performance for both individual laboratory tests and prediction tools.
Our aim was to train, validate, and compare machine-learning-based predictive models for UTI in a large, diverse set of ED patients.
Single-center, multi-site, retrospective cohort analysis of 80,387 adult ED visits with urine culture results and UTI symptoms. We developed models for UTI prediction with six machine learning algorithms using demographic information, vitals, laboratory results, medications, past medical history, chief complaint, and structured historical and physical exam findings. Models were developed with both the full set of 211 variables and a reduced set of 10 variables. UTI predictions were compared between models and to proxies of provider judgment (documentation of UTI diagnosis and antibiotic administration).
The machine learning models had an area under the curve (AUC) ranging from 0.826 to 0.904, with extreme gradient boosting (XGBoost) the top-performing algorithm for both the full and reduced models. The XGBoost full and reduced models demonstrated greatly improved specificity relative to the provider judgment proxy of UTI diagnosis or antibiotic administration, with specificity differences of 33.3 (31.3-34.3) and 29.6 (28.5-30.6), while also demonstrating superior sensitivity relative to documentation of UTI diagnosis, with sensitivity differences of 38.7 (38.1-39.4) and 33.2 (32.5-33.9). In the admission and discharge cohorts, using the full XGBoost model, approximately 1 in 4 patients (4109/15855) would be re-categorized from a false positive to a true negative and approximately 1 in 11 patients (1372/15855) would be re-categorized from a false negative to a true positive.
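The re-categorization figures above follow directly from the quoted counts; a minimal arithmetic check (only the counts are from the abstract, the helper functions are a generic confusion-matrix sketch):

```python
# Re-categorization counts quoted in the abstract.
total = 15855        # admission + discharge cohorts
fp_to_tn = 4109      # false positives that would become true negatives
fn_to_tp = 1372      # false negatives that would become true positives

print(round(total / fp_to_tn, 1))  # 3.9, i.e. about 1 in 4 patients
print(round(total / fn_to_tp, 1))  # 11.6, i.e. about 1 in 11 patients

# Generic definitions behind the sensitivity/specificity comparisons.
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)
```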
The best-performing machine learning algorithm, XGBoost, accurately predicted positive urine culture results and outperformed previously developed models in the literature as well as several proxies for provider judgment. Future prospective validation is warranted.
Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome, either through computational predictions, such as genomic conservation, or through high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning on 22 computational and experimental annotations to infer the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu.
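The general idea of combining multiple annotations into a per-position functional probability can be illustrated with a toy Bayes-rule calculation. This is not GenoCanyon's actual model (which is described in the paper); the prior and likelihood ratios below are invented purely for illustration:

```python
# Toy illustration: combine independent annotation evidence into a
# posterior probability that a genomic position is functional.
# All numbers here are made up; GenoCanyon's real model differs.
prior_functional = 0.1
likelihood_ratios = [3.0, 2.0, 0.5]  # one per observed annotation

odds = prior_functional / (1 - prior_functional)
for lr in likelihood_ratios:
    odds *= lr                 # fold in each annotation's evidence
posterior = odds / (1 + odds)  # convert odds back to a probability
```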
Selected or multiple reaction monitoring (SRM/MRM) is a liquid chromatography (LC)/tandem mass spectrometry (MS/MS) method that enables the quantitation of specific proteins in a sample by analyzing precursor ions and the fragment ions of their selected tryptic peptides. Instrumentation software has advanced to the point that thousands of transitions (pairs of primary and secondary m/z values) can be measured in a triple quadrupole instrument coupled to an LC through well-designed scheduling and selection of m/z windows. The design of a good MRM assay relies on the availability of peptide spectra from previous discovery-phase LC-MS/MS studies. The tedious aspect of manually developing and processing MRM assays involving thousands of transitions has spurred the development of software tools to automate this process. Software packages have been developed for project management, assay development, assay validation, data export, peak integration, quality assessment, and biostatistical analysis. No single tool provides a complete end-to-end solution, so this article reviews the current state of these software tools and discusses future directions, with the aim of enabling researchers to combine them into a comprehensive targeted proteomics workflow.
Abstract
Background
Many systems biology studies leverage the integration of multiple data types (across different data sources) to offer a more comprehensive view of the biological system being studied. While SQL (Structured Query Language) databases are popular in the biomedical domain, NoSQL database technologies have been used as a more relationship-based, flexible, and scalable method of data integration.
Results
We have created a graph database integrating data from multiple sources. In addition to using a graph-based query language (Cypher) for data retrieval, we have developed a web-based dashboard that allows users to easily browse and plot data without the need to learn Cypher. We have also implemented a visual graph query interface for users to browse graph data. Finally, we have built a prototype to allow the user to query the graph database in natural language.
Conclusion
We have demonstrated the feasibility and flexibility of using a graph database for storing and querying immunological data with complex biological relationships. Querying a graph database through such relationships has the potential to discover novel relationships among heterogeneous biological data and metadata.
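For readers unfamiliar with Cypher, the graph query language mentioned above, here is a sketch of the kind of query the dashboard hides from users. The node labels, relationship types, and properties are hypothetical, since the abstract does not describe the actual schema:

```python
# Hypothetical Cypher query: count samples per subject over a minimum age.
cypher = (
    "MATCH (s:Subject)-[:HAS_SAMPLE]->(b:Sample) "
    "WHERE s.age >= $min_age "
    "RETURN s.subject_id, count(b) AS n_samples"
)
# A driver (e.g. the official Neo4j Python driver) would execute it as:
#   session.run(cypher, min_age=18)
```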
Musen MA, Bean CA, Cheung K-H, et al. The Center for Expanded Data Annotation and Retrieval. Journal of the American Medical Informatics Association (JAMIA). November 2015;22(6).
The Center for Expanded Data Annotation and Retrieval is studying the creation of comprehensive and expressive metadata for biomedical datasets to facilitate data discovery, data interpretation, and data reuse. We take advantage of emerging community-based standard templates for describing different kinds of biomedical datasets, and we investigate the use of computational techniques to help investigators to assemble templates and to fill in their values. We are creating a repository of metadata from which we plan to identify metadata patterns that will drive predictive data entry when filling in metadata templates. The metadata repository not only will capture annotations specified when experimental datasets are initially created, but also will incorporate links to the published literature, including secondary analyses and possible refinements or retractions of experimental interpretations. By working initially with the Human Immunology Project Consortium and the developers of the ImmPort data repository, we are developing and evaluating an end-to-end solution to the problems of metadata authoring and management that will generalize to other data-management environments.
Cognitive impairment is a frequent and serious problem in patients with various forms of severe mental illness (SMI), including schizophrenia (SZ) and bipolar disorder (BP). Recent research suggests genetic links to several cognitive phenotypes both in SMI and in the general population. Our goal in this study was to identify potential genomic signatures of cognitive functioning in veterans with severe mental illness and compare them to previous findings for cognition across different populations. Veterans Affairs (VA) Cooperative Studies Program (CSP) Study #572 evaluated cognitive and functional capacity measures among SZ and BP patients. In conjunction with the VA Million Veteran Program, 3,959 European American (1,095 SZ, 2,864 BP) and 2,601 African American patients were genotyped using a custom Affymetrix Axiom Biobank array. We performed a genome-wide association study of global cognitive functioning, constructed polygenic scores for SZ and cognition in the general population, and examined genetic correlations with 2,626 UK Biobank traits. Although no single locus attained genome-wide significance, observed allelic effects were strongly consistent with previous studies. We observed robust associations between global cognitive functioning and polygenic scores for cognitive performance, intelligence, and SZ risk. We also identified significant genetic correlations with several cognition-related traits in UK Biobank. In a diverse cohort of U.S. veterans with SZ or BP, we demonstrate broad overlap of common genetic effects on cognition with the general population, and find that greater polygenic loading for SZ risk is associated with poorer cognitive performance.
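The polygenic scores mentioned above are, at their core, weighted sums of allele dosages. A minimal sketch (the variant IDs and weights below are invented for illustration; real scores use thousands to millions of variants with GWAS-derived weights):

```python
# PGS = sum_i beta_i * dosage_i, where dosage is 0, 1, or 2 copies of the
# effect allele and beta comes from GWAS summary statistics.
def polygenic_score(dosages, weights):
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)

weights = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.08}  # invented
dosages = {"rs0001": 2, "rs0002": 1, "rs0003": 0}            # one subject
score = polygenic_score(dosages, weights)  # 0.12*2 - 0.05*1 + 0 = 0.19
```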
Chronic obstructive pulmonary disease (COPD) is associated with poor quality of life, hospitalization, and mortality. COPD phenotyping includes using pulmonary function tests to determine airflow obstruction from the ratio of forced expiratory volume in one second (FEV1) to forced vital capacity (FVC). FEV1 is a commonly used measure of severity but is difficult to identify in structured electronic health record (EHR) data.
Using the Microsoft SQL Server full-text search feature and string functions supporting regular-expression-like operations, we developed an automated tool to extract FEV1 values from progress notes to improve ascertainment of FEV1 in the EHR in the Veterans Aging Cohort Study (VACS).
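The extraction step described above is essentially pattern matching over note text. A minimal sketch in Python (the actual patterns used by the VACS tool are not given in the abstract; this regular expression is illustrative only):

```python
import re

# Illustrative pattern: "FEV1" (optionally "FEV-1"), an optional ":" or
# "=", a number, and a unit of liters or percent-predicted.
FEV1_RE = re.compile(r"FEV-?1\s*[:=]?\s*(\d+(?:\.\d+)?)\s*(L|%)", re.IGNORECASE)

def extract_fev1(note):
    """Return (value, unit) pairs for each FEV1 mention in the note."""
    return [(float(m.group(1)), m.group(2).upper())
            for m in FEV1_RE.finditer(note)]

extract_fev1("PFTs today: FEV1 = 1.8 L (FEV1 62% predicted)")
# [(1.8, 'L'), (62.0, '%')]
```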
The automated tool increased the number of quantifiable FEV1 values from 12,425 to 16,274 (a 24% increase in numeric FEV1 values). Using chart review as the reference, the tool's positive predictive value for identifying quantifiable FEV1 values was 99% (95% confidence interval: 98.2-100.0%) and its recall was 100%, yielding an F-measure of 0.99. The tool correctly identified FEV1 measurements in 95% of cases.
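The F-measure quoted above is the harmonic mean of precision (positive predictive value) and recall; with the reported 99% and 100% it rounds to 0.99:

```python
# F-measure (F1): harmonic mean of precision and recall.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

f_measure(0.99, 1.00)  # ≈ 0.995, reported as 0.99 after rounding
```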
A SQL-based full text search of clinical notes for quantifiable FEV1 is efficient and improves the number of values available in VA data. Future work will examine how these methods can improve phenotyping of patients with COPD in the VA.
Human immunology studies often rely on the isolation and quantification of cell populations from an input sample based on flow cytometry and related techniques. Such techniques classify cells into populations based on the detection of a pattern of markers. The description of the cell populations targeted in such experiments typically has two complementary components: a description of the cell type targeted (e.g. 'T cells') and a description of the marker pattern utilized (e.g. CD14-, CD3+).
Here we describe our attempts to use ontologies to cross-compare cell types and marker patterns (also referred to as gating definitions). We used a large set of such gating definitions and corresponding cell types, submitted by different investigators to ImmPort, a central database for immunology studies, to examine the ability to parse gating definitions using terms from the Protein Ontology (PRO) and cell type descriptions using the Cell Ontology (CL). We then used logical axioms from CL to detect discrepancies between the two.
We suggest adoption of our proposed format for describing gating and cell type definitions to make comparisons easier. We also suggest a number of new terms to describe gating definitions in flow cytometry that are not based on molecular markers captured in PRO, but on forward- and side-scatter of light during data acquisition, which is more appropriate to capture in the Ontology for Biomedical Investigations (OBI). Finally, our approach results in suggestions on what logical axioms and new cell types could be considered for addition to the Cell Ontology.
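The marker-pattern half of a gating definition is mechanical enough to parse automatically. A minimal sketch (an illustrative parser, not the one used in the study; it ignores scatter-based gates such as forward/side scatter, which the abstract notes fall outside PRO):

```python
import re

# Split a gating definition such as "CD14-, CD3+" into (marker, sign)
# pairs; each marker name could then be looked up in the Protein Ontology.
GATE_RE = re.compile(r"([A-Za-z0-9]+)\s*([+-])")

def parse_gating(definition):
    return [(m.group(1), m.group(2)) for m in GATE_RE.finditer(definition)]

parse_gating("CD14-, CD3+")  # [('CD14', '-'), ('CD3', '+')]
```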
Human cancer is caused by the accumulation of somatic mutations in tumor suppressors and oncogenes within the genome. In the case of oncogenes, recent theory suggests that there are only a few key "driver" mutations responsible for tumorigenesis. As there have been significant pharmacological successes in developing drugs that treat cancers carrying these driver mutations, several methods that rely on mutational clustering have been developed to identify them. However, these methods consider proteins as a single strand without taking their spatial structure into account. We propose an extension to current methodology that incorporates protein tertiary structure to increase power when identifying mutational clusters.
We have developed iPAC (identification of Protein Amino acid Clustering), an algorithm that identifies non-random somatic mutations in proteins while taking into account the three-dimensional protein structure. By using the tertiary information, we are able to detect novel clusters in proteins that are known to exhibit mutation clustering, as well as identify clusters in proteins without evidence of clustering based on existing methods. For example, by combining the data in the Protein Data Bank (PDB) and the Catalogue of Somatic Mutations in Cancer, our algorithm identifies new mutational clusters in well-known cancer proteins such as KRAS and PIK3CA. Further, by utilizing the tertiary structure, our algorithm also identifies clusters in EGFR, EIF2AK2, and other proteins that are not identified by current methodology. The R package is available at: http://www.bioconductor.org/packages/2.12/bioc/html/iPAC.html.
Our algorithm extends the current methodology to identify oncogenic activating driver mutations by utilizing tertiary protein structure when identifying nonrandom somatic residue mutation clusters.
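The intuition behind structure-aware clustering can be sketched in a few lines: residues far apart in the linear sequence may be close in 3D space, so distances should be measured on the folded structure. The coordinates below are invented; iPAC's actual test statistic and remapping procedure are described in the paper:

```python
import itertools
import math

# Mean pairwise 3D distance among a set of mutated residues; a small value
# (relative to random residue sets) suggests spatial clustering.
def mean_pairwise_distance(coords):
    pairs = list(itertools.combinations(coords, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Toy fold: residues 12 and 61 are distant in sequence yet adjacent in 3D.
residue_xyz = {12: (0.0, 0.0, 0.0), 13: (3.8, 0.0, 0.0), 61: (1.0, 1.0, 0.0)}
mean_pairwise_distance([residue_xyz[12], residue_xyz[61]])  # ≈ 1.41
```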
A fundamental goal of the U.S. National Institutes of Health (NIH) "Roadmap" is to strengthen Translational Research, defined as the movement of discoveries in basic research to application at the clinical level. A significant barrier to translational research is the lack of uniformly structured data across related biomedical domains. The Semantic Web is an extension of the current Web that enables navigation and meaningful use of digital resources by automatic processes. It is based on common formats that support aggregation and integration of data drawn from diverse sources. A variety of technologies have been built on this foundation that, together, support identifying, representing, and reasoning across a wide range of biomedical data. The Semantic Web Health Care and Life Sciences Interest Group (HCLSIG), set up within the framework of the World Wide Web Consortium, was launched to explore the application of these technologies in a variety of areas. Subgroups focus on making biomedical data available in RDF, working with biomedical ontologies, prototyping clinical decision support systems, working on drug safety and efficacy communication, and supporting disease researchers in navigating and annotating the large amount of potentially relevant literature.
We present a scenario that shows the value of the information environment the Semantic Web can support for aiding neuroscience researchers. We then report on several projects by members of the HCLSIG, in the process illustrating the range of Semantic Web technologies that have applications in areas of biomedicine.
Semantic Web technologies present both promise and challenges. Current tools and standards are already adequate to implement components of the bench-to-bedside vision. On the other hand, these technologies are young. Gaps in standards and implementations still exist and adoption is limited by typical problems with early technology, such as the need for a critical mass of practitioners and installed base, and growing pains as the technology is scaled up. Still, the potential of interoperable knowledge sources for biomedicine, at the scale of the World Wide Web, merits continued work.
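The RDF data model underlying these technologies reduces to subject-predicate-object triples. A minimal sketch using plain tuples rather than a full RDF library (the URIs and facts below are invented for illustration):

```python
# Each RDF statement is a (subject, predicate, object) triple.
triples = {
    ("ex:GeneX", "ex:expressedIn", "ex:Hippocampus"),
    ("ex:GeneX", "ex:associatedWith", "ex:Disease1"),
    ("ex:Disease1", "rdfs:label", "Parkinson disease"),
}

def objects(subject, predicate):
    """Tiny pattern match: every object asserted for (subject, predicate)."""
    return {o for s, p, o in triples if s == subject and p == predicate}

objects("ex:GeneX", "ex:expressedIn")  # {'ex:Hippocampus'}
```

Aggregating triples from multiple sources is what makes the data integration described above possible: triples sharing a subject or object merge into one graph without schema changes.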