Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.
* Learn general methods for specifying Big Data in a way that is understandable to humans and to computers
* Avoid the pitfalls in Big Data design and analysis
* Understand how to create and use Big Data safely and responsibly, in accordance with the laws, regulations and ethical standards that apply to the acquisition, distribution and integration of Big Data resources
WikiPathways (wikipathways.org) captures the collective knowledge represented in biological pathways. By providing a curated, machine-readable database, WikiPathways enables omics data analysis and visualization. WikiPathways and other pathway databases are used to analyze experimental data by research groups in many fields. Due to the open and collaborative nature of the WikiPathways platform, our content keeps growing and is getting more accurate, making WikiPathways a reliable and rich pathway database. Previously, however, the focus was primarily on genes and proteins, leaving many metabolites with only limited annotation. Recent curation efforts focused on improving the annotation of metabolism and metabolic pathways by associating unmapped metabolites with database identifiers and providing more detailed interaction knowledge. Here, we report the outcomes of the continued growth and curation efforts, such as a doubling of the number of annotated metabolite nodes in WikiPathways. Furthermore, we introduce an OpenAPI documentation of our web services and the FAIR (Findable, Accessible, Interoperable and Reusable) annotation of resources to increase the interoperability of the knowledge encoded in these pathways and experimental omics data. New search options, monthly downloads, more links to metabolite databases, and new portals make pathway knowledge more easily accessible to individual researchers and research communities.
This open access book constitutes selected papers presented during the 30th Irish Conference on Artificial Intelligence and Cognitive Science, held in Munster, Ireland, in December 2022. The 41 presented papers were thoroughly reviewed and selected from the 102 submissions. They are organized in topical sections on machine learning, deep learning and applications; responsible and trustworthy artificial intelligence; natural language processing and recommender systems; knowledge representation, reasoning, optimisation and intelligent applications.
Lignin is a heterogeneous aromatic biopolymer and a major constituent of lignocellulosic biomass, such as wood and agricultural residues. Despite the high amount of aromatic carbon present, the severe recalcitrance of the lignin macromolecule makes it difficult to convert into value-added products. In nature, lignin and lignin-derived aromatic compounds are catabolized by consortia of microbes specialized in breaking down natural lignin and its constituents. In an attempt to bridge the gap between the fundamental knowledge of microbial lignin catabolism and the recently emerging field of applied biotechnology for lignin biovalorization, we have developed the eLignin Microbial Database (www.elignindatabase.com), an openly available database that indexes data from the lignin bibliome, such as microorganisms, aromatic substrates, and metabolic pathways. In the present contribution, we introduce the eLignin database, use its dataset to map the reported ecological and biochemical diversity of the lignin microbial niches, and discuss the findings.
This open access book constitutes the proceedings of the 24th International Conference on Foundations of Software Science and Computational Structures, FOSSACS 2021, which was held from March 27 to April 1, 2021, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021. The conference was planned to take place in Luxembourg and changed to an online format due to the COVID-19 pandemic. The 28 regular papers presented in this volume were carefully reviewed and selected from 88 submissions. They deal with research on theories and methods to support the analysis, integration, synthesis, transformation, and verification of programs and software systems.
The Inorganic Crystal Structure Database (ICSD) is the world's largest database of fully evaluated and published crystal structure data, mostly obtained from experimental results. However, the purely experimental approach is no longer the only route to discover new compounds and structures. In the past few decades, numerous computational methods for simulating and predicting structures of inorganic solids have emerged, creating large numbers of theoretical crystal data. In order to take account of these new developments, the scope of the ICSD was extended in 2017 to include theoretical structures which are published in peer‐reviewed journals. Each theoretical structure has been carefully evaluated, and the resulting CIF has been extended and standardized. Furthermore, a first classification of theoretical data in the ICSD is presented, including additional categories used for comparison of experimental and theoretical information.
The article discusses how theoretical crystal data are supplementing experimental data for simulation and prediction of structures of inorganic solids in the Inorganic Crystal Structure Database.
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. New resources include the Comparative Genome Resource (CGR) and the BLAST ClusteredNR database. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, IgBLAST, GDV, RefSeq, NCBI Virus, GenBank type assemblies, iCn3D, ClinVar, GTR, dbGaP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
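As a rough illustration of the E-utilities programming interface mentioned in the NCBI abstract, the sketch below builds an ESearch request URL without sending it. The helper name build_esearch_url, the example search term, and the choice of JSON output are assumptions made for illustration and are not taken from the abstract.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for `term` in one Entrez database.

    Hypothetical helper: it only constructs the URL and performs
    no network request.
    """
    params = {
        "db": db,           # Entrez database name, e.g. "pubmed"
        "term": term,       # free-text query
        "retmax": retmax,   # maximum number of IDs to return
        "retmode": "json",  # ask for JSON instead of the default XML
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url("pubmed", "lignin catabolism", retmax=5))
```

The returned URL could then be fetched with any HTTP client; NCBI's usage guidelines (rate limits, API keys) would apply to real requests.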
The MetaCyc database (MetaCyc.org) is a freely accessible comprehensive database describing metabolic pathways and enzymes from all domains of life. The majority of MetaCyc pathways are small-molecule metabolic pathways that have been experimentally determined. MetaCyc contains more than 2400 pathways derived from >46,000 publications, and is the largest curated collection of metabolic pathways. BioCyc (BioCyc.org) is a collection of 5700 organism-specific Pathway/Genome Databases (PGDBs), each containing the full genome and predicted metabolic network of one organism, including metabolites, enzymes, reactions, metabolic pathways, predicted operons, transport systems, and pathway-hole fillers. The BioCyc website offers a variety of tools for querying and analyzing PGDBs, including Omics Viewers and tools for comparative analysis. This article provides an update of new developments in MetaCyc and BioCyc during the last two years, including addition of Gibbs free energy values for compounds and reactions; redesign of the primary gene/protein page; addition of a tool for creating diagrams containing multiple linked pathways; several new search capabilities, including searching for genes based on sequence patterns, searching for databases based on an organism's phenotypes, and a cross-organism search; and a metabolite identifier translation service.
The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) provides information about interactions between chemicals and gene products, and their relationships to diseases. Core CTD content (chemical-gene, chemical-disease and gene-disease interactions manually curated from the literature) is integrated internally as well as with select external datasets to generate expanded networks and predict novel associations. Today, core CTD includes more than 30.5 million toxicogenomic connections relating chemicals/drugs, genes/proteins, diseases, taxa, Gene Ontology (GO) annotations, pathways, and gene interaction modules. In this update, we report a 33% increase in our core data content since 2015, describe our new exposure module (that harmonizes exposure science information with core toxicogenomic data) and introduce a novel dataset of GO-disease inferences (that identify common molecular underpinnings for seemingly unrelated pathologies). These advancements centralize and contextualize real-world chemical exposures with molecular pathways to help scientists generate testable hypotheses in an effort to understand the etiology and mechanisms underlying environmentally influenced diseases.