Abstract
MetaCyc (https://MetaCyc.org) is a comprehensive reference database of metabolic pathways and enzymes from all domains of life. It contains more than 2570 pathways derived from >54 000 publications, making it the largest curated collection of metabolic pathways. The data in MetaCyc is strictly evidence-based and richly curated, resulting in an encyclopedic reference tool for metabolism. MetaCyc is also used as a knowledge base for generating thousands of organism-specific Pathway/Genome Databases (PGDBs), which are available in the BioCyc (https://BioCyc.org) and other PGDB collections. This article provides an update on the developments in MetaCyc during the past two years, including the expansion of data and addition of new features.
New technologies are revolutionising biological research and its applications by making it easier and cheaper to generate ever-greater volumes and types of data. In response, the services and infrastructure of the European Bioinformatics Institute (EMBL-EBI, www.ebi.ac.uk) are continually expanding: total disk capacity increases significantly every year to keep pace with demand (75 petabytes as of December 2015), and interoperability between resources remains a strategic priority. Since 2014 we have launched two new resources: the European Variation Archive for genetic variation data and EMPIAR for two-dimensional electron microscopy data, as well as a Resource Description Framework platform. We also launched the Embassy Cloud service, which allows users to run large analyses in a virtual environment next to EMBL-EBI's vast public data resources.
A1: A Distributed In-Memory Graph Database
Buragohain, Chiranjeeb; Risvik, Knut Magne; Brett, Paul ...
Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 06/2020
Conference Proceeding
Open access
A1 is an in-memory distributed database used by the Bing search engine to support complex queries over structured data. The key enablers for A1 are the availability of cheap DRAM and high-speed RDMA (Remote Direct Memory Access) networking in commodity hardware. A1 uses FaRM [11, 12] as its underlying storage layer and builds the graph abstraction and query engine on top. The combination of in-memory storage and RDMA access requires rethinking how data is allocated, organized and queried in a large distributed system. A single A1 cluster can store tens of billions of vertices and edges and support a throughput of 350+ million vertex reads per second with end-to-end query latency in single-digit milliseconds. In this paper we describe the A1 data model, RDMA-optimized data structures and query execution.
Abstract
Thirty years have elapsed since the emergence of the classification of carbohydrate-active enzymes in sequence-based families that became the CAZy database over 20 years ago, freely available for browsing and download at www.cazy.org. In the era of large-scale sequencing and high-throughput biology, it is important to examine the position of this specialist database that is deeply rooted in human curation. The three primary tasks of the CAZy curators are (i) to maintain and update the family classification of this class of enzymes, (ii) to classify sequences newly released by GenBank and the Protein Data Bank and (iii) to capture and present functional information for each family. The CAZy website is updated once a month. Here we briefly summarize the increase in novel families and the annotations conducted during the last 8 years. We present several important changes that facilitate taxonomic navigation and allow users to download the entirety of the annotations. Most importantly, we highlight the considerable amount of work that accompanies the analysis and reporting of biochemical data from the literature.
Information on the size of academic search engines and bibliographic databases (ASEBDs) is often outdated or entirely unavailable. Hence, it is difficult to assess the scope of specific databases, such as Google Scholar. While scientometric studies have estimated ASEBD sizes before, the methods employed were able to compare only a few databases. Consequently, there is no up-to-date comparative information on the sizes of popular ASEBDs. This study aims to fill this blind spot by providing a comparative picture of 12 of the most commonly used ASEBDs. In doing so, we build on and refine previous scientometric research by counting query hit data as an indicator of the number of accessible records. Iterative query optimization makes it possible to identify a maximum number of hits for most ASEBDs. The results were validated in terms of their capacity to assess database size by comparing them with official information on database sizes or previous scientometric studies. The queries used here are replicable, so size information can be updated quickly. The findings provide first-time size estimates of ProQuest and EbscoHost and indicate that Google Scholar's size might have been underestimated so far by more than 50%. By our estimation, Google Scholar, with 389 million records, is currently the most comprehensive academic search engine.
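The estimation method described in this abstract (counting query hits and iteratively searching for the query that maximizes the reported hit count) can be sketched as follows. The function `run_query` is a hypothetical stand-in for an ASEBD's search interface, not any real API; a real implementation would issue the query and parse the reported "about N results" figure.

```python
# Sketch of hit-count-based database size estimation, assuming a
# hypothetical `run_query` that returns the hit count an ASEBD reports
# for a query string. The hard-coded figures below are illustrative only.

def run_query(query: str) -> int:
    # Placeholder for calling a real search engine and parsing its
    # reported hit count.
    reported_hits = {"*": 1_000_000, "a OR the": 1_200_000, "1900-2024": 1_150_000}
    return reported_hits.get(query, 0)

def estimate_size(candidate_queries: list[str]) -> tuple[str, int]:
    """Iterative query optimization: try broad candidate queries and keep
    the one reporting the maximum number of hits, which serves as a
    lower-bound estimate of the number of accessible records."""
    best_query, best_hits = "", 0
    for q in candidate_queries:
        hits = run_query(q)
        if hits > best_hits:
            best_query, best_hits = q, hits
    return best_query, best_hits

query, size = estimate_size(["*", "a OR the", "1900-2024"])
print(query, size)  # the broadest query wins
```

Because reported hit counts are approximations, the maximum found this way is an indicator of accessible records rather than an exact database size, which is why the study validates it against official size statements.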
Moving Objects Databases is the first uniform treatment of moving objects databases, the technology that supports GPS and RFID. It focuses on the modeling and design of data from moving objects, such as people, animals, vehicles, hurricanes, forest fires, oil spills, or armies, as well as the storage, retrieval, and querying of that very voluminous data. It includes homework assignments at the end of each chapter, exercises throughout the text that students can complete as they read, and a solutions manual in the back of the book. The book is intended for graduate or advanced undergraduate students. It is also recommended for computer scientists, database systems engineers and programmers in government, industry and academia, as well as professionals from other disciplines, e.g., geography, geology, soil science, hydrology, urban and regional planning, mobile computing, bioterrorism and homeland security. It demonstrates through many practical examples and illustrations how new concepts and techniques are used to integrate time and space in database applications, and provides exercises and solutions in each chapter to enable the reader to explore recent research results in practice.
Numerous irregular graph datasets, for example social networks or web graphs, may contain trillions of edges. Often, their structure changes over time and they have domain-specific rich data associated with vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. Due to the sheer size and irregularity of such datasets, these systems face unique design challenges. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We focus on identifying and analyzing fundamental categories of these systems (e.g., document stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., Resource Description Framework or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and Atomicity, Consistency, Isolation, Durability). Fifty-one graph database systems are presented and compared, including Neo4j, OrientDB, and Virtuoso. We outline graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we outline future research and engineering challenges related to graph databases.
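The Labeled Property Graph (LPG) model mentioned in this survey can be made concrete with a minimal sketch: vertices and edges carry labels plus arbitrary key-value properties, and adjacency lists keyed by source vertex support label-filtered traversal. The names here (`PropertyGraph`, `add_vertex`, `neighbors`, and so on) are illustrative, not the API of Neo4j or any other system surveyed.

```python
# Minimal in-memory sketch of the Labeled Property Graph model:
# labeled vertices/edges with key-value properties and adjacency lists.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vid: int
    labels: set
    props: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: int
    dst: int
    label: str
    props: dict = field(default_factory=dict)

class PropertyGraph:
    def __init__(self) -> None:
        self.vertices: dict = {}   # vertex id -> Vertex
        self.out_edges: dict = {}  # vertex id -> list of outgoing Edges

    def add_vertex(self, vid: int, *labels: str, **props) -> None:
        self.vertices[vid] = Vertex(vid, set(labels), props)

    def add_edge(self, src: int, dst: int, label: str, **props) -> None:
        self.out_edges.setdefault(src, []).append(Edge(src, dst, label, props))

    def neighbors(self, vid: int, label: str = None) -> list:
        """Vertices reachable from vid, optionally filtered by edge label."""
        return [e.dst for e in self.out_edges.get(vid, [])
                if label is None or e.label == label]

g = PropertyGraph()
g.add_vertex(1, "Person", name="Ada")
g.add_vertex(2, "Person", name="Alan")
g.add_edge(1, 2, "KNOWS", since=1936)
print(g.neighbors(1, "KNOWS"))  # [2]
```

Production systems differ precisely in how they persist these structures (index-based layouts versus linked records, as the taxonomy discusses), but the logical model is this simple.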
Abstract
The Reactome Knowledgebase (https://reactome.org) provides molecular details of signal transduction, transport, DNA replication, metabolism and other cellular processes as an ordered network of molecular transformations in a single consistent data model, an extended version of a classic metabolic map. Reactome functions both as an archive of biological processes and as a tool for discovering functional relationships in data such as gene expression profiles or somatic mutation catalogs from tumor cells. To extend our ability to annotate human disease processes, we have implemented a new drug class and have used it initially to annotate drugs relevant to cardiovascular disease. Our annotation model depends on external domain experts to identify new areas for annotation and to review new content. New web pages facilitate recruitment of community experts and allow those who have contributed to Reactome to identify their contributions and link them to their ORCID records. To improve visualization of our content, we have implemented a new tool to automatically lay out the components of individual reactions with multiple options for downloading the reaction diagrams and associated data, and a new display of our event hierarchy that will facilitate visual interpretation of pathway analysis results.
Abstract
TP53 gene mutations are one of the most frequent somatic events in cancer. The IARC TP53 Database (http://p53.iarc.fr) is a popular resource that compiles occurrence and phenotype data on TP53 germline and somatic variations linked to human cancer. The deluge of data coming from cancer genomic studies generates new data on TP53 variations and attracts a growing number of database users for the interpretation of TP53 variants. Here, we present the current contents and functionalities of the IARC TP53 Database and perform a systematic analysis of TP53 somatic mutation data extracted from this database and from genomic data repositories. This analysis showed that the IARC database has more TP53 somatic mutation data than genomic repositories (29,000 vs. 4,000). However, the more complete screening achieved by genomic studies highlighted some overlooked facts about TP53 mutations, such as the presence of a significant number of mutations occurring outside the DNA‐binding domain in specific cancer types. We also provide an update on TP53 inherited variants, including the ones that should be considered as neutral frequent variations. We thus provide an update of current knowledge on TP53 variations in human cancer as well as inform users on the efficient use of the IARC TP53 Database.
Here we present an update on the IARC TP53 Database contents and features (p53.iarc.fr), and perform a systematic analysis of TP53 somatic mutation data extracted from this database and from genomic data repositories. The IARC database has more TP53 somatic mutation data than genomic repositories (29,000 versus 4,000). Genome‐wide studies confirmed that TP53 is the most frequently mutated cancer gene and revealed the presence of a significant number of mutations outside the DNA‐binding domain in specific cancer types.