In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
There is a huge demand on bioinformaticians to provide their biologists with user-friendly and scalable software infrastructures to capture, exchange, and exploit the unprecedented amounts of new *omics data. We here present MOLGENIS, a generic, open source, software toolkit to quickly produce the bespoke MOLecular GENetics Information Systems needed.
The MOLGENIS toolkit provides bioinformaticians with a simple language to model biological data structures and user interfaces. At the push of a button, MOLGENIS' generator suite automatically translates these models into a feature-rich, ready-to-use web application including database, user interfaces, exchange formats, and scriptable interfaces. Each generator is a template of SQL, Java, R, or HTML code that would require much effort to write by hand. This 'model-driven' method ensures reuse of best practices and improves quality because the modeling language and generators are shared between all MOLGENIS applications, so that errors are found quickly and improvements are shared easily by a re-generation. A plug-in mechanism ensures that both the generator suite and generated product can be customized just as much as hand-written software.
In recent years we have successfully evaluated the MOLGENIS toolkit for the rapid prototyping of many types of biomedical applications, including next-generation sequencing, GWAS, QTL, proteomics and biobanking. Writing 500 lines of model XML typically replaces 15,000 lines of hand-written programming code, which allows for quick adaptation if the information system is not yet to the biologist's satisfaction. Each application generated with MOLGENIS comes with an optimized database back-end, user interfaces for biologists to manage and exploit their data, programming interfaces for bioinformaticians to script analysis tools in R, Java, SOAP, REST/JSON and RDF, a tab-delimited file format to ease upload and exchange of data, and detailed technical documentation. Existing databases can be quickly enhanced with MOLGENIS-generated interfaces using the 'ExtractModel' procedure.
The MOLGENIS toolkit provides bioinformaticians with a simple model to quickly generate flexible web platforms for all possible genomic, molecular and phenotypic experiments with a richness of interfaces not provided by other tools. All the software and manuals are available free as LGPLv3 open source at http://www.molgenis.org.
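The model-driven idea described for MOLGENIS above can be illustrated with a minimal Python sketch. It is not MOLGENIS code and does not use the MOLGENIS modeling language: the `MODEL` dictionary, the type names, and both generator functions are hypothetical stand-ins, chosen only to show how several artifacts (here a SQL schema and a tab-delimited header) might be derived from one shared model, so that a change to the model propagates on re-generation.

```python
# Hypothetical, simplified data model: entity name -> list of (field, type).
# This is NOT the MOLGENIS modeling language, only an illustration.
MODEL = {
    "Sample": [("id", "int"), ("name", "string"), ("concentration", "decimal")],
}

# Assumed mapping from model types to SQL column types.
SQL_TYPES = {"int": "INTEGER", "string": "VARCHAR(255)", "decimal": "DECIMAL(12,4)"}


def generate_sql(model):
    """Generator 1: emit one CREATE TABLE statement per entity."""
    statements = []
    for entity, fields in model.items():
        columns = ",\n".join(f"  {name} {SQL_TYPES[kind]}" for name, kind in fields)
        statements.append(f"CREATE TABLE {entity} (\n{columns}\n);")
    return "\n\n".join(statements)


def generate_tsv_header(model, entity):
    """Generator 2: emit the tab-delimited header row for one entity."""
    return "\t".join(name for name, _ in model[entity])


if __name__ == "__main__":
    print(generate_sql(MODEL))
    print(generate_tsv_header(MODEL, "Sample"))
```

Adding a field to `MODEL` and re-running both generators updates the schema and the exchange format together, which is the kind of reuse the abstract attributes to shared generators.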
The Finnish Disease Heritage Database (FinDis) (http://findis.org) was originally published in 2004 as a centralized information resource for rare monogenic diseases enriched in the Finnish population. The FinDis database originally contained 405 causative variants for 30 diseases. At the time, the FinDis database was a comprehensive collection of data, but since 1994, a large amount of new information has emerged, making the need to update the database evident. We collected information and updated the database to contain genes and causative variants for 35 diseases, including six more genes and more than 1,400 additional disease‐causing variants. Information for causative variants for each gene is collected under the LOVD 3.0 platform, enabling easy updating. The FinDis portal provides a centralized resource and user interface to link information on each disease and gene with variant data in the LOVD 3.0 platform. The software written to achieve this has been open‐sourced and made available on GitHub (http://github.com/findis‐db), allowing biomedical institutions in other countries to present their national data in a similar way, and to both contribute to, and benefit from, standardized variation data. The updated FinDis portal provides a unique resource to assist patient diagnosis, research, and the development of new cures.
The Finnish disease heritage refers to a group of rare monogenic diseases that are, by definition, more prevalent in Finland than elsewhere in the world. In this work, data for the 36 diseases were updated and made available from a web portal (http://findis.org) based on LOVD instances, which are provided as a service for curators by the Leiden University Medical Center.
Sharing of data about variation and the associated phenotypes is a critical need, yet variant information can be arbitrarily complex, making a single standard vocabulary elusive and re-formatting difficult. Complex standards have proven too time-consuming to implement.
The GEN2PHEN project addressed these difficulties by developing a comprehensive data model for capturing biomedical observations, Observ-OM, and building the VarioML format around it. VarioML pairs a simplified open specification for describing variants with a toolkit for adapting the specification into one's own research workflow. Straightforward variant data can be captured, federated, and exchanged with no overhead; more complex data can be described, without loss of compatibility. The open specification enables push-button submission to gene variant databases (LSDBs), e.g., the Leiden Open Variation Database, using the Cafe Variome data publishing service, while VarioML bidirectionally transforms data between XML and web-application code formats, opening up new possibilities for open source web applications building on shared data. A Java implementation toolkit makes VarioML easily integrated into biomedical applications. VarioML is designed primarily for LSDB data submission and transfer scenarios, but can also be used as a standard variation data format for JSON and XML document databases and user interface components.
VarioML is a set of tools and practices improving the availability, quality, and comprehensibility of human variation information. It enables researchers, diagnostic laboratories, and clinics to share that information with ease, clarity, and without ambiguity.
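The bidirectional XML/JSON transformation mentioned for VarioML above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element and attribute names (`variant`, `gene`, `name`, `id`) and the sample values are hypothetical and do not reproduce the actual VarioML schema or its Java toolkit, and the sketch handles only a flat record with text-only child elements.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_record(xml_text):
    """Parse a flat XML element into a JSON-serializable dict."""
    root = ET.fromstring(xml_text)
    return {
        "tag": root.tag,
        "attributes": dict(root.attrib),
        "fields": {child.tag: child.text for child in root},
    }


def record_to_xml(record):
    """Rebuild the XML element from the dict produced by xml_to_record."""
    root = ET.Element(record["tag"], record["attributes"])
    for tag, text in record["fields"].items():
        ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical record; the names and values are placeholders, not VarioML.
EXAMPLE_XML = '<variant id="v1"><gene>EXAMPLE1</gene><name>c.100del</name></variant>'

if __name__ == "__main__":
    record = xml_to_record(EXAMPLE_XML)
    as_json = json.dumps(record)                  # XML -> JSON
    print(as_json)
    print(record_to_xml(json.loads(as_json)))     # JSON -> XML
```

A round trip through JSON and back yields a record equal to the original, which is the property that lets the same data serve both XML document stores and JSON-based web components.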
Genome-wide association analysis on monozygotic twin-pairs offers a route to discovery of gene-environment interactions through testing for variability loci associated with sensitivity to individual environment/lifestyle. We present a genome-wide scan of loci associated with intra-pair differences in serum lipid and apolipoprotein levels. We report data for 1,720 monozygotic female twin-pairs from the GenomEUtwin project with 2.5 million SNPs, imputed or genotyped, and measured serum lipid fractions for both twins. We found one locus associated with intra-pair differences in high-density lipoprotein cholesterol, rs2483058 in an intron of SRGAP2, where twins carrying the C allele are more sensitive to environmental factors (P = 3.98 × 10⁻⁸). We followed up the association in further genotyped monozygotic twins (N = 1,261), which showed a moderate association for the variant (P = 0.200, same direction of effect). In addition, we report a new association on the level of apolipoprotein A-II (P = 4.03 × 10⁻⁸).
A unique genetic background in an isolated population like that of Finland offers special opportunities for genetic research as well as for applying genetic developments to health care. On the other hand, the different genetic background may require local attempts to develop diagnostics and treatment, as the selection of diseases and mutations differs from that in other populations. In this review, we describe the experiences of research and health care in this genetic isolate, starting from the identification of specific monogenic diseases enriched in the Finnish population all the way to implementing the knowledge of the unique genetic background in genomic medicine at the population level.
Temporal lobe epilepsy is the most common drug-resistant form of epilepsy in adults. The reorganization of neural networks and the gene expression landscape underlying pathophysiologic network behavior in brain structures such as the hippocampus has been suggested to be controlled, in part, by microRNAs. To systematically assess their significance, we sequenced Argonaute-loaded microRNAs to define functionally engaged microRNAs in the hippocampus of three different animal models in two species and at six time points from the initial precipitating insult through to the establishment of chronic epilepsy. We then selected commonly up-regulated microRNAs for a functional in vivo therapeutic screen using oligonucleotide inhibitors. Argonaute sequencing generated 1.44 billion small RNA reads of which up to 82% were microRNAs, with over 400 unique microRNAs detected per model. Approximately half of the detected microRNAs were dysregulated in each epilepsy model. We prioritized commonly up-regulated microRNAs that were fully conserved in humans and designed custom antisense oligonucleotides for these candidate targets. Antiseizure phenotypes were observed upon knockdown of miR-10a-5p, miR-21a-5p, and miR-142a-5p and electrophysiological analyses indicated broad safety of this approach. Combined inhibition of these three microRNAs reduced spontaneous seizures in epileptic mice. Proteomic data, RNA sequencing, and pathway analysis on predicted and validated targets of these microRNAs implicated derepressed TGF-β signaling as a shared seizure-modifying mechanism. Correspondingly, inhibition of TGF-β signaling occluded the antiseizure effects of the antagomirs. Together, these results identify shared, dysregulated, and functionally active microRNAs during the pathogenesis of epilepsy which represent therapeutic antiseizure targets.
The flow of research data concerning the genetic basis of health and disease is rapidly increasing in speed and complexity. In response, many projects are seeking to ensure that there are appropriate informatics tools, systems and databases available to manage and exploit this flood of information. Previous solutions, such as central databases, journal-based publication and manually intensive data curation, are now being enhanced with new systems for federated databases, database publication, and more automated management of data flows and quality control. Along with emerging technologies that enhance connectivity and data retrieval, these advances should help to create a powerful knowledge environment for genotype-phenotype information.
FINDbase (http://www.findbase.org) is a comprehensive data repository that records the prevalence of clinically relevant genomic variants in various populations worldwide, such as pathogenic variants leading mostly to monogenic disorders and pharmacogenomics biomarkers. The database also records the incidence of rare genetic diseases in various populations, all in well-distinct data modules. Here, we report extensive data content updates in all data modules, with direct implications for clinical pharmacogenomics. Also, we report significant new developments in FINDbase, namely (i) the release of a new version of the ETHNOS software that catalyzes development and curation of national/ethnic genetic databases, (ii) the migration of all FINDbase data content into 90 distinct national/ethnic mutation databases, all built around Microsoft's PivotViewer (http://www.getpivot.com) software, (iii) new data visualization tools and (iv) the interrelation of FINDbase with the DruGeVar database, with direct implications for clinical pharmacogenomics. The abovementioned updates further enhance the impact of FINDbase as a key resource for Genomic Medicine applications.