We present an extensible software model for the genotype and phenotype community, XGAP. Readers can download a standard XGAP (http://www.xgap.org) or auto-generate a custom version using MOLGENIS ...with programming interfaces to R-software and web-services or user interfaces for biologists. XGAP has simple load formats for any type of genotype, epigenotype, transcript, protein, metabolite or other phenotype data. Current functionality includes tools ranging from eQTL analysis in mouse to genome-wide association studies in humans.
Negative selection against deleterious alleles produced by mutation influences within-population variation as the most pervasive form of natural selection. However, it is not known whether ...deleterious alleles affect fitness independently, so that cumulative fitness loss depends exponentially on the number of deleterious alleles, or synergistically, so that each additional deleterious allele results in a larger decrease in relative fitness. Negative selection with synergistic epistasis should produce negative linkage disequilibrium between deleterious alleles and, therefore, an underdispersed distribution of the number of deleterious alleles in the genome. Indeed, we detected underdispersion of the number of rare loss-of-function alleles in eight independent data sets from human and fly populations. Thus, selection against rare protein-disrupting alleles is characterized by synergistic epistasis, which may explain how human and fly populations persist despite high genomic mutation rates.
There is an urgent need to standardize the semantics of biomedical data values, such as phenotypes, to enable comparative and integrative analyses. However, it is unlikely that all studies will use ...the same data collection protocols. As a result, retrospective standardization is often required, which involves matching of original (unstructured or locally coded) data to widely used coding or ontology systems such as SNOMED CT (clinical terms), ICD-10 (International Classification of Disease) and HPO (Human Phenotype Ontology). This data curation process is usually a time-consuming process performed by a human expert. To help mechanize this process, we have developed SORTA, a computer-aided system for rapidly encoding free text or locally coded values to a formal coding system or ontology. SORTA matches original data values (uploaded in semicolon delimited format) to a target coding system (uploaded in Excel spreadsheet, OWL ontology web language or OBO open biomedical ontologies format). It then semi- automatically shortlists candidate codes for each data value using Lucene and n-gram based matching algorithms, and can also learn from matches chosen by human experts. We evaluated SORTA's applicability in two use cases. For the LifeLines biobank, we used SORTA to recode 90 000 free text values (including 5211 unique values) about physical exercise to MET (Metabolic Equivalent of Task) codes. For the CINEAS clinical symptom coding system, we used SORTA to map to HPO, enriching HPO when necessary (315 terms matched so far). Out of the shortlists at rank 1, we found a precision/recall of 0.97/0.98 in LifeLines and of 0.58/0.45 in CINEAS. More importantly, users found the tool both a major time saver and a quality improvement because SORTA reduced the chances of human mistakes. Thus, SORTA can dramatically ease data (re)coding tasks and we believe it will prove useful for many more projects. Database URL: http://molgenis.org/sorta or as an open source download from http://www.molgenis.org/wiki/SORTA.
During a meeting of the SYSGENET working group 'Bioinformatics', currently available software tools and databases for systems genetics in mice were reviewed and the needs for future developments ...discussed. The group evaluated interoperability and performed initial feasibility studies. To aid future compatibility of software and exchange of already developed software modules, a strong recommendation was made by the group to integrate HAPPY and R/qtl analysis toolboxes, GeneNetwork and XGAP database platforms, and TIQS and xQTL processing platforms. R should be used as the principal computer language for QTL data analysis in all platforms and a 'cloud' should be used for software dissemination to the community. Furthermore, the working group recommended that all data models and software source code should be made visible in public repositories to allow a coordinated effort on the use of common data structures and file formats.
Whole-genome sequencing enables complete characterization of genetic variation, but geographic clustering of rare alleles demands many diverse populations be studied. Here we describe the Genome of ...the Netherlands (GoNL) Project, in which we sequenced the whole genomes of 250 Dutch parent-offspring families and constructed a haplotype map of 20.4 million single-nucleotide variants and 1.2 million insertions and deletions. The intermediate coverage (∼13×) and trio design enabled extensive characterization of structural variation, including midsize events (30-500 bp) previously poorly catalogued and de novo mutations. We demonstrate that the quality of the haplotypes boosts imputation accuracy in independent samples, especially for lower frequency alleles. Population genetic analyses demonstrate fine-scale structure across the country and support multiple ancient migrations, consistent with historical changes in sea level and flooding. The GoNL Project illustrates how single-population whole-genome sequencing can provide detailed characterization of genetic variation and may guide the design of future population studies.
Interactions between proteins are highly conserved across species. As a result, the molecular basis of multiple diseases affecting humans can be studied in model organisms that offer many alternative ...experimental opportunities. One such organism-Caenorhabditis elegans-has been used to produce much molecular quantitative genetics and systems biology data over the past decade. We present WormQTL(HD) (Human Disease), a database that quantitatively and systematically links expression Quantitative Trait Loci (eQTL) findings in C. elegans to gene-disease associations in man. WormQTL(HD), available online at http://www.wormqtl-hd.org, is a user-friendly set of tools to reveal functionally coherent, evolutionary conserved gene networks. These can be used to predict novel gene-to-gene associations and the functions of genes underlying the disease of interest. We created a new database that links C. elegans eQTL data sets to human diseases (34 337 gene-disease associations from OMIM, DGA, GWAS Central and NHGRI GWAS Catalogue) based on overlapping sets of orthologous genes associated to phenotypes in these two species. We utilized QTL results, high-throughput molecular phenotypes, classical phenotypes and genotype data covering different developmental stages and environments from WormQTL database. All software is available as open source, built on MOLGENIS and xQTL workbench.
Ten quick tips for building FAIR workflows Casper de Visser; Lennart F Johansson; Purva Kulkarni ...
PLoS computational biology,
09/2023, Letnik:
19, Številka:
9
Journal Article
Recenzirano
Odprti dostop
Research data is accumulating rapidly and with it the challenge of fully reproducible science. As a consequence, implementation of high-quality management of scientific data has become a global ...priority. The FAIR (Findable, Accesible, Interoperable and Reusable) principles provide practical guidelines for maximizing the value of research data; however, processing data using workflows-systematic executions of a series of computational tools-is equally important for good data management. The FAIR principles have recently been adapted to Research Software (FAIR4RS Principles) to promote the reproducibility and reusability of any type of research software. Here, we propose a set of 10 quick tips, drafted by experienced workflow developers that will help researchers to apply FAIR4RS principles to workflows. The tips have been arranged according to the FAIR acronym, clarifying the purpose of each tip with respect to the FAIR4RS principles. Altogether, these tips can be seen as practical guidelines for workflow developers who aim to contribute to more reproducible and sustainable computational science, aiming to positively impact the open science and FAIR community.
TRIP4 is one of the subunits of the transcriptional coregulator ASC-1, a ribonucleoprotein complex that participates in transcriptional coactivation and RNA processing events. Recessive variants in ...the TRIP4 gene have been associated with spinal muscular atrophy with bone fractures as well as a severe form of congenital muscular dystrophy. Here we present the diagnostic journey of a patient with cerebellar hypoplasia and spinal muscular atrophy (PCH1) and congenital bone fractures. Initial exome sequencing analysis revealed no candidate variants. Reanalysis of the exome data by inclusion in the Solve-RD project resulted in the identification of a homozygous stop-gain variant in the TRIP4 gene, previously reported as disease-causing. This highlights the importance of analysis reiteration and improved and updated bioinformatic pipelines. Proteomic profile of the patient's fibroblasts showed altered RNA-processing and impaired exosome activity supporting the pathogenicity of the detected variant. In addition, we identified a novel genetic form of PCH1, further strengthening the link of this characteristic phenotype with altered RNA metabolism.
Almost half of all individuals affected by intellectual disability (ID) remain undiagnosed. In the Solve-RD project, exome sequencing (ES) datasets from unresolved individuals with (syndromic) ID ...(n = 1,472 probands) are systematically reanalyzed, starting from raw sequencing files, followed by genome-wide variant calling and new data interpretation. This strategy led to the identification of a disease-causing de novo missense variant in TUBB3 in a girl with severe developmental delay, secondary microcephaly, brain imaging abnormalities, high hypermetropia, strabismus and short stature. Interestingly, the TUBB3 variant could only be identified through reanalysis of ES data using a genome-wide variant calling approach, despite being located in protein coding sequence. More detailed analysis revealed that the position of the variant within exon 5 of TUBB3 was not targeted by the enrichment kit, although consistent high-quality coverage was obtained at this position, resulting from nearby targets that provide off-target coverage. In the initial analysis, variant calling was restricted to the exon targets ± 200 bases, allowing the variant to escape detection by the variant calling algorithm. This phenomenon may potentially occur more often, as we determined that 36 established ID genes have robust off-target coverage in coding sequence. Moreover, within these regions, for 17 genes (likely) pathogenic variants have been identified before. Therefore, this clinical report highlights that, although compute-intensive, performing genome-wide variant calling instead of target-based calling may lead to the detection of diagnostically relevant variants that would otherwise remain unnoticed.
The genetic etiology of intellectual disability remains elusive in almost half of all affected individuals. Within the Solve-RD consortium, systematic re-analysis of whole exome sequencing (WES) data ...from unresolved cases with (syndromic) intellectual disability (n = 1,472 probands) was performed. This re-analysis included variant calling of mitochondrial DNA (mtDNA) variants, although mtDNA is not specifically targeted in WES. We identified a functionally relevant mtDNA variant in MT-TL1 (NC_012920.1:m.3291T > C; NC_012920.1:n.62T > C), at a heteroplasmy level of 22% in whole blood, in a 23-year-old male with severe intellectual disability, epilepsy, episodic headaches with emesis, spastic tetraparesis, brain abnormalities, and feeding difficulties. Targeted validation in blood and urine supported pathogenicity, with heteroplasmy levels of 23% and 58% in index, and 4% and 17% in mother, respectively. Interestingly, not all phenotypic features observed in the index have been previously linked to this MT-TL1 variant, suggesting either broadening of the m.3291T > C-associated phenotype, or presence of a co-occurring disorder. Hence, our case highlights the importance of underappreciated mtDNA variants identifiable from WES data, especially for cases with atypical mitochondrial phenotypes and their relatives in the maternal line.