Understanding and correctly utilizing relatedness among samples is essential for genetic analysis; however, managing sample records and pedigrees can often be error prone and incomplete. Data sets ascertained by random sampling often harbor cryptic relatedness that can be leveraged in genetic analyses for maximizing power. We have developed a method that uses genome-wide estimates of pairwise identity by descent to identify families and quickly reconstruct and score all possible pedigrees that fit the genetic data by using up to third-degree relatives, and we have included it in the software package PRIMUS (Pedigree Reconstruction and Identification of the Maximally Unrelated Set). Here, we validate its performance on simulated, clinical, and HapMap pedigrees. Among these samples, we demonstrate that PRIMUS can verify reported pedigree structures and identify cryptic relationships. Finally, we show that PRIMUS reconstructed pedigrees, all of which were previously unknown, for 203 families from a cohort collected in Starr County, TX (1,890 samples).
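As a rough illustration of the first step described above (using pairwise identity-by-descent estimates to group samples into families), the following minimal sketch classifies each pair by degree of relatedness and then collects connected samples into family networks. The function names, the input format, and the cut-off values are assumptions made for this example (the cut-offs are midpoints between the expected IBD proportions 0.5, 0.25, and 0.125 on a log2 scale); they are not taken from PRIMUS, and the sketch does not attempt pedigree reconstruction or scoring.

```python
from collections import defaultdict


def relationship_degree(pi_hat):
    """Map a genome-wide IBD proportion to a coarse degree (1-3), or None.

    Hypothetical cut-offs: 2**-1.5, 2**-2.5 and 2**-3.5. Duplicates or
    monozygotic twins (pi_hat near 1) are not separated from first-degree
    pairs in this simplified version.
    """
    if pi_hat >= 0.354:
        return 1
    if pi_hat >= 0.177:
        return 2
    if pi_hat >= 0.088:
        return 3
    return None


def family_networks(pairs):
    """Group samples into families: connected components of related pairs.

    `pairs` is an iterable of (sample_a, sample_b, pi_hat) tuples.
    """
    graph = defaultdict(set)
    for a, b, pi_hat in pairs:
        if relationship_degree(pi_hat) is not None:
            graph[a].add(b)
            graph[b].add(a)

    seen, families = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first search over the relatedness graph.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        families.append(sorted(component))
    return families


# Made-up example pairs: (sample, sample, estimated IBD proportion).
pairs = [
    ("s1", "s2", 0.50),
    ("s2", "s3", 0.26),
    ("s4", "s5", 0.12),
    ("s1", "s6", 0.01),
]
families = family_networks(pairs)
```

With these made-up pairs, `s1`, `s2`, and `s3` fall into one family network and `s4` and `s5` into another, while `s6` is treated as unrelated to everyone.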
Prognostic tools are required to guide clinical decision-making in COVID-19.
We studied the relationship between the ratio of interleukin (IL)-6 to IL-10 and clinical outcome in 80 patients hospitalized for COVID-19, and created a simple 5-point linear score predictor of clinical outcome, the Dublin-Boston score. Clinical outcome was analysed as a three-level ordinal variable (“Improved”, “Unchanged”, or “Declined”). For both IL-6:IL-10 ratio and IL-6 alone, we associated clinical outcome with a) baseline biomarker levels, b) change in biomarker level from day 0 to day 2, c) change in biomarker level from day 0 to day 4, and d) slope of biomarker change throughout the study. The associations between ordinal clinical outcome and each of the different predictors were assessed with proportional odds logistic regression. Associations were run both “unadjusted” and adjusted for age and sex. Nested cross-validation was used to identify the model for incorporation into the Dublin-Boston score.
The 4-day change in IL-6:IL-10 ratio was chosen to derive the Dublin-Boston score. Each 1-point increase in the score was associated with a 5.6-fold increase in the odds of a more severe outcome (OR 5.62, 95% CI 3.22–9.81, P = 1.2 × 10⁻⁹). Both the Dublin-Boston score and the 4-day change in IL-6:IL-10 significantly outperformed IL-6 alone in predicting clinical outcome at day 7.
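To make the per-point odds ratio concrete, the proportional odds model named in the methods can be written out as a short worked example. The symbols used here ($Y$ for the ordinal outcome coded so that higher levels are more severe, $s$ for the score, $\alpha_j$ for the level-specific intercepts, $\beta$ for the score coefficient) are introduced for illustration only and are not taken from the paper.

```latex
\[
\operatorname{logit} P(Y \ge j \mid s) = \alpha_j + \beta s,
\qquad
\mathrm{OR} = e^{\beta} = 5.62
\;\Longrightarrow\;
\beta = \ln 5.62 \approx 1.73 .
\]
\[
\frac{\operatorname{odds}(Y \ge j \mid s + k)}{\operatorname{odds}(Y \ge j \mid s)}
= e^{k\beta} = 5.62^{k},
\qquad
\text{e.g. } k = 2:\; 5.62^{2} \approx 31.6 .
\]
```

Under this model the same multiplier applies at every outcome threshold $j$, which is what allows a single odds ratio to summarise the effect of each additional point.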
The Dublin-Boston score is easily calculated and can be applied to a spectrum of hospitalized COVID-19 patients. More informed prognosis could help determine when to escalate care, institute or remove mechanical ventilation, or drive considerations for therapies.
Funding was received from the Elaine Galwey Research Fellowship, American Thoracic Society, National Institutes of Health and the Parker B Francis Research Opportunity Award.
The identification and understanding of gene-environment interactions can provide insights into the pathways and mechanisms underlying complex diseases. However, testing for gene-environment interaction remains a challenge since (a) statistical power is often limited and (b) modeling of environmental effects is nontrivial, and such model misspecifications can lead to false positive interaction findings. To address the lack of statistical power, recent methods aim to identify interactions on an aggregated level using, for example, polygenic risk scores. While this strategy can increase the power to detect interactions, identifying contributing genes and pathways is difficult based on these relatively global results. Here, we propose RITSS (Robust Interaction Testing using Sample Splitting), a gene-environment interaction testing framework for quantitative traits that is based on sample splitting and robust test statistics. RITSS can incorporate sets of genetic variants and/or multiple environmental factors. Based on the user's choice of statistical/machine learning approaches, a screening step selects and combines potential interactions into scores with improved interpretability. In the testing step, the application of robust statistics minimizes the susceptibility to main effect misspecifications. Using extensive simulation studies, we demonstrate that RITSS controls the type 1 error rate in a wide range of scenarios, and we show how the screening strategy influences statistical power. In an application to lung function phenotypes and human height in the UK Biobank, RITSS identified highly significant interactions based on subcomponents of genetic risk scores. While the contributing single variant interaction signals are weak, our results indicate interaction patterns that result in strong aggregated effects, providing potential insights into underlying gene-environment interaction mechanisms.
The Family with Sequence Similarity 13, Member A (FAM13A) gene has been consistently associated with COPD by genome-wide association studies (GWAS). Our previous study demonstrated that FAM13A was mainly expressed in the lung epithelial progenitors including Club cells and alveolar type II epithelial (ATII) cells. Fam13a−/− mice were resistant to cigarette smoke (CS)–induced emphysema through promoting β-catenin/Wnt activation. Despite the important roles of β-catenin/Wnt activation in alveolar regeneration during injury, it is unclear when and where FAM13A regulates the Wnt pathway, the requisite pathway for alveolar epithelial repair, in vivo during CS exposure in lung epithelial progenitors.
Fam13a+/+ or Fam13a−/− mice were crossed with the TCF/Lef:H2B-GFP Wnt-signaling reporter mouse line, in which β-catenin/Wnt-activated cells are labeled with GFP, followed by acute (1 month) or chronic (7 months) CS exposure. Fluorescence-activated flow cytometry analysis, immunofluorescence and an organoid culture system were used to identify the β-catenin/Wnt-activated cells in Fam13a+/+ or Fam13a−/− mice exposed to CS. A Fam13a;SftpcCreERT2;Rosa26RmTmG mouse line, where GFP labels ATII cells, was generated for alveolar organoid culture followed by analyses of organoid number, immunofluorescence and gene expression. Single cell RNA-seq data from COPD ever smokers and nonsmoker control lungs were further analyzed.
We found that FAM13A deficiency significantly increased Wnt activation, mainly in lung epithelial cells. Consistently, after long-term CS exposure in vivo, FAM13A deficiency conferred enhanced proliferation and differentiation on alveolar epithelial progenitor cells in the ex vivo organoid model. Importantly, expression of FAM13A was significantly increased in human COPD-derived ATII cells compared to healthy ATII cells, as suggested by single cell RNA-sequencing data.
Our findings suggest that FAM13A deficiency promotes Wnt pathway-mediated ATII cell repair/regeneration, thereby possibly mitigating CS-induced alveolar destruction.
This project is funded by United States National Institutes of Health (NIH) grants R01HL127200, R01HL137927, R01HL148667 and R01HL147148 (XZ).
As next-generation sequencing data become available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data because of their enormous size. This is, and will remain, a problem that researchers working on genetic data encounter every day. There are some options available for compressing and storing such data, such as general-purpose compression software, PBAT/PLINK binary format, etc. However, these currently available methods either do not offer sufficient compression rates or require a great amount of CPU time for decompression and loading every time the data are accessed.
Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundreds, which potentially allows SNP data of hundreds of Gigabytes to be stored in hundreds of Megabytes. We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real datasets, we show that the algorithm gives a greater compression rate than the commonly used compression methods, and that the data-loading process takes less time. Also, the C++ library provides direct-data-retrieving functions, which allow the compressed information to be easily accessed by other C++ programs.
The SpeedGene algorithm enables the storage and the analysis of next-generation sequencing data in current hardware environments, making system upgrades unnecessary.
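For intuition about how compact genotype storage can be read without a separate decompression pass, the sketch below packs SNP genotype codes into 2 bits each, four genotypes per byte. This is a generic fixed-width packing scheme shown under stated assumptions, not the SpeedGene format, which the abstract does not specify; the function names and the coding (0, 1, 2 copies of an allele, 3 for missing) are hypothetical. Relative to one 4-byte integer per genotype, 2-bit coding corresponds to a factor of 32 / 2 = 16.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (0, 1, 2, or 3 = missing) into bytes, 4 per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2, 3):
            raise ValueError(f"genotype code out of range: {g!r}")
        # Genotype i occupies bits 2*(i % 4) .. 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def get_genotype(packed, i):
    """Read genotype i directly from the packed bytes (random access)."""
    return (packed[i // 4] >> (2 * (i % 4))) & 0b11


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes."""
    return [get_genotype(packed, i) for i in range(n)]


genotypes = [0, 1, 2, 3, 2, 1, 0]
packed = pack_genotypes(genotypes)
restored = unpack_genotypes(packed, len(genotypes))
```

Because each genotype sits at a fixed bit offset, a single value can be retrieved with one shift and one mask, which is the sense in which a packed layout can be loaded and queried directly.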
The serpin family A member 1 (SERPINA1) Z allele is present in approximately one in 25 individuals of European ancestry. Z allele homozygosity (Pi*ZZ) is the most common cause of alpha 1‐antitrypsin deficiency and is a proven risk factor for cirrhosis. We examined whether heterozygous Z allele (Pi*Z) carriers in United Kingdom (UK) Biobank, a population‐based cohort, are at increased risk of liver disease. We replicated findings in Massachusetts General Brigham Biobank, a hospital‐based cohort. We also examined variants associated with liver disease and assessed for gene–gene and gene–environment interactions. In UK Biobank, we identified 1,493 cases of cirrhosis, 12,603 Z allele heterozygotes, and 129 Z allele homozygotes among 312,671 unrelated white British participants. Heterozygous carriage of the Z allele was associated with cirrhosis compared to noncarriage (odds ratio [OR], 1.53; P = 1.1 × 10⁻⁴); homozygosity of the Z allele also increased the risk of cirrhosis (OR, 11.8; P = 1.8 × 10⁻⁹). The OR for cirrhosis of the Z allele was comparable to that of well‐established genetic variants, including patatin‐like phospholipase domain containing 3 (PNPLA3) I148M (OR, 1.48; P = 1.1 × 10⁻²²) and transmembrane 6 superfamily member 2 (TM6SF2) E167K (OR, 1.34; P = 2.6 × 10⁻⁶). In heterozygotes compared to noncarriers, the Z allele was associated with higher alanine aminotransferase (ALT; P = 4.6 × 10⁻⁴⁶), aspartate aminotransferase (AST; P = 2.2 × 10⁻²⁷), alkaline phosphatase (P = 3.3 × 10⁻⁴³), gamma‐glutamyltransferase (P = 1.2 × 10⁻⁵), and total bilirubin (P = 6.4 × 10⁻⁶); Z allele homozygotes had even greater elevations in liver biochemistries. Body mass index (BMI) amplified the association of the Z allele with ALT (P interaction = 0.021) and AST (P interaction = 0.0040), suggesting a gene–environment interaction.
Finally, we demonstrated genetic interactions between variants in PNPLA3, TM6SF2, and hydroxysteroid 17‐beta dehydrogenase 13 (HSD17B13); there was no evidence of epistasis between the Z allele and these variants. Conclusion: SERPINA1 Z allele heterozygosity is an important risk factor for liver disease; this risk is amplified by increasing BMI.
Family-based designs have been shown to be powerful in detecting the significant rare variants associated with human diseases. However, very few significant results have been found owing to relatively small sample sizes and the fact that statistical analyses often suffer from high false-negative error rates. These limitations can be avoided by combining results from multiple studies via meta-analysis. However, statistical methods for meta-analysis with rare variants are limited for family-based samples. In this report, we propose a tool for the meta-analysis of family-based rare variant associations, metaFARVAT. metaFARVAT is based on a quasi-likelihood score for each variant. These scores are combined to generate burden test, variable-threshold test, sequence kernel association test (SKAT), and optimal SKAT statistics. The proposed method tests homogeneous and heterogeneous effects of variants among different studies and can be applied to both quantitative and dichotomous phenotypes. Simulation results demonstrated the robustness and efficiency of the proposed method in different scenarios. By applying metaFARVAT to data from a family-based study and a case-control study, we identified a few promising candidate genes, including
, which is associated with chronic obstructive pulmonary disease.
Mendelian transmission produces phenotypic and genetic relatedness between family members, giving family-based analytical methods an important role in genetic epidemiological studies, from heritability estimations to genetic association analyses. With the advance in genotyping technologies, whole-genome sequence data can be utilized for genetic epidemiological studies, and family-based samples may become more useful for detecting de novo mutations. However, genetic analyses employing family-based samples usually suffer from the complexity of the computational/statistical algorithms, and certain types of family designs, such as incorporating data from extended families, have rarely been used.
We present a Workbench for Integrated Superfast Association studies for Related Data (WISARD) programmed in C/C++. WISARD enables fast and comprehensive analysis of SNP-chip and next-generation sequencing data on extended families, with applications from designing genetic studies to summarizing analysis results. In addition, WISARD can automatically be run in a fully multithreaded manner, and the integration of R software for visualization makes it more accessible to non-experts.
Comparison with existing toolsets showed that WISARD is computationally suitable for integrated analysis of related subjects and that it outperforms existing toolsets. WISARD has also been successfully used to analyze a large-scale sequencing dataset for chronic obstructive pulmonary disease (COPD), in which we identified multiple genes associated with COPD, demonstrating its practical value.
Family‐based designs have been repeatedly shown to be powerful in detecting the significant rare variants associated with human diseases. Furthermore, human diseases are often defined by the outcomes of multiple phenotypes, and thus we expect multivariate family‐based analyses may be very efficient in detecting associations with rare variants. However, few statistical methods implementing this strategy have been developed for family‐based designs. In this report, we describe one such implementation: the multivariate family‐based rare variant association tool (mFARVAT). mFARVAT is a quasi‐likelihood‐based score test for rare variant association analysis with multiple phenotypes, and tests both homogeneous and heterogeneous effects of each variant on multiple phenotypes. Simulation results show that the proposed method is generally robust and efficient for various disease models, and we identify some promising candidate genes associated with chronic obstructive pulmonary disease. The mFARVAT software is freely available at http://healthstat.snu.ac.kr/software/mfarvat/, implemented in C++ and supported on Linux and MS Windows.