We sought to identify determinants of endothelial lineage. Murine embryonic stem cells (mESCs) were fused with human endothelial cells in stable, non-dividing heterokaryons. Using RNA-seq, it is possible to discriminate between human and mouse transcripts in these chimeric heterokaryons. We observed a temporal pattern of gene expression in the ESCs of the heterokaryons that recapitulated ontogeny, with early mesodermal factors being expressed before mature endothelial genes. A set of transcription factors not known to be involved in endothelial development was upregulated, one of which was POU class 3 homeobox 2 (Pou3f2). We confirmed its importance in differentiation to endothelial lineage via loss- and gain-of-function (LOF and GOF) studies. Its role in vascular development was validated in zebrafish embryos using morpholino oligonucleotides. These studies provide a systematic and mechanistic approach for identifying key regulators in directed differentiation of pluripotent stem cells to somatic cell lineages.
Human embryonic stem cells (hESCs) are in vitro derivatives of the inner cell mass of the blastocyst and are characterized by an undifferentiated and pluripotent state that can be perpetuated indefinitely. hESCs provide a unique opportunity both to dissect the molecular mechanisms responsible for the maintenance of pluripotency and to model the initiation of differentiation and cell commitment within the developing embryo. To fully understand these mechanisms, it is necessary to accurately identify the specific transcriptome of hESCs. Many distinct gene annotation methods, such as cDNA and EST sequencing and RNA-Seq, have been used to identify the transcriptome of hESCs. Recently, we developed a new tool (IDP) that integrates hybrid sequencing data to characterize a more reliable and comprehensive hESC transcriptome, including many novel transcripts.
Abstract
Intra-tumor heterogeneity is associated with cancer progression and therapeutic resistance, for example in breast cancer. While existing methods for studying tumor heterogeneity analyze only variant allele frequency (VAF), the genotype of a variant, which can be detected by long reads or paired-end reads, is also informative for inferring subclones. We developed GenoClone to integrate VAF with variant genotype; it showed superior performance in inferring the number of subclones, estimating the fractions of subclones and identifying the somatic single-nucleotide variant composition of subclones. When GenoClone was applied to 389 TCGA breast cancer samples, it revealed extensive intra-tumor heterogeneity. We further found that a few somatic mutations were relevant to the late stage of tumor evolution, including those in the oncogene PIK3CA and the tumor suppressor gene TP53. Moreover, 52 subclones identified from 167 samples shared highly similar somatic mutations and clustered into three groups of sizes 24, 14 and 14. This finding is helpful for understanding the development of breast cancer in certain subgroups of people and for drug development at the population level. Furthermore, GenoClone also identified tumor heterogeneity in different aliquots of the same samples. The implementation of GenoClone is available at http://www.healthcare.uiowa.edu/labs/au/GenoClone/.
Abstract
Motivation
In recent years, long read (LR) sequencing technologies, such as Pacific Biosciences and Oxford Nanopore Technologies, have been shown to substantially improve the quality of genome assembly and transcriptome characterization. Compared to the high cost of genome assembly by LR sequencing, it is more affordable to generate LRs for transcriptome characterization. Thus, when informative transcriptome LR data are available without a high-quality genome, a method for de novo transcriptome assembly and annotation is in high demand.
Results
Without a reference genome, IDP-denovo performs de novo transcriptome assembly, isoform annotation and quantification by integrating the strengths of LRs and short reads. Using the GM12878 human data as a gold standard, we demonstrated that IDP-denovo had superior sensitivity of transcript assembly and high accuracy of isoform annotation. In addition, IDP-denovo outputs two abundance indices to provide a comprehensive expression profile of genes/isoforms. IDP-denovo represents a robust approach for transcriptome assembly, isoform annotation and quantification for non-model organism studies. Applying IDP-denovo to a non-model organism, Dendrobium officinale, we discovered a number of novel genes and novel isoforms that were not reported in the existing annotation library. These results reveal the high diversity of gene isoforms in D. officinale.
Availability and implementation
The dataset of Dendrobium officinale used/analyzed during the current study has been deposited in SRA, with accession code SRP094520. IDP-denovo is available for download at www.healthcare.uiowa.edu/labs/au/IDP-denovo/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Alternative splicing is a prevalent post-transcriptional process, which is not only important to normal cellular function but is also involved in human diseases. Newly developed second-generation sequencing techniques provide high-throughput data (RNA-seq data) to study alternative splicing events in different types of cells. Here, we present a computational method, SpliceMap, to detect splice junctions from RNA-seq data. This method does not depend on any existing annotation of gene structures and is capable of finding novel splice junctions with high sensitivity and specificity. It can handle long reads (50-100 nt) and can exploit paired-read information to improve mapping accuracy. Several parameters are included in the output to indicate the reliability of each predicted junction and help filter out false predictions. We applied SpliceMap to analyze 23 million paired 50-nt reads from human brain tissue. The results show that, at this depth of sequencing, RNA-seq can support reliable detection of splice junctions except for those that are present at very low levels. Compared to current methods, SpliceMap can achieve 12% higher sensitivity without sacrificing specificity.
Abstract
Introduction: Genome-wide CRISPR-Cas9 based loss-of-function screens can be used to find genes essential for the proliferation and survival of cancer cells. While recent studies have focused on establishing reference sets of essential and non-essential genes, correcting copy number effects and characterizing off-target effects, in-depth studies of the effects of gene abundance and of sgRNAs that target multiple genomic loci are lacking. To fill this gap, we here present a bioinformatics workflow to reduce false positives in CRISPR-Cas9 screens.
Description: The gastric adenocarcinoma cell line AGS was infected with the CRISPR knockout library TKOv3 at a multiplicity of infection of 0.3-0.4. We used the cells right after puromycin selection as the baseline sample, and the cells cultured for 14 days or 20 days as the negative selection samples. The sgRNA inserts were amplified by PCR and the corresponding libraries were sequenced on a NextSeq 500 with a single-end 75 bp run, followed by analysis with MAGeCK. The read counts of sgRNAs were normalized by non-essential genes to reduce false positives. The RNA-seq data and copy number data were obtained from the CCLE portal. To characterize sgRNAs targeting multiple genomic loci, Bowtie was used to align sgRNAs to the human reference genome (GRCh38) with no mismatches, and only alignments followed by an NGG PAM site were retained for downstream analysis.
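The PAM filtering step described above can be illustrated with a minimal Python sketch. It is not the Bowtie-based pipeline used in the study: it swaps in a naive exact string search over a toy sequence, and the function names (`forward_sites`, `count_pam_sites`) and the short 5-nt guide in the usage note are assumptions made only for illustration.

```python
# Hypothetical helper names; a naive exact-match search stands in for Bowtie.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def forward_sites(guide, genome):
    """0-based start positions where `guide` matches `genome` exactly
    (no mismatch) and is immediately followed by an NGG PAM."""
    guide = guide.upper()
    genome = genome.upper()
    sites = []
    start = genome.find(guide)
    while start != -1:
        pam = genome[start + len(guide): start + len(guide) + 3]
        # N can be any base, so only the last two PAM bases are checked.
        if len(pam) == 3 and pam[1:] == "GG":
            sites.append(start)
        start = genome.find(guide, start + 1)
    return sites


def count_pam_sites(guide, genome):
    """Count perfect, PAM-adjacent target sites on both strands."""
    genome = genome.upper()
    return (len(forward_sites(guide, genome))
            + len(forward_sites(guide, reverse_complement(genome))))
```

For example, `count_pam_sites("ACGTA", "ACGTATGGCCACGTACTT")` counts one site: the guide occurs twice on the forward strand, but only the first occurrence is followed by an NGG motif. A guide with a count above one would be flagged as targeting multiple genomic loci.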
Summary: Integration of RNA-seq data with CRISPR negative screen results showed that the selection signal was noisy for lowly expressed genes. The fraction of selected essential genes (overall FDR<0.05, absolute value of beta score >1) was as low as 0.11% among genes in the bottom 10% of expression level, versus 27% among genes in the top 10%. After filtering out the lowly expressed genes (<0.06 RPKM), the selected essential genes had an FDR much closer to 0. Of the 40 essential genes selected without filtering out lowly expressed genes, none has been reported as an oncogene in the literature. To study the influence of multiple alignments of sgRNAs, we considered only those with perfect alignments (i.e., no mismatch), to avoid confounding with off-target effects caused by mismatch tolerance. Log fold changes in read counts were calculated for each sgRNA between a later time point (day 14 or 20) and baseline (day 0). The median log fold change decreased significantly as a function of the number of perfect alignments (p = 0.0001, Jonckheere trend test). This supports the hypothesis that an sgRNA aligned to several DNA targets will introduce multiple double-stranded cuts, and thus will result in biased essentiality scores.
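The per-sgRNA log fold change and its median per number of perfect alignments can be sketched as follows. This is only an illustration under stated assumptions: the pseudocount of 1, the base-2 logarithm, the record layout and the function names are hypothetical, the normalization by non-essential genes is omitted, and the Jonckheere trend test itself is not implemented.

```python
import math
from collections import defaultdict
from statistics import median


def log2_fold_change(later, baseline, pseudocount=1.0):
    """log2 ratio of later vs. baseline read counts.
    The pseudocount (an assumption) avoids division by zero."""
    return math.log2((later + pseudocount) / (baseline + pseudocount))


def median_lfc_by_alignment_count(records):
    """records: iterable of (n_perfect_alignments, baseline_count, later_count).
    Returns {n_perfect_alignments: median log2 fold change}."""
    grouped = defaultdict(list)
    for n_align, baseline, later in records:
        grouped[n_align].append(log2_fold_change(later, baseline))
    return {n: median(values) for n, values in sorted(grouped.items())}
```

With made-up counts such as `[(1, 99, 99), (1, 99, 199), (2, 99, 49), (2, 99, 24)]`, the sgRNAs with two perfect alignments get a lower median log fold change than those with one, which is the direction of the trend reported above.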
Conclusions: Filtering out lowly expressed genes prior to CRISPR screen data analysis can reduce false positives. In addition, multiple-target sgRNAs can lead to false positives, but the effect needs further analysis on a case-by-case basis.
Citation Format: Yue Zhao, Xue Wu, Yuru Wang, Kin Fai Au, Lijun Cheng, Lang Li. New bioinformatics workflow of genome-wide CRISPR-Cas9 knockout screens abstract. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 830.
Clustered regularly interspaced short palindromic repeats (CRISPR)-based genetic perturbation screening is a powerful tool to probe gene function. However, experimental noise, especially for lowly expressed genes, needs to be accounted for to maintain proper control of the false positive rate.
We develop a statistical method, named CRISPR screen with Expression Data Analysis (CEDA), to integrate gene expression profiles and CRISPR screen data for identifying essential genes. CEDA stratifies genes based on expression level and adopts a three-component mixture model for the log-fold change of single-guide RNAs (sgRNAs). An empirical Bayes prior and the expectation-maximization algorithm are used for parameter estimation and false discovery rate inference.
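To make the idea of fitting a three-component mixture by expectation-maximization concrete, here is a minimal pure-Python sketch. It is a generic Gaussian mixture, not CEDA's actual model: the Gaussian components, the starting means of -2, 0 and 2, the function names, and the omission of expression stratification and of the empirical Bayes prior are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_three_component_mixture(values, means=(-2.0, 0.0, 2.0), n_iter=100):
    """Fit a 3-component Gaussian mixture to log fold changes by EM.
    Returns (weights, means, sds, responsibilities)."""
    means = list(means)
    sds = [1.0, 1.0, 1.0]
    weights = [1.0 / 3.0] * 3
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and sd of each component.
        for k in range(3):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor keeps sd positive
    return weights, means, sds, resp
```

In a sketch like this, the responsibility of the lowest-mean component could be read as the posterior probability that an sgRNA is depleted; how CEDA turns such posteriors into gene-level false discovery rates is not described in the abstract and is not attempted here.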
Taking advantage of gene expression data, CEDA identifies essential genes with higher expression. Compared to existing methods, CEDA shows comparable reliability but higher sensitivity in detecting essential genes with moderate sgRNA fold change. Therefore, using the same CRISPR data, CEDA generates an additional hit gene list.
Supplementary data are available at Bioinformatics online.
The most frequently mutated protein in human cancer is p53, a transcription factor (TF) that regulates myriad genes instrumental in diverse cellular outcomes including growth arrest and cell death. Cell context-dependent p53 modulation is critical for this life-or-death balance, yet remains incompletely understood. Here we identify sequence signatures enriched in genomic p53-binding sites modulated by the transcription cofactor iASPP. Moreover, our p53–iASPP crystal structure reveals that iASPP displaces the p53 L1 loop, which mediates sequence-specific interactions with the signature-corresponding base, without perturbing other DNA-recognizing modules of the p53 DNA-binding domain. A TF commonly uses multiple structural modules to recognize its cognate DNA, and thus this mechanism of a cofactor fine-tuning TF–DNA interactions through targeting a particular module is likely widespread. Previously, all tumor suppressors and oncoproteins that associate with the p53 DNA-binding domain, except the oncogenic E6 from human papillomaviruses (HPVs), structurally cluster at the DNA-binding site of p53, complicating drug design. By contrast, iASPP inhibits p53 through a distinct surface overlapping the E6 footprint, opening prospects for p53-targeting precision medicine to improve cancer therapy.
Accurate mapping of RNA-Seq data. Au, Kin Fai. Methods in Molecular Biology (Clifton, N.J.), 01/2015, Volume 1269. Journal Article.
Mapping RNA-Seq data to a genome differs from mapping DNA-Seq data, because junction reads span two exons and have no identical match in the reference genome. In this chapter, we describe a junction read aligner, SpliceMap, that is based on an algorithm of "half-read seeding" and "seeding extension." Four analysis steps are integrated in SpliceMap (half-read mapping, seeding selection, seeding extension and junction search, and paired-end filtering), and all tuning parameters of these steps can be edited in a single configuration file. Thus, SpliceMap can be executed by a single command. While describing the analysis steps of SpliceMap, we illustrate how to choose the parameters according to the research interest and RNA-Seq data quality, using an example of human brain RNA-Seq data.
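The general idea of seeding with half a read and then extending it to locate a junction can be sketched in Python as below. This is a simplified illustration, not SpliceMap's implementation: it uses exact string matching, seeds only with the first half of the read, takes only the first seed hit, ignores splice-site signals and paired-end filtering, and the function name and `max_intron` parameter are hypothetical.

```python
def find_junction(read, genome, max_intron=1000):
    """Return (read_start, donor_end, acceptor_start) for a spliced read,
    (read_start, None, None) for a read that maps without a junction,
    or None if the read cannot be placed by this simple scheme."""
    half = len(read) // 2
    seed = read[:half]

    # Half-read mapping: place the first half of the read exactly.
    start = genome.find(seed)
    if start == -1:
        return None

    # Seeding extension: extend base by base until the first mismatch.
    matched = half
    while (matched < len(read)
           and start + matched < len(genome)
           and genome[start + matched] == read[matched]):
        matched += 1
    if matched == len(read):
        return (start, None, None)  # contiguous (exonic) read

    # Junction search: look for the unmatched residual downstream,
    # within an assumed maximum intron length.
    residual = read[matched:]
    donor_end = start + matched  # first base after the matched exon part
    window_end = min(len(genome), donor_end + max_intron + len(residual))
    acceptor = genome.find(residual, donor_end + 1, window_end)
    if acceptor == -1:
        return None
    return (start, donor_end, acceptor)
```

On a toy genome made of two exons separated by the intron `GTAAGTTTTTCAG`, a read spanning the exon boundary returns the intron as `genome[donor_end:acceptor_start]`. A real aligner would also need to resolve cases where the extension runs a few bases into the intron because those bases happen to match the next exon.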