Published genomes frequently contain erroneous gene models that represent issues associated with identification of open reading frames, start sites, splice sites, and related structural features. The source of these inconsistencies is often traced back to integration across text file formats designed to describe long read alignments and predicted gene structures. In addition, the majority of gene prediction frameworks do not provide robust downstream filtering to remove problematic gene annotations, nor do they represent these annotations in a format consistent with current file standards. These frameworks also lack consideration for functional attributes, such as the presence or absence of protein domains that can be used for gene model validation. To provide oversight to the increasing number of published genome annotations, we present a software package, Gene Filtering, Analysis, and Conversion (gFACs), to filter, analyze, and convert predicted gene models and alignments. The software operates across a wide range of alignment, analysis, and gene prediction files with a flexible framework for defining gene models with reliable structural and functional attributes. gFACs supports common downstream applications, including genome browsers, and generates extensive details on the filtering process, including distributions that can be visualized to further assess the proposed gene space. gFACs is freely available and implemented in Perl with support from BioPerl libraries at https://gitlab.com/PlantGenomicsLab/gFACs.
Simulations of close relatives and identical by descent (IBD) segments are common in genetic studies, yet most past efforts have utilized sex-averaged genetic maps and ignored crossover interference, thus omitting features known to affect the breakpoints of IBD segments. We developed Ped-sim, a method for simulating relatives that can utilize either sex-specific or sex-averaged genetic maps and also either a model of crossover interference or the traditional Poisson model for inter-crossover distances. To characterize the impact of previously ignored mechanisms, we simulated data for all four combinations of these factors. We found that modeling crossover interference decreases the standard deviation of pairwise IBD proportions by 10.4% on average in full siblings through second cousins. By contrast, sex-specific maps increase this standard deviation by 4.2% on average, and also impact the number of segments relatives share. Most notably, using sex-specific maps, the number of segments half-siblings share is bimodal; and when combined with interference modeling, the probability that sixth cousins have non-zero IBD sharing ranges from 9.0 to 13.1%, depending on the sexes of the individuals through which they are related. We present new analytical results for the distributions of IBD segments under these models and show they match results from simulations. Finally, we compared IBD sharing rates between simulated and real relatives and found that the combination of sex-specific maps and interference modeling most accurately captures IBD rates in real data. Ped-sim is open source and available from https://github.com/williamslab/ped-sim.
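As a rough illustration of the two inter-crossover distance models contrasted above, the following Python sketch draws crossover positions either from exponential distances (the Poisson model) or from gamma-distributed distances, which is one common way to represent interference. This is not Ped-sim's implementation: the function names, the gamma shape of 4, the 2-Morgan chromosome, and the burn-in length are all assumptions chosen for illustration, and the gamma model is applied directly to a single gamete for simplicity.

```python
import random
import statistics


def sample_crossovers(length_morgans, shape, rng, burn_in=10.0):
    """Return crossover positions (in Morgans) along one chromosome.

    shape == 1.0 gives exponential inter-crossover distances (Poisson model).
    shape > 1.0 gives more regular gamma-distributed distances, used here as
    a simplified stand-in for crossover interference (an assumption).
    """
    scale = 1.0 / shape  # keeps the mean inter-crossover distance at 1 Morgan
    position = -burn_in  # start before the chromosome so the process settles
    crossovers = []
    while True:
        position += rng.gammavariate(shape, scale)
        if position > length_morgans:
            break
        if position >= 0.0:
            crossovers.append(position)
    return crossovers


def count_summary(shape, length_morgans=2.0, reps=5000, seed=1):
    """Mean and variance of the crossover count over repeated meioses."""
    rng = random.Random(seed)
    counts = [
        len(sample_crossovers(length_morgans, shape, rng)) for _ in range(reps)
    ]
    return statistics.mean(counts), statistics.pvariance(counts)


# Hypothetical settings: shape 1.0 is the Poisson model, shape 4.0 is an
# illustrative interference strength, not a value taken from the paper.
poisson_mean, poisson_var = count_summary(shape=1.0)
interf_mean, interf_var = count_summary(shape=4.0)
print(f"Poisson model:      mean={poisson_mean:.2f} variance={poisson_var:.2f}")
print(f"Interference model: mean={interf_mean:.2f} variance={interf_var:.2f}")
```

Under these assumptions both models place about two crossovers on the chromosome on average, but the gamma model spaces them more evenly, so the crossover count varies less from meiosis to meiosis. That is the same direction of effect as the reduced spread of IBD proportions reported above, although this toy example does not reproduce those figures.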
Abstract
The giant sequoia (Sequoiadendron giganteum) of California are massive, long-lived trees that grow along the U.S. Sierra Nevada mountains. Genomic data are limited in giant sequoia and producing a reference genome sequence has been an important goal to allow marker development for restoration and management. Using deep-coverage Illumina and Oxford Nanopore sequencing, combined with Dovetail chromosome conformation capture libraries, the genome was assembled into eleven chromosome-scale scaffolds containing 8.125 Gbp of sequence. Iso-Seq transcripts, assembled from three distinct tissues, were used as evidence to annotate a total of 41,632 protein-coding genes. The genome was found to contain, distributed unevenly across all 11 chromosomes and in 63 orthogroups, over 900 complete or partial predicted NLR genes, of which 375 are supported by annotation derived from protein evidence and gene modeling. This giant sequoia reference genome sequence represents the first genome sequenced in the Cupressaceae family, and lays a foundation for using genomic tools to aid in giant sequoia conservation and management.
Premise
Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein‐coding gene predictions.
Methods
The impact of repeat masking, long‐read and short‐read inputs, and de novo and genome‐guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity.
Results
Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono‐exonic/multi‐exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA‐read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence‐based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. Adding protein evidence from de novo assemblies, genome‐guided transcriptome assemblies, or full‐length proteins from OrthoDB generates more putative false positives as implemented in the current workflows. Post‐processing with functional and structural filters is highly recommended.
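One of the structural benchmarks named above, the mono-exonic versus multi-exonic count, can be sketched in a few lines of Python. This is a minimal illustration rather than the study's pipeline: it assumes a GFF3 annotation in which each `exon` feature carries a `Parent` attribute naming its transcript, and the function name and toy records are hypothetical.

```python
from collections import Counter


def mono_multi_counts(gff3_text):
    """Count transcripts with exactly one exon and with more than one exon."""
    exons_per_transcript = Counter()
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # only well-formed exon features are counted
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # An exon may list several parent transcripts, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons_per_transcript[parent] += 1
    mono = sum(1 for n in exons_per_transcript.values() if n == 1)
    multi = sum(1 for n in exons_per_transcript.values() if n > 1)
    return mono, multi


def exon(seqid, start, end, parent):
    """Build one tab-separated GFF3 exon record for the toy example."""
    return "\t".join(
        [seqid, "toy", "exon", str(start), str(end), ".", "+", ".",
         f"Parent={parent}"]
    )


# Invented records: t1 has one exon, t2 has three, t3 has two.
EXAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    exon("chr1", 100, 900, "t1"),
    exon("chr1", 2000, 2200, "t2"),
    exon("chr1", 2500, 2700, "t2"),
    exon("chr1", 3000, 3300, "t2"),
    exon("chr2", 50, 150, "t3"),
    exon("chr2", 400, 600, "t3"),
])

mono, multi = mono_multi_counts(EXAMPLE_GFF3)
print(f"mono-exonic={mono} multi-exonic={multi} ratio={mono / multi:.2f}")
```

The sketch counts transcripts rather than genes; a per-gene count would first need to pick one representative transcript per gene, which is a design choice left open here.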
Discussion
While the annotation of non‐model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.
Premise
An informatics approach was used for the construction of an Axiom genotyping array from heterogeneous, high‐throughput sequence data to assess the complex genome of loblolly pine (Pinus taeda).
Methods
High‐throughput sequence data, sourced from exome capture and whole genome reduced‐representation approaches from 2698 trees across five sequence populations, were analyzed with the improved genome assembly and annotation for the loblolly pine. A variant detection, filtering, and probe design pipeline was developed to detect true variants across and within populations. From 8.27 million variants, a total of 642,275 were evaluated and 423,695 of those were screened across a range‐wide population.
Results
The final informatics and screening approach delivered an Axiom array representing 46,439 high‐confidence variants to the forest tree breeding and genetics community. Based on the annotated reference genome, 34% were located in or directly upstream or downstream of genic regions.
Discussion
The Pita50K array represents a genome‐wide resource developed from sequence data for an economically important conifer, loblolly pine. It uniquely integrates independent projects that assessed trees sampled across the native range. The challenges associated with the large and repetitive genome are addressed in the development of this resource.
Somatic mutations have important biological ramifications while exhibiting substantial heterogeneity in rate, type, and genomic location. Yet, their sporadic occurrence makes them difficult to study at scale and across individuals. Lymphoblastoid cell lines (LCLs), a model system for human population and functional genomics, harbor large numbers of somatic mutations and have been extensively genotyped. By comparing 1,662 LCLs, we report that the mutational landscape of the genome varies across individuals in terms of the number of mutations, their genomic locations, and their spectra; this variation may itself be modulated by somatic trans-acting mutations. Mutations attributed to the translesion DNA polymerase η follow two different modes of formation, with one mode accounting for the hypermutability of the inactive X chromosome. Nonetheless, the distribution of mutations along the inactive X chromosome appears to follow an epigenetic memory of the active form.
• Analysis of 885,655 mutations from 1,662 lymphoblastoid cell lines (LCLs)
• Inter-individual variation in the rate and genomic distribution of mutations
• BCL6 is a candidate modulator of the mutational landscape in LCLs
• Hypermutation of the inactive X chromosome is attributed to DNA polymerase η
An analysis of the mutational landscape in 1,662 individuals reveals genome-wide variation in mutational loads, genomic distribution, and signatures, all of which appear to be modulated by somatic mutations in trans. The inactive X chromosome is unusual in bearing an excess of replication-timing-uncoupled DNA polymerase η-mediated mutations.
The patterns of genomic mutations are associated with various genomic features, most notably late replication timing, yet it remains contested which mutation types and signatures relate to DNA replication dynamics and to what extent. Here, we perform high-resolution comparisons of mutational landscapes between lymphoblastoid cell lines, chronic lymphocytic leukemia tumors, and three colon adenocarcinoma cell lines, including two with mismatch repair deficiency. Using cell-type-matched replication timing profiles, we demonstrate that mutation rates exhibit heterogeneous replication timing associations among cell types. This cell-type heterogeneity extends to the underlying mutational pathways, as mutational signatures show inconsistent replication timing bias between cell types. Moreover, replicative strand asymmetries exhibit similar cell-type specificity, albeit with different relationships to replication timing than mutation rates. Overall, we reveal an underappreciated complexity and cell-type specificity of mutational pathways and their relationship to replication timing.
• Analysis of somatic mutations and matched replication timing in five cell types
• Cell-type variability in mutation bias toward late replicating regions
• Cell-type heterogeneity in mutational pathways and replicative strand asymmetry
An analysis of the mutational landscape in five cell types reveals diverse relationships between mutation rate, the prevalence of mutational pathways, and replicative strand asymmetry, all in relation to DNA replication timing.
Abstract
Regulation of DNA replication and copy number is necessary to promote genome stability and maintain cell and tissue function. DNA replication is regulated temporally in a process known as replication timing (RT). Rap1-interacting factor 1 (Rif1) is a key regulator of RT and has a critical function in copy number control in polyploid cells. Previously, we demonstrated that Rif1 functions with SUUR to inhibit replication fork progression and promote underreplication (UR) of specific genomic regions. How Rif1-dependent control of RT factors into its ability to promote UR is unknown. By applying a computational approach to measure RT in Drosophila polyploid cells, we show that SUUR and Rif1 have differential roles in controlling UR and RT. Our findings reveal that Rif1 acts to promote late replication, which is necessary for SUUR-dependent underreplication. Our work provides new insight into the process of UR and its links to RT.
In polyploid cells, copy number is not uniform throughout the genome. Underreplication causes pericentric heterochromatin and defined regions of euchromatin to have reduced copy number relative to overall ploidy. SUUR and Rif1 are key regulators of underreplication, and Rif1 is a known regulator of replication timing (RT). Das et al. take a computational approach to measure replication timing in polyploid cells. Their results demonstrate that, while Rif1 and SUUR both promote underreplication, they differentially affect RT.