Versatile and efficient variant calling tools are needed to analyze large-scale sequencing datasets. In particular, identification of copy number changes remains a challenging task due to their complexity, susceptibility to sequencing biases, variation in coverage data and dependence on genome-wide sample properties, such as tumor polyploidy or polyclonality in cancer samples.
We have developed a new tool, Canvas, for identification of copy number changes from diverse sequencing experiments including whole-genome matched tumor-normal and single-sample normal re-sequencing, as well as whole-exome matched and unmatched tumor-normal studies. In addition to variant calling, Canvas infers genome-wide parameters such as cancer ploidy, purity and heterogeneity. It provides fast and easy-to-run workflows that can scale to thousands of samples and can be easily incorporated into variant calling pipelines.
Canvas is distributed under an open source license and can be downloaded from https://github.com/Illumina/canvas
eroller@illumina.com
Supplementary data are available at Bioinformatics online.
Molecular dynamics simulation methods produce trajectories of atomic positions (and optionally velocities and energies) as a function of time and provide a representation of the sampling of a given molecule's energetically accessible conformational ensemble. As simulations on the 10−100 ns time scale become routine, with sampled configurations stored on the picosecond time scale, such trajectories contain large amounts of data. Data-mining techniques, like clustering, provide one means to group and make sense of the information in the trajectory. In this work, several clustering algorithms were implemented, compared, and utilized to understand MD trajectory data. The development of the algorithms into a freely available C code library, and their application to a simple test example of random (or systematically placed) points in a 2D plane (where the pairwise metric is the distance between points) provide a means to understand the relative performance. Eleven different clustering algorithms were developed, ranging from top-down splitting (hierarchical) and bottom-up aggregating (including single-linkage edge joining, centroid-linkage, average-linkage, complete-linkage, centripetal, and centripetal-complete) to various refinement (means, Bayesian, and self-organizing maps) and tree (COBWEB) algorithms. Systematic testing in the context of MD simulation of various DNA systems (including DNA single strands and the interaction of a minor groove binding drug DB226 with a DNA hairpin) allows a more direct assessment of the relative merits of the distinct clustering algorithms. Additionally, means to assess the relative performance and differences between the algorithms, to dynamically select the initial cluster count, and to achieve faster data mining by “sieved clustering” were evaluated.
Overall, it was found that there is no one perfect “one size fits all” algorithm for clustering MD trajectories and that the results strongly depend on the choice of atoms for the pairwise comparison. Some algorithms tend to produce homogeneously sized clusters, whereas others have a tendency to produce singleton clusters. Issues related to the choice of pairwise metric, the clustering metrics, the atom selection used for the comparison, and the relative performance are discussed. Overall, the best performance was observed with the average-linkage, means, and SOM algorithms. If the cluster count is not known in advance, the hierarchical or average-linkage clustering algorithms are recommended. Although these algorithms perform well, it is important to be aware of the limitations or weaknesses of each algorithm, specifically the high sensitivity to outliers with hierarchical, the tendency to generate homogeneously sized clusters with means, and the tendency to produce small or singleton clusters with average-linkage.
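The 2D test case described above (points in a plane, with the distance between points as the pairwise metric) can be illustrated with a minimal pure-Python sketch of average-linkage clustering. This is not the C library from the work; the function name `average_linkage` and the sample points are hypothetical, and the naive search over all cluster pairs is only suitable for small inputs.

```python
import math
from itertools import combinations

def average_linkage(points, n_clusters):
    """Bottom-up clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance until n_clusters remain."""
    # Start with every point in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average of all point-to-point distances between clusters a and b.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under the average-linkage metric.
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return [sorted(c) for c in clusters]

# Hypothetical test data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9),
          (9.0, 0.0)]
result = average_linkage(points, 3)
print(result)
```

With these made-up points the isolated point ends up alone in its own cluster, which is in line with the tendency toward small or singleton clusters noted for average-linkage.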
An ultrafast DNA sequence aligner (Isaac Genome Alignment Software) that takes advantage of high-memory hardware (>48 GB) and a variant caller (Isaac Variant Caller) have been developed. We demonstrate that our combined pipeline (Isaac) is four to five times faster than BWA + GATK on equivalent hardware, with comparable accuracy as measured by trio conflict rates and sensitivity. We further show that Isaac is effective in the detection of disease-causing variants and can be run easily and economically on commodity hardware.
Isaac has an open source license and can be obtained at https://github.com/sequencing.
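The trio conflict rate used above as an accuracy measure can be sketched as the fraction of sites where a child's genotype cannot be explained by one allele from each parent. The sketch below is not Isaac's evaluation code; it assumes diploid autosomal sites, encodes genotypes as allele pairs, and ignores de novo mutations. The function names and example genotypes are hypothetical.

```python
def is_mendelian_conflict(child, mother, father):
    """True if the child's two alleles cannot come one from each parent."""
    a, b = child
    consistent = (a in mother and b in father) or (b in mother and a in father)
    return not consistent

def trio_conflict_rate(sites):
    """Fraction of (child, mother, father) genotype triples in conflict."""
    conflicts = sum(is_mendelian_conflict(*site) for site in sites)
    return conflicts / len(sites)

# Hypothetical calls; 0 = reference allele, 1 = alternate allele.
sites = [
    ((0, 1), (0, 0), (0, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # conflict: mother cannot supply allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (1, 1), (1, 1)),  # conflict: neither parent carries allele 0
]
rate = trio_conflict_rate(sites)
print(rate)
```

A lower rate suggests fewer genotyping errors, although real evaluations would also need to handle missing calls and sex chromosomes.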
Current diagnostic testing for genetic disorders involves serial use of specialized assays spanning multiple technologies. In principle, genome sequencing (GS) can detect all genomic pathogenic variant types on a single platform. Here we evaluate copy-number variant (CNV) calling as part of a clinically accredited GS test.
We performed analytical validation of CNV calling on 17 reference samples, compared the sensitivity of GS-based variants with those from a clinical microarray, and set a bound on precision using orthogonal technologies. We developed a protocol for family-based analysis of GS-based CNV calls, and deployed this across a clinical cohort of 79 rare and undiagnosed cases.
We found that CNV calls from GS are at least as sensitive as those from microarrays, while only creating a modest increase in the number of variants interpreted (~10 CNVs per case). We identified clinically significant CNVs in 15% of the first 79 cases analyzed, all of which were confirmed by an orthogonal approach. The pipeline also enabled discovery of a uniparental disomy (UPD) and a 50% mosaic trisomy 14. Directed analysis of select CNVs enabled breakpoint level resolution of genomic rearrangements and phasing of de novo CNVs.
Robust identification of CNVs by GS is possible within a clinical testing environment.
Most tandem mass spectrometry (MS/MS) database search algorithms perform a restrictive search that takes into account only a few types of post-translational modifications (PTMs) and ignores all others. We describe an unrestrictive PTM search algorithm, MS-Alignment, that searches for all types of PTMs at once in a blind mode, that is, without knowing which PTMs exist in nature. Blind PTM identification makes it possible to study the extent and frequency of different types of PTMs, still an open problem in proteomics. Application of this approach to lens proteins resulted in the largest set of PTMs reported in human crystallins so far. Our analysis of various MS/MS data sets implies that the biological phenomenon of modification is much more widespread than previously thought. We also argue that MS-Alignment reveals some uncharacterized modifications that warrant further experimental validation.
Reliable identification of posttranslational modifications is key to understanding various cellular regulatory processes. We describe a tool, InsPecT, to identify posttranslational modifications using tandem mass spectrometry data. InsPecT constructs database filters that proved to be very successful in genomics searches. Given an MS/MS spectrum S and a database D, a database filter selects a small fraction of database D that is guaranteed (with high probability) to contain a peptide that produced S. InsPecT uses peptide sequence tags as efficient filters that reduce the size of the database by a few orders of magnitude while retaining the correct peptide with very high probability. In addition to filtering, InsPecT also uses novel algorithms for scoring and validating in the presence of modifications, without explicit enumeration of all variants. InsPecT identifies modified peptides with accuracy better than or equivalent to that of other database search tools while being 2 orders of magnitude faster than SEQUEST, and substantially faster than X!TANDEM on complex mixtures. The tool was used to identify a number of novel modifications in different data sets, including many phosphopeptides in data provided by Alliance for Cellular Signaling that were missed by other tools.
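The idea of a sequence tag acting as a database filter can be shown with a deliberately simplified sketch. It is not InsPecT's implementation: it assumes tags are short amino acid strings already derived from the spectrum, it ignores the flanking masses that real tags carry, and the protein names, sequences, and the function `filter_by_tags` are hypothetical.

```python
def filter_by_tags(database, tags):
    """Keep only the database entries that contain at least one tag.

    database: dict mapping protein name -> amino acid sequence
    tags: iterable of short amino acid strings inferred from a spectrum
    """
    return {name: seq for name, seq in database.items()
            if any(tag in seq for tag in tags)}

# Hypothetical three-protein database and two tags from one spectrum.
database = {
    "protA": "MKWVTFISLLLLFSSAYSRGV",
    "protB": "MDVFMKGLSKAKEGVVAAAEK",
    "protC": "MALWMRLLPLLALLALWGPDP",
}
tags = ["GLS", "KEG"]
candidates = filter_by_tags(database, tags)
print(sorted(candidates))
```

Only the surviving candidates would then need to be scored against the spectrum, which is where the reduction in search time comes from.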
The central role of protein kinases in signal transduction pathways has generated intense interest in targeting these enzymes for a wide range of therapeutic indications. Here we report a method for identifying and quantifying protein kinases in any biological sample or tissue from any species. The procedure relies on acyl phosphate-containing nucleotides, prepared from a biotin derivative and ATP or ADP. The acyl phosphate probes react selectively and covalently at the ATP binding sites of at least 75% of the known human protein kinases. Biotinylated peptide fragments from labeled proteomes are captured and then sequenced and identified using a mass spectrometry-based analysis platform to determine the kinases present and their relative levels. Further, direct competition between the probes and inhibitors can be assessed to determine inhibitor potency and selectivity against native protein kinases, as well as hundreds of other ATPases. The ability to broadly profile kinase activities in native proteomes offers an exciting prospect for both target discovery and inhibitor selectivity profiling.
Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct and efficiently search spectra against a genomic database, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project.
Microarray experiments measure changes in the expression of thousands of genes. The resulting lists of genes with changes in expression are then searched for biologically related sets using several divergent methods such as the Fisher Exact Test (as used in multiple GO enrichment tools), Parametric Analysis of Gene Expression (PAGE), Gene Set Enrichment Analysis (GSEA), and the connectivity map.
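For orientation, the Fisher Exact Test mentioned above is commonly applied to gene set enrichment as a one-sided hypergeometric tail probability. The sketch below shows that generic calculation only; it is not one of the cited tools, and the function name and the example counts are made up for illustration.

```python
from math import comb

def enrichment_p_value(n_genome, n_set, n_list, n_overlap):
    """One-sided Fisher exact test for over-representation.

    Probability of seeing at least n_overlap genes from a set of size
    n_set when n_list genes are drawn at random from n_genome genes.
    """
    total = comb(n_genome, n_list)
    upper = min(n_set, n_list)
    # Sum the hypergeometric probabilities for overlaps >= n_overlap.
    tail = sum(comb(n_set, k) * comb(n_genome - n_set, n_list - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical counts: 20 genes measured, a 5-gene set,
# 5 differentially expressed genes, 3 of them in the set.
p = enrichment_p_value(20, 5, 5, 3)
print(round(p, 4))
```

Exact integer binomials keep the calculation free of rounding error until the final division, at the cost of speed for very large gene counts.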
We describe an analytical method (Geneva: Gene Vector Analysis) to relate genes to biological properties and to other similar experiments in a uniform way. This new method works on both gene sets and on gene lists/vectors as input queries, and can effectively query databases consisting of sets of biologically related sets, or of results from other microarray experiments. We also present an improvement to the null model estimate by using the empirical background distribution drawn from previous experiments. We validated Geneva by rediscovering a number of previous findings, and by finding significant relationships within microarrays in the GEO repository.
Provided a reasonable corpus of previous experiments is available, this method is more accurate than the class label permutation model, especially for data sets with a limited number of replicates. Geneva is, moreover, computationally faster because the background distributions can be precomputed. We also provide a standard evaluation data set based on 5 pairs of related experiments that should share similar functional relationships and 28 pairs of unrelated experiments from GEO. Discovering relationships amongst GEO data sets has implications for drug repositioning, and understanding relationships between diseases and drugs.
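One way to picture a precomputed empirical background is as a sorted list of scores from earlier experiments, against which a new score is ranked. The sketch below uses a common add-one empirical p-value estimator as an assumption; the source does not specify Geneva's estimator, and the scores and the function `empirical_p_value` are hypothetical.

```python
from bisect import bisect_left

def empirical_p_value(observed, ranked_background):
    """Share of background scores at least as large as the observed one.

    ranked_background must be sorted in ascending order. The +1 terms
    keep the estimate above zero for a finite background sample.
    """
    n_ge = len(ranked_background) - bisect_left(ranked_background, observed)
    return (n_ge + 1) / (len(ranked_background) + 1)

# Hypothetical similarity scores from previous experiments; the sort is
# done once and reused for every later query.
background = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7, 0.6, 0.8]
ranked = sorted(background)

p_emp = empirical_p_value(0.85, ranked)
print(p_emp)
```

Because each query is a binary search over a list that is sorted once, no per-query permutation of class labels is needed.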
Background: To evaluate the relationship between sleep and next-day physical activity (PA) under free-living conditions in women.
Methods: Sleep and PA were measured objectively for 7 consecutive days by accelerometry in 330 young adult women (aged 17–25 y). A structural equation model was used to evaluate the relationship between the driving factor of sleep (total sleep or morning wake time) and the amount of nonsleep sedentary (SED) and moderate to vigorous physical activity (MVPA) each day.
Results: With sleep duration as the driving factor, the estimates of β_SED and β_MVPA were −0.415 and −0.093, respectively (P ≤ .05). For every hour slept, a 24.9-minute reduction in SED time and a 5.58-minute reduction in MVPA were observed. With wake time as the driving factor, the estimates of β_SED and β_MVPA were −0.636 and −0.149, respectively. For every wake time that was 1 hour later, a 38.2-minute decrease in SED and an 8.9-minute decrease in MVPA (P ≤ .05) were observed.
Conclusions: Women who wake later or who sleep longer tend to get less MVPA throughout the day. Getting up earlier and going to bed earlier may support behaviors that improve PA and lifestyle.
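The minute values reported in the Results are consistent with reading each β estimate as hours of activity per 1-hour change in the driving factor and multiplying by 60. That unit interpretation is an assumption inferred from the numbers, not something the abstract states; the short check below only reproduces the arithmetic, and the function and dictionary names are hypothetical.

```python
def minutes_per_hour(beta):
    """Convert a coefficient in hours-per-hour to minutes-per-hour."""
    return round(beta * 60, 2)

# Coefficients as reported in the Results.
betas = {
    "sleep duration, SED": -0.415,
    "sleep duration, MVPA": -0.093,
    "wake time, SED": -0.636,
    "wake time, MVPA": -0.149,
}

minutes = {label: minutes_per_hour(b) for label, b in betas.items()}
for label, value in minutes.items():
    print(f"{label}: {value} min per hour")
```

The wake-time values come out as −38.16 and −8.94 minutes, which match the reported 38.2 and 8.9 after rounding to one decimal place.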