Given the wide variability in the quality of next-generation sequencing data submitted to public repositories, it is essential to identify methods that can perform quality control on these data sets when additional quality control data, such as mean tile data, are missing from public repositories. In this study, we present evidence that correlating counts of reads corresponding to pairs of motifs separated over specific distances on individual exons can be used as a proxy for mean tile data in the data sets we analyzed and hence could be used when mean tile data are not available. As test data sets we use the in vitro transcribed (IVT) data set, and a data set comprising wild and mutant types. We find that a FastQC analysis of the available parts of these data sets demonstrates that the per-tile sequencing quality is good for all the data sets apart from the mutant-type data, where the mutant-r3 data are worse than the mutant-r2 data. Correspondingly, intra-exon motif correlations are reasonably large for all data sets except this latter case, where the mutant-r2 correlations are low and the mutant-r3 correlations are close to zero. We propose that these extremely low correlations are indicative of bias of technical origin, such as flowcell errors. In addition to this, the motif correlations as a function of both guanosine-cytosine (GC) content parameters are somewhat higher and less dependent on the GC content parameters in the messenger RNA (mRNA) RNA-Seq sample (control) than in the other RNA-Seq samples that did undergo mRNA selection: both ribosomal depletion and PolyA selection (wild type and mutant).
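The idea of an intra-exon motif correlation can be illustrated with a minimal sketch. It assumes that reads have already been assigned to exons, counts the reads containing each of two motifs per exon, and computes a Pearson correlation of those counts across exons. The function names are hypothetical, and the distance constraint between motifs described above is deliberately left out; this is not the authors' implementation.

```python
from math import sqrt


def count_motif_reads(reads, motif):
    """Count the reads that contain the motif at least once."""
    return sum(1 for read in reads if motif in read)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of counts.

    Returns 0.0 when either sequence has no variance (assumption made
    here so that constant counts do not raise a division error).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def motif_pair_correlation(reads_by_exon, motif_a, motif_b):
    """Correlate per-exon read counts of two motifs across all exons.

    reads_by_exon maps an exon identifier to the list of read sequences
    assigned to that exon (hypothetical input layout).
    """
    counts_a = [count_motif_reads(r, motif_a) for r in reads_by_exon.values()]
    counts_b = [count_motif_reads(r, motif_b) for r in reads_by_exon.values()]
    return pearson(counts_a, counts_b)
```

Under this reading, a correlation close to zero across many exons would be the kind of signal the abstract associates with technical bias.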
We discuss the applicability of the Microsoft cloud computing platform, Azure, for bioinformatics. We focus on the usability of the resource rather than its performance. We provide an example of how R can be used on Azure to analyse a large amount of microarray expression data deposited at the public database ArrayExpress. We provide a walkthrough in Appendix S1 to demonstrate explicitly how Azure can be used to perform these analyses, and we offer a comparison with a local computation. We note that the use of the Platform as a Service (PaaS) offering of Azure can represent a steep learning curve for bioinformatics developers, who will usually have a Linux and scripting language background. On the other hand, the presence of an additional set of libraries makes it easier to deploy software in a parallel (scalable) fashion and explicitly manage such a production run with only a few hundred lines of code, most of which can be incorporated from a template. We propose that this environment is best suited for running stable bioinformatics software by users not involved with its development.
Phytohormones regulate plant growth from cell division to organ development. Jasmonates (JAs) are signaling molecules that have been implicated in stress-induced responses. However, they have also been shown to inhibit plant growth, but the mechanisms are not well understood. The effects of methyl jasmonate (MeJA) on leaf growth regulation were investigated in Arabidopsis (Arabidopsis thaliana) mutants altered in JA synthesis and perception, allene oxide synthase and coi1-16B (for coronatine insensitive1), respectively. We show that MeJA inhibits leaf growth through the JA receptor COI1 by reducing both cell number and size. Further investigations using flow cytometry analyses allowed us to evaluate ploidy levels and to monitor cell cycle progression in leaves and cotyledons of Arabidopsis and/or Nicotiana benthamiana at different stages of development. Additionally, a novel global transcription profiling analysis involving continuous treatment with MeJA was carried out to identify the molecular players whose expression is regulated during leaf development by this hormone and COI1. The results of these studies revealed that MeJA delays the switch from the mitotic cell cycle to the endoreduplication cycle, which accompanies cell expansion, in a COI1-dependent manner and inhibits the mitotic cycle itself, arresting cells in G1 phase prior to the S-phase transition. Significantly, we show that MeJA activates critical regulators of endoreduplication and affects the expression of key determinants of DNA replication. Our discoveries also suggest that MeJA may contribute to the maintenance of a cellular "stand-by mode" by keeping the expression of ribosomal genes at an elevated level. Finally, we propose a novel model for MeJA-regulated COI1-dependent leaf growth inhibition.
Abstract
The paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework to analyse large fractions of the Protein Data Bank that is key for high-throughput studies of, for example, protein–ligand docking, clustering of protein–ligand complexes and structural alignment. Specifically, we review a number of implementations in the literature that use Hadoop for high-throughput analyses, and their scalability. We find that these deployments for the most part use known executables called from MapReduce rather than rewriting the algorithms. The scalability exhibits a variable behaviour in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. Direct comparisons of Hadoop with batch schedulers are absent in the literature, but we note there is some evidence that Message Passing Interface implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and configuration of a resource to use Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases and standardised approaches such as Workflow Languages (i.e. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.
The FAIR data principles are rapidly becoming a standard through which to assess responsible and reproducible research. In contrast to the requirements associated with the Interoperability principle, the requirements associated with the Accessibility principle are often assumed to be relatively straightforward to implement. Indeed, a variety of different tools assessing FAIR rely on the data being deposited in a trustworthy digital repository. In this paper we note that there is an implicit assumption that access to a repository is independent of where the user is geographically located. Using a virtual private network (VPN) service, we find that access to a set of web sites that underpin Open Science is variable from a set of 14 countries, either through connectivity issues (i.e., connections to download HTML being dropped) or through direct blocking (i.e., web servers sending 403 error codes). Many of the countries included in this study are already marginalized from Open Science discussions due to political issues or infrastructural challenges. This study clearly indicates that access to FAIR data resources is influenced by a range of geo-political factors. Given the volatile nature of politics and the slow pace of infrastructural investment, this is likely to continue to be an issue and indeed may grow. We propose that it is essential for discussions and implementations of FAIR to include awareness of these issues of accessibility. Without this awareness, the expansion of FAIR data may unintentionally reinforce current access inequities and research inequalities around the globe.
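The two failure modes named above (dropped connections and HTTP 403 responses) can be told apart with a simple classification step. The following is a minimal sketch under assumed inputs: each observation is a hypothetical `(country, site, status_code)` tuple in which `None` stands for a connection that was dropped before any status was received. The category labels and function names are illustrative and not taken from the study.

```python
from collections import Counter


def classify_access(status_code):
    """Map one observation to an access category.

    status_code is an HTTP status as an int, or None when the
    connection was dropped before a response arrived (assumption).
    """
    if status_code is None:
        return "connectivity issue"
    if status_code == 403:
        return "blocked"
    if 200 <= status_code < 300:
        return "accessible"
    return "other"


def summarise_by_country(observations):
    """Tally access categories per country.

    observations is an iterable of (country, site, status_code) tuples.
    """
    summary = {}
    for country, _site, status_code in observations:
        category = classify_access(status_code)
        summary.setdefault(country, Counter())[category] += 1
    return summary
```

Keeping the classification separate from the data collection means the same tally can be applied to results gathered through any VPN exit point.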
Detecting sources of bias in transcriptomic data is essential to determine signals of biological significance. We outline a novel method to detect sequence-specific bias in short-read Next Generation Sequencing data. The method is based on determining intra-exon correlations between specific motifs, and requires only the mild assumption that short reads sampled from specific regions of the same exon will be correlated with each other. It has been implemented on Apache Spark and used to analyse two eye-antennal disc data sets generated at the same laboratory. The wild type data set in Drosophila indicates a variation due to motif GC content that is more significant than that found due to exon GC content. The software is available online and could be applied for cross-experiment transcriptome data analysis in eukaryotes.
The workflow for the production of high-throughput sequencing data from nucleic acid samples is complex. There are a series of protocol steps to be followed in the preparation of samples for next-generation sequencing. The quantification of bias in a number of protocol steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be determined.
We examined the experimental metadata of the public repository Sequence Read Archive (SRA) in order to ascertain the level of annotation of important sequencing steps in submissions to the database. Using SQL relational database queries (using the SRAdb SQLite database generated by the Bioconductor consortium) to search for keywords commonly occurring in key preparatory protocol steps partitioned over studies, we found that 7.10%, 5.84% and 7.57% of all records (fragmentation, ligation and enrichment, respectively), had at least one keyword corresponding to one of the three protocol steps. Only 4.06% of all records, partitioned over studies, had keywords for all three steps in the protocol (5.58% of all SRA records).
The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream from this, meta-analyses and comparative studies based on these data will have a source of bias that cannot be quantified at present.
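The keyword search over protocol annotations described above can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption made for illustration: the table name `experiment`, the column `library_construction_protocol`, the keyword stems, and the toy rows do not come from the study, and the real SRAdb schema should be checked before adapting this. The partitioning over studies is also left out.

```python
import sqlite3

# Hypothetical keyword stems per protocol step (not the study's list).
KEYWORDS = {
    "fragmentation": ["fragment", "shear", "sonicat"],
    "ligation": ["ligat"],
    "enrichment": ["enrich", "pcr", "amplif"],
}

COLUMN = "library_construction_protocol"  # assumed column name


def make_demo_db():
    """Build a tiny in-memory database with invented protocol text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE experiment (accession TEXT, {COLUMN} TEXT)")
    rows = [
        ("X1", "DNA was fragmented by sonication, adapters ligated, PCR enriched"),
        ("X2", "Adapters were ligated"),
        ("X3", None),
        ("X4", "standard protocol"),
    ]
    conn.executemany("INSERT INTO experiment VALUES (?, ?)", rows)
    return conn


def percent_annotated(conn, steps):
    """Percentage of records whose protocol text mentions every given step."""
    clauses, params = [], []
    for step in steps:
        words = KEYWORDS[step]
        # A step matches if any of its keyword stems occurs in the text.
        clauses.append("(" + " OR ".join(f"LOWER({COLUMN}) LIKE ?" for _ in words) + ")")
        params.extend(f"%{word}%" for word in words)
    where = " AND ".join(clauses)
    total = conn.execute("SELECT COUNT(*) FROM experiment").fetchone()[0]
    hits = conn.execute(
        f"SELECT COUNT(*) FROM experiment WHERE {where}", params
    ).fetchone()[0]
    return 100.0 * hits / total if total else 0.0
```

Records with no protocol text never match, because a `LIKE` comparison against `NULL` is not true in SQL, so they count towards the denominator only.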
The CODATA-RDA Schools for Research Data Science (SRDS) is a network of schools originating in the RDA in 2016. In 2019 it was recognized as an RDA output. To date, over 400 students from 40 countries have been trained in 10 schools. The majority of these students were postgraduates from low/middle-income countries (LMICs). In contrast to many other data science training approaches, the SRDS schools are designed to be 2-week, discipline-agnostic, residential events where students are introduced to a broad range of tools requisite for efficient and responsible data-centric research. This paper presents the results of a survey carried out on alumni from schools held between 2016 and 2019 (45% response). The results of the survey strongly support the SRDS's long-term goals of facilitating data science training/capacity building within LMICs and fostering communities of early career researchers (ECRs) conducting responsible and open data science research. The survey results demonstrated that 90% of respondent alumni continued to conduct research and make use of the skills acquired at the SRDS. Modules on open and responsible research and research data management were rated as important for future research. 79% of respondents confirmed that they maintained contact with peers, and 31% had set up academic collaborations with peers and/or instructors. Many had gone on to present content from the schools in their home institutions. The survey results clearly demonstrate the impact of the SRDS, and the value of an expanding network of schools supported by the RDA and CODATA. Keywords: RDA, CODATA, data science, school, alumni, survey
A method to detect DNA‐binding sites on the surface of a protein structure is important for functional annotation. This work describes the analysis of residue patches on the surface of DNA‐binding proteins and the development of a method of predicting DNA‐binding sites using a single feature of these surface patches. Surface patches and the DNA‐binding sites were initially analysed for accessibility, electrostatic potential, residue propensity, hydrophobicity and residue conservation. From this, it was observed that the DNA‐binding sites were, in general, amongst the top 10% of patches with the largest positive electrostatic scores. This knowledge led to the development of a prediction method in which patches of surface residues were selected such that they excluded residues with negative electrostatic scores. This method was used to make predictions for a data set of 56 non‐homologous DNA‐binding proteins. Correct predictions were made for 68% of the data set.
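The two selection rules mentioned above (dropping residues with negative electrostatic scores, and ranking patches by electrostatic score) can be sketched as follows. This is a simplified illustration under assumed inputs: per-residue and per-patch scores are taken as already computed, and the function names and data layout are hypothetical rather than the published method.

```python
import math


def filter_patch(residue_scores):
    """Keep only residues whose electrostatic score is not negative.

    residue_scores is a list of (residue_id, score) pairs.
    """
    return [residue_id for residue_id, score in residue_scores if score >= 0]


def top_patches(patch_scores, fraction=0.10):
    """Return the names of the highest-scoring patches.

    patch_scores maps a patch name to its electrostatic score; the top
    `fraction` of patches (at least one) is returned, best first.
    """
    ranked = sorted(patch_scores.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return [name for name, _score in ranked[:keep]]
```

A predicted binding site under this sketch would be the residues of a top-ranked patch after filtering, which could then be compared with the known DNA-contacting residues.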