Recent high-throughput experiments have produced a wealth of heterogeneous datasets, each of which provides information about different aspects of the cell. Consequently, integration of diverse data types is essential in order to address many biological questions. The quality of any integrated analysis system depends upon the quality of its component data, and upon the Gold Standard data used to evaluate it. It is commonly assumed that data quality improves as databases grow and change, particularly for manually curated databases. However, the validity of this assumption can be questioned, given the constant changes in the data coupled with the high level of noise associated with high-throughput experimental techniques. One of the most powerful approaches to data integration is the use of Probabilistic Functional Integrated Networks (PFINs). Here, we systematically analyse the changes in four highly-curated and widely-used online databases and evaluate the extent to which these changes affect the protein function prediction performance of PFINs in the yeast Saccharomyces cerevisiae. We find that the global trend in network performance improves over time. Where individual areas of biology are concerned, however, the most recent files do not always produce the best results. Individual datasets have unique biases towards different biological processes, and by selecting and integrating relevant datasets performance can be improved. When using any type of integrated system to answer a specific biological question, careful selection of raw data and Gold Standard is vital, since the most recent data may not be the most appropriate.
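The abstract does not spell out how a PFIN combines evidence, but a common scheme (not necessarily the one used in this study) weights each dataset by a log-likelihood score against Gold Standard positive and negative pairs, then sums scores per edge under an independence assumption. A minimal sketch of that idea, with all function names hypothetical:

```python
import math

def log_likelihood_score(dataset_pairs, gold_pos, gold_neg):
    """Weight one evidence source against Gold Standard positive/negative pairs.

    LLS = ln( (pos/neg observed in the dataset) / (pos/neg prior) ),
    the naive-Bayes-style weighting used in many PFIN schemes.
    """
    pos = len(dataset_pairs & gold_pos)
    neg = len(dataset_pairs & gold_neg)
    if pos == 0 or neg == 0:
        return 0.0  # no usable signal against this standard
    prior = len(gold_pos) / len(gold_neg)
    return math.log((pos / neg) / prior)

def integrate(datasets, gold_pos, gold_neg):
    """Sum per-dataset scores for each pair, assuming independent evidence."""
    scores = {}
    for pairs in datasets:
        w = log_likelihood_score(pairs, gold_pos, gold_neg)
        for pair in pairs:
            scores[pair] = scores.get(pair, 0.0) + w
    return scores
```

Because the dataset weights depend entirely on the Gold Standard, this sketch also makes concrete why the choice of Gold Standard, and its drift over time, changes network performance.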
In this study we systematically analyse the changes in four highly-curated and widely-used online databases and evaluate the extent to which the changes affect the performance of integrated analyses in predicting the annotation of nodes over time.
Biological experiments give insight into networks of processes inside a cell, but are subject to error and uncertainty. However, due to the overlap between the large number of experiments reported in public databases it is possible to assess the chances of individual observations being correct. In order to do so, existing methods rely on high-quality 'gold standard' reference networks, but such reference networks are not always available.
We present a novel algorithm for computing the probability of network interactions that operates without gold standard reference data. We show that our algorithm outperforms existing gold standard-based methods. Finally, we apply the new algorithm to a large collection of genetic interaction and protein-protein interaction experiments.
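The abstract does not describe the algorithm itself, but the underlying intuition, that agreement across independent experiments can substitute for a gold standard, can be illustrated with a simple Bayesian estimate. All rates and names below are assumptions for illustration, not the paper's method:

```python
def interaction_probability(n_observed, n_experiments, prior=0.01,
                            tpr=0.7, fpr=0.05):
    """P(true interaction | observed in k of n independent experiments).

    Illustrative Bayes update with assumed per-experiment true-positive
    and false-positive rates; a stand-in for gold-standard-free scoring.
    """
    k, n = n_observed, n_experiments
    like_true = tpr ** k * (1 - tpr) ** (n - k)   # P(observations | real)
    like_false = fpr ** k * (1 - fpr) ** (n - k)  # P(observations | spurious)
    num = like_true * prior
    return num / (num + like_false * (1 - prior))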
The integrated dataset and a reference implementation of the algorithm as a plug-in for the Ondex data integration framework are available for download at http://bio-nexus.ncl.ac.uk/projects/nogold/
The development of affordable, high-throughput sequencing technology has led to a flood of publicly available bacterial genome-sequence data. The availability of multiple genome sequences presents both an opportunity and a challenge for microbiologists, and new computational approaches are needed to extract the knowledge that is required to address specific biological problems and to analyse genomic data. The field of e-Science is maturing, and Grid-based technologies can help address this challenge.
The spread of drug resistance amongst clinically-important bacteria is a serious, and growing, problem [1]. However, the analysis of entire genomes requires considerable computational effort, usually including the assembly of the genome and subsequent identification of genes known to be important in pathology. An alternative approach is to use computational algorithms to identify genomic differences between pathogenic and non-pathogenic bacteria, even without knowing the biological meaning of those differences. Such comparisons, however, involve very high-dimensional data; to overcome this problem, a range of techniques for dimensionality reduction have been developed. One such approach is the use of latent-variable models [2]. In latent-variable models, dimensionality reduction is achieved by representing high-dimensional data by a few hidden or latent variables, which are not directly observed but inferred from the observed variables present in the model. Probabilistic Latent Semantic Analysis (PLSA) is an extension of LSA [3]. PLSA is based on a mixture decomposition derived from a latent class model. The main objective of the algorithm, as in LSA, is to represent high-dimensional co-occurrence information in a lower-dimensional way in order to discover the hidden semantic structure of the data using a probabilistic framework.
In this work we applied the PLSA approach to analyse the common genomic features in methicillin resistant Staphylococcus aureus, using tokens derived from amino acid sequences rather than DNA. We characterised genome-scale amino acid sequences in terms of their components, and then investigated the relationships between genomes and tokens and the phenotypes they generated. As a control we used the non-pathogenic model Gram-positive bacterium Bacillus subtilis.
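The standard PLSA model factorises a document-token co-occurrence matrix as P(w|d) = Σ_z P(z|d)P(w|z), fitted by expectation-maximisation; here "documents" are genomes and "tokens" are amino-acid fragments. A minimal sketch of the EM fit (not the specific implementation used in this work):

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Minimal PLSA via EM on a document-token count matrix.

    counts: (n_docs, n_tokens) co-occurrence counts
            (here: genomes x amino-acid tokens).
    Returns P(z|d) and P(w|z).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_tokens = counts.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)        # P(z|d)
    p_w_z = rng.random((n_topics, n_tokens))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)        # P(w|z)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (docs, topics, tokens)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        weighted = counts[:, None, :] * joint        # n(d,w) * P(z|d,w)
        # M-step: re-estimate both conditionals from the responsibilities
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

After fitting, the rows of P(z|d) give each genome's mixture over latent components, which is the lower-dimensional representation used to relate genomes, tokens and phenotypes.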
The development of affordable, high-throughput sequencing technology has led to a flood of publicly available bacterial genome-sequence data. The availability of multiple genome sequences presents ...both an opportunity and a challenge for microbiologists, and new computational approaches are needed to extract the knowledge that is required to address specific biological problems and to analyse genomic data. The field of e-Science is maturing, and Grid-based technologies can help address this challenge.
Cellular senescence might be a tumour suppressing mechanism as well as a contributor to age-related loss of tissue function. It has been characterised classically as the result of the loss of DNA ...sequences called telomeres at the end of chromosomes. However, recent studies have revealed that senescence is in fact an intricate process, involving the sequential activation of multiple cellular processes, which have proven necessary for the establishment and maintenance of the phenotype. Here, we review some of these processes, namely, the role of mitochondrial function and reactive oxygen species, senescence-associated secreted proteins and chromatin remodelling. Finally, we illustrate the use of systems biology to address the mechanistic, functional and biochemical complexity of senescence.
As high-throughput technologies become cheaper and easier to use, raw sequence data and corresponding annotations for many organisms are becoming available. However, sequence data alone is not sufficient to explain the biological behaviour of organisms, which arises largely from complex molecular interactions. There is a need to develop new platform technologies that can be applied to the investigation of whole-genome datasets in an efficient and cost-effective manner. One such approach is the transfer of existing knowledge from well-studied organisms to closely-related organisms. In this paper, we describe a system, BacillusRegNet, for the use of a model organism, Bacillus subtilis, to infer genome-wide regulatory networks in less well-studied close relatives. The putative transcription factors, their binding sequences and predicted promoter sequences along with annotations are available from the associated BacillusRegNet website (http://bacillus.ncl.ac.uk).
The rapid and cost-effective identification of bacterial species is crucial, especially for clinical diagnosis and treatment. Peptide aptamers have been shown to be valuable for use as a component of novel, direct detection methods. These small peptides have a number of advantages over antibodies, including greater specificity and longer shelf life. These properties facilitate their use as the detector components of biosensor devices. However, the identification of suitable aptamer targets for particular groups of organisms is challenging. We present a semi-automated processing pipeline for the identification of candidate aptamer targets from whole bacterial genome sequences. The pipeline can be configured to search for protein sequence fragments that uniquely identify a set of strains of interest. The system is also capable of identifying additional organisms that may be of interest due to their possession of protein fragments in common with the initial set. Through the use of Cloud computing technology and distributed databases, our system is capable of scaling with the rapidly growing genome repositories, and consequently of keeping the resulting data sets up-to-date. The system described is also more generically applicable to the discovery of specific targets for other diagnostic approaches such as DNA probes, PCR primers and antibodies.
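The core search, fragments present in every strain of interest and absent from all others, amounts to set operations over fixed-length peptide substrings. A simplified, in-memory sketch of that step (the real pipeline distributes this over Cloud resources; all names here are hypothetical):

```python
def candidate_targets(target_proteomes, background_proteomes, k=10):
    """Return length-k peptide fragments shared by all target proteomes
    and absent from every background proteome.

    Each proteome is a list of protein sequences (strings).
    """
    def kmers(seqs, k):
        out = set()
        for s in seqs:
            out.update(s[i:i + k] for i in range(len(s) - k + 1))
        return out

    shared = kmers(target_proteomes[0], k)
    for proteome in target_proteomes[1:]:
        shared &= kmers(proteome, k)        # keep fragments common to all targets
    for proteome in background_proteomes:
        shared -= kmers(proteome, k)        # drop anything seen elsewhere
    return shared
```

The same intersection-then-subtraction logic also yields the "additional organisms of interest": background genomes that remove many fragments from the shared set are those with fragments in common with the target strains.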
Biological systems are inherently stochastic, a fact which is often ignored when simulating genetic circuits. Synthetic biology aims to design genetic circuits de novo, and cannot therefore afford to ignore the effects of stochastic behaviour. Since computational design tools will be essential for large-scale synthetic biology, it is important to develop an understanding of the role of stochasticity in molecular biology, and to incorporate this understanding into computational tools for genetic circuit design. We report upon an investigation into the combination of evolutionary algorithms and stochastic simulation for genetic circuit design, to design regulatory systems based on the Bacillus subtilis sin operon.
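The stochastic simulations underlying such a design loop are typically exact SSA (Gillespie) runs, with the evolutionary algorithm scoring each candidate circuit's simulated behaviour. A minimal Gillespie sketch, shown on a generic birth-death gene-expression motif rather than the sin operon model itself (all names and rate values here are illustrative):

```python
import random

def gillespie(stoich, propensities, state, t_end, seed=0):
    """Exact stochastic simulation (Gillespie SSA) of a reaction network.

    stoich: list of state-change vectors, one per reaction.
    propensities: list of functions state -> reaction rate.
    Returns the state at time t_end.
    """
    rng = random.Random(seed)
    t = 0.0
    state = list(state)
    while True:
        rates = [p(state) for p in propensities]
        total = sum(rates)
        if total == 0:
            return state                  # no reaction can fire
        t += rng.expovariate(total)       # time to next reaction
        if t >= t_end:
            return state
        r = rng.random() * total          # pick a reaction by its rate
        for change, rate in zip(stoich, rates):
            r -= rate
            if r < 0:
                for i, dx in enumerate(change):
                    state[i] += dx
                break

# Birth-death of one protein: production at 10/s, degradation at 0.1/s per
# molecule; the stationary mean copy number is 10 / 0.1 = 100.
final = gillespie(stoich=[[1], [-1]],
                  propensities=[lambda s: 10.0, lambda s: 0.1 * s[0]],
                  state=[0], t_end=200.0, seed=1)
```

In a design loop, such a simulator would be run repeatedly per candidate parameter set, with the evolutionary algorithm selecting parameterisations whose stochastic trajectories best match the desired circuit behaviour.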