Despite almost 40 years of molecular genetics research in Escherichia coli, a major fraction of its transcription start sites (TSSs) is still unknown, which limits our understanding of the regulatory circuits that control gene expression in this model organism. RegulonDB (http://regulondb.ccg.unam.mx/) aims to integrate the genetic regulatory network of E. coli K12 and has until now been an entirely bioinformatic project. In this work, we extended its aims by generating experimental data at a genome scale on TSSs, promoters and regulatory regions. We implemented a modified 5' RACE protocol and an unbiased High Throughput Pyrosequencing Strategy (HTPS) that allowed us to map more than 1700 TSSs with high precision. Of this collection, about 230 corresponded to previously reported TSSs, which helped us to benchmark both our methodologies and the accuracy of the previous mapping experiments. The other ca. 1500 TSSs mapped belong to about 1000 different genes, many of them with no assigned function. We identified promoter sequences and the types of sigma factors that control the expression of about 80% of these genes. As expected, the housekeeping sigma(70) was the most common type of promoter, followed by sigma(38). The majority of the putative TSSs were located between 20 and 40 nucleotides from the translational start site. Putative regulatory binding sites for transcription factors were detected upstream of many TSSs. For a few transcripts, riboswitches and small RNAs were found. Several genes also had additional TSSs within the coding region. Unexpectedly, the HTPS experiments revealed extensive antisense transcription, probably for regulatory functions.
The new information in RegulonDB, now with more than 2400 experimentally determined TSSs, strengthens the accuracy of promoter prediction, operon structure, and regulatory networks, and provides valuable new information that will facilitate a global understanding of the complex and intricate regulatory network that operates in E. coli.
The evolution of bacterial pathogenicity, heavily influenced by horizontal gene transfer, provides new virulence factors and regulatory connections that alter bacterial phenotypes. Salmonella pathogenicity islands 1 and 2 (SPI-1 and SPI-2) are chromosomal regions that were acquired at different evolutionary times and are essential for Salmonella virulence. In the intestine of mammalian hosts, Salmonella expresses the SPI-1 genes that mediate its invasion of the gut epithelium. Once inside the cells, Salmonella down-regulates the SPI-1 genes and induces the expression of the SPI-2 genes, which favor its intracellular replication. The mechanism by which the invasion machinery is deactivated following successful invasion of host cells is not known. Here, we show that the SPI-2 encoded transcriptional regulator SsrB, which positively controls SPI-2, acts as a dual regulator that represses expression of SPI-1 during intracellular stages of infection. This SPI-1 repression by SsrB is direct and acts upon the hilD and hilA regulatory genes. The phenotypic effect of this molecular switch activity was a significant reduction in the invasion ability of S. enterica serovar Typhimurium, together with promotion of the expression of genes required for intracellular survival. During mouse infections, Salmonella mutants lacking SsrB had high levels of hilA (SPI-1) transcriptional activity, whereas introducing a constitutively active SsrB led to significant hilA repression. Thus, our results reveal a novel SsrB-mediated mechanism of transcriptional crosstalk between SPI-1 and SPI-2 that helps Salmonella transition to the intracellular lifestyle.
Post-genomic implementations have expanded the experimental strategies to identify elements involved in the regulation of transcription initiation. Here, we present for the first time a detailed analysis of the sources of knowledge supporting the collection of transcriptional regulatory interactions (RIs) of Escherichia coli K-12. An RI groups the transcription factor, its effect (positive or negative) and the regulated target (a promoter, a gene or a transcription unit). We improved the evidence codes so that specific methods are incorporated and classified into independent groups. On this basis we updated the computation of confidence levels (weak, strong, or confirmed) for the collection of RIs. These updates enabled us to map the RI set to the current collection of HT TF-binding datasets from ChIP-seq, ChIP-exo, gSELEX and DAP-seq in RegulonDB, enriching in this way the evidence of close to one-quarter (1329) of the current total of 5446 RIs. Based on the new computational capabilities of our improved annotation of evidence sources, we can now analyze the internal architecture of evidence, its categories (experimental, classical, HT, computational), and confidence levels. This is how we know that the joint contribution of HT and computational methods increases the overall fraction of reliable RIs (the sum of confirmed and strong evidence) from 49% to 71%. Thus, the current collection has 3912 reliable RIs, 2718 (70%) of them with classical evidence, which can be used to benchmark novel HT methods. Users can selectively exclude the method they want to benchmark, or keep, for instance, only the confirmed interactions. The recovery of regulatory sites in RegulonDB by the different HT methods ranges from 33% by ChIP-exo to 76% by ChIP-seq, although, as discussed, many potential confounding factors limit their interpretation. The collection of improvements reported here provides a solid foundation to incorporate new methods and data, and to further integrate the diverse sources of knowledge of the different components of the transcriptional regulatory network. There is no other genomic database that offers this comprehensive high-quality architecture of knowledge supporting a corpus of transcriptional regulatory interactions.
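The idea of deriving a confidence level from independent evidence groups can be illustrated with a minimal sketch. The rule used here is an assumption made for illustration only, not RegulonDB's actual rule set: an RI is "confirmed" when strong evidence comes from at least two independent groups, "strong" when it comes from one, and "weak" otherwise. The `Evidence` class, the method names and the group names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    method: str   # hypothetical method label, e.g. "footprinting"
    group: str    # hypothetical independence group the method belongs to
    strong: bool  # whether this single piece of evidence counts as strong


def confidence_level(evidences):
    """Assumed rule: count independent groups that carry strong evidence."""
    strong_groups = {e.group for e in evidences if e.strong}
    if len(strong_groups) >= 2:
        return "confirmed"
    if len(strong_groups) == 1:
        return "strong"
    return "weak"


def reliable_fraction(ris):
    """Fraction of RIs whose level is confirmed or strong ("reliable")."""
    levels = [confidence_level(evs) for evs in ris.values()]
    reliable = sum(level in ("confirmed", "strong") for level in levels)
    return reliable / len(levels) if levels else 0.0


# Toy data with made-up RI identifiers, not taken from RegulonDB.
toy_ris = {
    "RI_1": [Evidence("footprinting", "binding_in_vitro", True),
             Evidence("ChIP-seq", "binding_ht", True)],
    "RI_2": [Evidence("footprinting", "binding_in_vitro", True)],
    "RI_3": [Evidence("sequence_similarity", "computational", False)],
    "RI_4": [Evidence("expression_change", "expression", False)],
}
levels = {name: confidence_level(evs) for name, evs in toy_ris.items()}
fraction = reliable_fraction(toy_ris)
```

Excluding one method before calling `confidence_level` (for example, filtering out the hypothetical `binding_ht` group) mimics the benchmarking use described above, where the method under evaluation is removed from the supporting evidence.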
Our understanding of the regulation of gene expression has benefited from the availability of high-throughput technologies that interrogate the whole genome for the binding of specific transcription factors and gene expression profiles. In the case of widely used model organisms, such as Escherichia coli K-12, the new knowledge gained from these approaches needs to be integrated with the legacy of accumulated knowledge from genetic and molecular biology experiments conducted in the pre-genomic era in order to attain the deepest level of understanding possible based on the available data.
In this paper, we describe an expansion of RegulonDB, the database containing the rich legacy of decades of classic molecular biology experiments supporting what we know about gene regulation and operon organization in E. coli K-12, to include the genome-wide dataset collections from 32 ChIP and 19 gSELEX publications, in addition to around 60 genome-wide expression profiles relevant to the functional significance of these datasets and used in their curation. Three features are essential for integrating information that comes from different methodological approaches: first, a controlled vocabulary within an ontology for precisely defining growth conditions; second, the criteria to separate elements with enough evidence to consider them involved in gene regulation from isolated transcription factor binding sites without such support; and third, an expanded computational model supporting this knowledge. Altogether, this constitutes the basis for adequately gathering and enabling the comparisons and integration needed to manage and access such a wealth of knowledge.
This version 10.0 of RegulonDB is a first step toward what should become the unifying access point for current and future knowledge on gene regulation in E. coli K-12. Furthermore, this model platform and associated methodologies and criteria can be emulated for gathering knowledge on other microbial organisms.
The small RNAs CsrB and CsrC of Salmonella indirectly control the expression of numerous genes encoding widespread cellular functions, including virulence. The expression of csrB and csrC genes, which are located in different chromosomal regions, is coordinated by positive transcriptional control mediated by the two-component regulatory system BarA/SirA. Here, we identified by computational analysis an 18-bp inverted repeat (IR) sequence located far upstream from the promoter of Salmonella enterica serovar Typhimurium csrB and csrC genes. Deletion analysis and site-directed mutagenesis of the csrB and csrC regulatory regions revealed that this IR sequence is required for transcriptional activation of both genes. Protein-DNA and protein-protein interaction assays showed that the response regulator SirA specifically binds to the IR sequence and provided evidence that SirA acts as a dimer. Interestingly, whereas the IR sequence was essential for the SirA-mediated expression of csrB, our results revealed that SirA controls the expression of csrC not only by binding to the IR sequence but also by an indirect mode involving the Csr system. Additional computational, biochemical, and genetic analyses demonstrated that the integration host factor (IHF) global regulator positively controls the expression of csrB, but not of csrC, by interacting with a sequence located between the promoter and the SirA-binding site. These findings contribute to a better understanding of the regulatory mechanism controlling the expression of CsrB and CsrC.
The rich knowledge of operon organization in Escherichia coli, together with the completed chromosomal sequence of this bacterium, enabled us to perform an analysis of distances between genes and of functional relationships of adjacent genes in the same operon, as opposed to adjacent genes in different transcription units. We measured and demonstrated the expected tendency of genes within operons to have much shorter intergenic distances than genes at the borders of transcription units. A clear peak at short distances between genes in the same operon contrasts with a flat frequency distribution of genes at the borders of transcription units. Also, genes in the same operon tend to have the same physiological functional class. The results of these analyses were used to implement a method to predict the genomic organization of genes into transcription units. The method has a maximum accuracy of 88% in correctly identifying pairs of adjacent genes as being in an operon or at the borders of transcription units, and it correctly identifies around 75% of the known transcription units when used to predict the transcription unit organization of the E. coli genome. Based on the frequency distance distributions, we estimated a total of 630 to 700 operons in E. coli. This step opens the possibility of predicting operon organization in other bacteria whose genome sequences have been finished.
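A distance-based predictor of this kind can be sketched as a log-likelihood ratio over binned intergenic distances. This is a minimal illustration under stated assumptions, not the authors' implementation: the 10-bp bin width, the pseudocount, the zero decision threshold and the toy training distances are all hypothetical, and the functional-class signal mentioned above is left out.

```python
import math
from collections import Counter


def bin_index(distance, width=10):
    """Map an intergenic distance in bp (may be negative) to a bin."""
    return math.floor(distance / width)


def train_scorer(operon_distances, border_distances, width=10, pseudocount=1.0):
    """Return score(d) = log(P(bin | same operon) / P(bin | TU border))."""
    op = Counter(bin_index(d, width) for d in operon_distances)
    bd = Counter(bin_index(d, width) for d in border_distances)
    n_bins = len(set(op) | set(bd))
    op_total = len(operon_distances) + pseudocount * n_bins
    bd_total = len(border_distances) + pseudocount * n_bins

    def score(distance):
        b = bin_index(distance, width)
        p_op = (op.get(b, 0) + pseudocount) / op_total
        p_bd = (bd.get(b, 0) + pseudocount) / bd_total
        return math.log(p_op / p_bd)

    return score


def same_operon(score, distance):
    """Assumed decision rule: a positive log-likelihood ratio means same operon."""
    return score(distance) > 0


# Toy training sets (made-up distances in bp): short within operons,
# spread out at transcription unit borders.
operon_train = [-4, -1, 3, 8, 12, 15, 25, 40]
border_train = [30, 90, 150, 210, 280, 350, 420, 500]
score = train_scorer(operon_train, border_train)

close_pair = same_operon(score, 5)      # adjacent genes 5 bp apart
distant_pair = same_operon(score, 280)  # adjacent genes 280 bp apart
```

Working with a ratio of the two distributions, rather than a fixed distance cutoff, lets the peak at short distances and the flat border distribution both contribute to each decision.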
The transcriptional regulatory network of Escherichia coli K-12 is among the best studied gene networks of any living cell. Transcription factors bind to DNA either with their effector bound (holo conformation), or as a free protein (apo conformation) regulating transcription initiation. By using RegulonDB, the functional conformations (holo or apo) of transcription factors, and their mode of regulation (activator, repressor, or dual) were exhaustively analyzed. We report a striking discovery in the architecture of the regulatory network, finding a strong under-representation of the apo conformation (without allosteric metabolite) of transcription factors when binding to their DNA sites to activate transcription. This observation is supported at the level of individual regulatory interactions on promoters, even if we exclude the promoters regulated by global transcription factors, where three-quarters of the known promoters are regulated by a transcription factor in holo conformation. This genome-scale analysis enables us to ask what are the implications of these observations for the physiology and for our understanding of the ecology of E. coli. We discuss these ideas within the framework of the demand theory of gene regulation.
Manual curation of biological databases, an expensive and labor-intensive process, is essential for high quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using Text-Mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12.
Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually-curated RegulonDB, 45% of which we were able to recreate automatically. From our automated analysis we were also able to find some new interactions from papers not already curated, or that were missed in the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners.
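The evaluation described here amounts to comparing two sets of interactions. The sketch below shows one way to compute the recovered fraction and the candidate new interactions; representing an interaction as a (regulator, target, effect) triple is an assumption made for illustration, and the identifiers in the toy data are placeholders rather than real curated entries.

```python
def compare_networks(curated, extracted):
    """Return (recall against the curated set, interactions absent from it)."""
    curated, extracted = set(curated), set(extracted)
    recovered = curated & extracted
    recall = len(recovered) / len(curated) if curated else 0.0
    candidates = extracted - curated  # to be checked by a human curator
    return recall, candidates


# Placeholder triples: (regulator, target, effect).
curated_set = {
    ("TF_A", "gene_1", "+"),
    ("TF_A", "gene_2", "-"),
    ("TF_B", "gene_3", "+"),
    ("TF_C", "gene_4", "-"),
}
extracted_set = {
    ("TF_A", "gene_1", "+"),
    ("TF_B", "gene_3", "+"),
    ("TF_D", "gene_5", "+"),
}
recall, candidates = compare_networks(curated_set, extracted_set)
```

Interactions in `candidates` are not automatically correct; they are the ones that would be passed on for manual review.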
Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages.
The active and inactive states of transcription factors in growing cells are usually directed by allosteric physicochemical signals or metabolites, which are in turn either produced in the cell or obtained from the environment by the activity of the products of effector genes. To understand the regulatory dynamics and to improve our knowledge about how transcription factors (TFs) respond to endogenous and exogenous signals in the bacterial model, Escherichia coli, we previously proposed to classify TFs into external, internal and hybrid sensing classes depending on the source of their allosteric or equivalent metabolite. Here we analyze how a cell uses its topological structures in the context of sensing machinery and show that, while feed forward loops (FFLs) tightly integrate internal and external sensing TFs, connecting TFs from different layers of the hierarchical transcriptional regulatory network (TRN), bifan motifs frequently connect TFs belonging to the same sensing class and could act as a bridge between TFs originating from the same level in the hierarchy. We observe that modules identified in the regulatory network of E. coli are heterogeneous in sensing context, with a clear combination of internal and external sensing categories depending on the physiological role played by the module. We also note that the propensity of two-component response regulators at promoters increases as the number of TFs regulating a target operon increases. Finally, we show that evolutionary families of TFs do not show a tendency to preserve their sensing abilities. Our results provide a detailed panorama of the topological structures of the E. coli TRN and of the way the TFs composing them sense their surroundings by coordinating responses.
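The two motifs discussed here can be enumerated directly from a list of regulatory edges. The following is a minimal sketch, assuming the network is given as (regulator, target) pairs, that autoregulation is ignored, and that each TF carries a sensing-class label; the node names and class assignments in the toy network are hypothetical.

```python
from itertools import combinations


def adjacency(edges):
    """Build regulator -> set of targets, dropping self-regulation."""
    targets = {}
    for src, dst in edges:
        if src != dst:
            targets.setdefault(src, set()).add(dst)
    return targets


def feed_forward_loops(edges):
    """All (x, y, z) with x -> y, y -> z and x -> z."""
    targets = adjacency(edges)
    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            for z in targets.get(y, set()) & x_targets:
                loops.append((x, y, z))
    return sorted(loops)


def bifans(edges):
    """All (a, b, z1, z2) where regulators a and b both regulate z1 and z2."""
    targets = adjacency(edges)
    found = []
    for a, b in combinations(sorted(targets), 2):
        shared = sorted((targets[a] & targets[b]) - {a, b})
        for z1, z2 in combinations(shared, 2):
            found.append((a, b, z1, z2))
    return found


def same_class(motif_regulators, sensing_class):
    """True when all listed regulators share one sensing class."""
    return len({sensing_class[tf] for tf in motif_regulators}) == 1


# Toy network with placeholder names and made-up sensing classes.
toy_edges = [("TF_A", "TF_B"), ("TF_A", "g1"), ("TF_B", "g1"),
             ("TF_A", "g2"), ("TF_B", "g2"), ("TF_B", "TF_B")]
toy_classes = {"TF_A": "external", "TF_B": "internal"}

ffls = feed_forward_loops(toy_edges)
bf = bifans(toy_edges)
mixed_ffl = not same_class(ffls[0][:2], toy_classes)
```

Tallying `same_class` over all FFLs and all bifans of a real network would give the kind of comparison reported above, although a significance assessment would also need a suitable null model, which this sketch does not include.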