Data storage in DNA has recently emerged as a promising archival solution, offering space-efficient and long-lasting digital storage. Recent studies suggest leveraging the inherent redundancy of synthesis and sequencing technologies by using composite DNA alphabets. A major challenge of this approach is the noisy inference process, which limits the use of large composite alphabets. This paper introduces a novel approach for DNA-based data storage, offering, in some implementations, a 6.5-fold increase in logical density over standard DNA-based storage systems, with near-zero reconstruction error. Combinatorial DNA encoding uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter consists of a subset of shortmers. We formally define various combinatorial encoding schemes and investigate their theoretical properties, including information density, reconstruction probabilities, and the required synthesis and sequencing multiplicities. We then propose an end-to-end design for a combinatorial DNA-based data storage system, including encoding schemes, two-dimensional (2D) error correction codes, and reconstruction algorithms, under different error regimes. Our simulations show, for example, that 2D Reed-Solomon error correction significantly improves reconstruction rates. We validated our approach by constructing two combinatorial sequences using Gibson assembly, imitating a 4-cycle combinatorial synthesis process. We confirmed successful reconstruction and established the robustness of our approach for different error types. Subsampling experiments highlighted the important role of the sampling rate in overall performance. Our work demonstrates the potential of combinatorial shortmer encoding for DNA-based data storage and outlines open theoretical research questions and technical challenges. Combining combinatorial principles with error-correcting strategies, and investing in the development of DNA synthesis technologies that efficiently support combinatorial synthesis, can pave the way to efficient, error-resilient DNA-based storage solutions.
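To make the density argument concrete, below is a minimal Python sketch of one way a combinatorial letter could be defined: assuming an alphabet of n clearly distinguishable shortmers and letters that are k-subsets of them, each letter carries log2(C(n, k)) bits. The parameters (n = 16, k = 5) and the ranking scheme (the combinatorial number system) are illustrative assumptions, not the encoding scheme used in the paper.

from itertools import combinations
from math import comb, log2

# Hypothetical parameters: n distinguishable shortmers; each combinatorial
# letter is a k-subset of them, so the alphabet has C(n, k) letters.
N_SHORTMERS = 16
SUBSET_SIZE = 5

def letter_capacity_bits(n: int, k: int) -> float:
    """Bits of information carried by one combinatorial letter."""
    return log2(comb(n, k))

def unrank_subset(rank: int, n: int, k: int) -> tuple[int, ...]:
    """Map an integer in [0, C(n, k)) to a k-subset of {0, ..., n-1}
    using the combinatorial number system."""
    subset = []
    x = rank
    for i in range(k, 0, -1):
        # Find the largest c such that C(c, i) <= x.
        c = i - 1
        while comb(c + 1, i) <= x:
            c += 1
        subset.append(c)
        x -= comb(c, i)
    return tuple(sorted(subset))

if __name__ == "__main__":
    bits = letter_capacity_bits(N_SHORTMERS, SUBSET_SIZE)
    print(f"One letter carries {bits:.2f} bits")  # ~12.09 bits, since C(16, 5) = 4368
    # Encode the value 1234 as a single combinatorial letter (a 5-subset of {0..15}):
    print(unrank_subset(1234, N_SHORTMERS, SUBSET_SIZE))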
Controlling off-target editing activity is one of the central challenges in making CRISPR technology accurate and applicable in medical practice. Current algorithms for analyzing off-target activity do not provide statistical quantification, are not sufficiently sensitive in separating signal from noise in experiments with low editing rates, and do not address the detection of translocations. Here we present CRISPECTOR, a software tool that supports the detection and quantification of on- and off-target genome-editing activity from NGS data using paired treatment/control CRISPR experiments. In particular, CRISPECTOR facilitates the statistical analysis of NGS data from multiplex-PCR comparative experiments to detect and quantify adverse translocation events. We validated the observed results and show independent evidence of the occurrence of translocations in human cell lines following genome editing. Our methodology is based on a statistical model comparison approach, leading to lower false-negative rates at sites with weak yet significant off-target activity.
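As a deliberately simplified illustration of why paired treatment/control experiments help separate weak signal from background noise, the following Python sketch compares edit-carrying read counts at a single site using Fisher's exact test. This is not CRISPECTOR's statistical model (which is a model-comparison approach applied per site and per modification type); the function and its parameters are hypothetical.

from scipy.stats import fisher_exact

def site_editing_signal(treat_edited: int, treat_total: int,
                        ctrl_edited: int, ctrl_total: int) -> tuple[float, float]:
    """Compare the fraction of edit-carrying reads at one genomic site
    between a CRISPR-treated sample and its paired mock/control sample.

    Returns (excess_edit_rate, p_value). Illustrates only why a paired
    control separates true editing from sequencing/PCR background noise.
    """
    table = [[treat_edited, treat_total - treat_edited],
             [ctrl_edited, ctrl_total - ctrl_edited]]
    _, p_value = fisher_exact(table, alternative="greater")
    excess = treat_edited / treat_total - ctrl_edited / ctrl_total
    return excess, p_value

# Example: a weak but real off-target site (0.5% edited vs. 0.05% background).
print(site_editing_signal(treat_edited=50, treat_total=10_000,
                          ctrl_edited=5, ctrl_total=10_000))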
Animals are grouped into ~35 'phyla' based upon the notion of distinct body plans. Morphological and molecular analyses have revealed that a stage in the middle of development--known as the phylotypic period--is conserved among species within some phyla. Although these analyses provide evidence for their existence, phyla have also been criticized as lacking an objective definition and, consequently, as being based on arbitrary groupings of animals. Here we compare the developmental transcriptomes of ten species, each annotated to a different phylum, with a wide range of life histories and embryonic forms. We find that in all ten species, development comprises the coupling of early and late phases of conserved gene expression. These phases are linked by a divergent 'mid-developmental transition' that uses species-specific suites of signalling pathways and transcription factors. This mid-developmental transition overlaps with the phylotypic period that has been defined previously for three of the ten phyla, suggesting that transcriptional circuits and signalling mechanisms active during this transition are crucial for defining the phyletic body plan and that the mid-developmental transition may be used to define phylotypic periods in other phyla. Placing these observations alongside the reported conservation of mid-development within phyla, we propose that a phylum may be defined as a collection of species whose gene expression at the mid-developmental transition is both highly conserved among them, yet divergent relative to other species.
We use an oligonucleotide library of >10,000 variants to identify an insulation mechanism encoded within a subset of σ54 promoters. Insulation manifests itself as reduced protein expression for a downstream gene that is expressed by transcriptional readthrough. It is strongly associated with the presence of short CT-rich motifs (3–5 bp), positioned within 25 bp upstream of the Shine-Dalgarno (SD) motif of the silenced gene. We provide evidence that insulation is triggered by binding of the ribosome binding site (RBS) to the upstream CT-rich motif. We also show that, in E. coli, insulator sequences are preferentially encoded within σ54 promoters, suggesting an important regulatory role for these sequences in natural contexts. Our findings imply that sequence-specific regulatory effects that are sparsely encoded by short motifs may not be easily detected by lower throughput studies. Such sequence-specific phenomena can be uncovered with a focused oligo library (OL) design that mitigates sequence-related variance, as exemplified herein.
•Short CT-rich motifs (3–5 bases) are responsible for the insulation effect
•Insulation strength depends on the location and number of the insulator motifs
•The insulator motifs are abundant within σ54 promoters in E. coli
Levy et al. identify a gene insulation phenomenon encoded within a subset of σ54 promoters in E. coli. The authors use an oligo library, sequencing, bioinformatics analysis, and a synthetic biology approach to show that a short CT-rich motif (3–5 bp) is responsible for the insulation phenomenon.
Single-cell transcriptomics requires a method that is sensitive, accurate, and reproducible. Here, we present CEL-Seq2, a modified version of our CEL-Seq method, with threefold higher sensitivity, lower costs, and less hands-on time. We implemented CEL-Seq2 on Fluidigm's C1 system, providing its first single-cell, on-chip barcoding method, and we detected gene expression changes accompanying the progression through the cell cycle in mouse fibroblast cells. We also compare CEL-Seq2 with Smart-Seq, demonstrating its increased sensitivity relative to other available methods. Collectively, the improvements make CEL-Seq2 uniquely suited to single-cell RNA-Seq analysis in terms of economics, resolution, and ease of use.
Abstract
Motivation
Recent years have seen a growing number and an expanding scope of studies using synthetic oligo libraries for a range of applications in synthetic biology. As experiments grow in number and complexity, analysis tools can facilitate quality control and support better assessment and inference.
Results
We present a novel analysis tool, called SOLQC, which enables fast and comprehensive analysis of synthetic oligo libraries, based on NGS analysis performed by the user. SOLQC provides statistical information such as the distribution of variant representation, different error rates, and their dependence on sequence or library properties. SOLQC produces graphical reports from the analysis in a flexible format. We demonstrate SOLQC by analyzing libraries from the literature. We also discuss the potential benefits and relevance of the different components of the analysis.
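As an illustration of the kind of statistics mentioned above, the following Python sketch computes a variant-representation distribution and a crude substitution-only error rate from user-supplied read-to-variant matches. The helper names are hypothetical and do not reflect SOLQC's actual interface.

from collections import Counter

def variant_representation(read_to_variant: list[str]) -> Counter:
    """Count how many reads map to each designed variant.

    `read_to_variant` holds one entry per NGS read: the identifier of the
    designed oligo it was matched to (matching is done upstream by the
    user, e.g. by alignment). Hypothetical helper, not SOLQC's API.
    """
    return Counter(read_to_variant)

def per_base_error_rate(read: str, design: str) -> float:
    """Crude substitution-only error rate of one read against its design
    (a real analysis must also handle insertions and deletions)."""
    assert len(read) == len(design)
    mismatches = sum(r != d for r, d in zip(read, design))
    return mismatches / len(design)

counts = variant_representation(["v1", "v2", "v1", "v1", "v3"])
print(counts.most_common())                  # variant representation distribution
print(per_base_error_rate("ACGT", "ACGA"))   # 0.25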
Availability and implementation
SOLQC is free software for non-commercial use, available at https://app.gitbook.com/@yoav-orlev/s/solqc/. For commercial use, please contact the authors.
Supplementary information
Supplementary data are available at Bioinformatics online.
Using synthetic DNA for data storage and for physical information encoding in labeling, tracing, and authentication applications is becoming more feasible as synthesis and reading technologies are improving. DNA in data storage applications has several advantages, such as very high physical density and robustness. Some of the new synthesis technologies lead to repetition noise, consisting of sticky insertions and deletions in the resulting messages. In this paper, we address reconstruction algorithms for multiple trace communication channels with repetition (sticky insertion and deletion) noise. We prove correctness and analyze failure rates, both analytically and on simulated data. We identify a failure mechanism related to alternating stretches in the design sequence that leads to a potential bias in the data derived from reads (traces) and used for reconstruction. To minimize this effect, we introduce alternating length limited codes (ALL codes) and analyze some of their properties.
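To illustrate the setting, here is a minimal Python reconstruction sketch under the assumption of pure repetition noise: sticky insertions and deletions change run lengths but never the sequence of run symbols, so the design can be estimated run by run from the median observed run length across traces. This is a simplified illustration, not the reconstruction algorithm analyzed in the paper.

from statistics import median

def runs(s: str) -> list[tuple[str, int]]:
    """Run-length encode a string: 'AAACC' -> [('A', 3), ('C', 2)]."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def reconstruct(traces: list[str]) -> str:
    """Estimate the design sequence from noisy traces, assuming pure
    repetition (sticky) noise: runs may lengthen or shorten but never
    vanish or change symbol, so all traces share one run-symbol sequence.
    Each design run length is estimated as the median observed length."""
    encoded = [runs(t) for t in traces]
    assert all(len(e) == len(encoded[0]) for e in encoded), "run structure differs"
    return "".join(
        sym * round(median(e[i][1] for e in encoded))
        for i, (sym, _) in enumerate(encoded[0])
    )

print(reconstruct(["AACGGT", "AAACGT", "AACGGT"]))  # -> 'AACGGT'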
This study introduces a novel model for analyzing and determining the required sequencing coverage in DNA-based data storage, focusing on combinatorial DNA encoding. We seek to characterize the distribution of the number of sequencing reads required for message reconstruction, using a variant of the coupon collector distribution. For any given number of observed reads, $R \in \mathbb{N}$, we use a Markov chain representation of the process to compute the probability of error-free reconstruction. We develop theoretical bounds on the decoding probability and use empirical simulations to validate these bounds and assess their tightness. This work contributes to understanding sequencing coverage in DNA-based data storage, offering insights into decoding complexity, error correction, and sequence reconstruction. We provide a Python package whose input is the code design and other message parameters, jointly denoted $\boldsymbol{\Theta}$, together with a desired confidence level $1-\delta$; the package computes the required read coverage $R = R(\delta, \boldsymbol{\Theta})$ that guarantees message reconstruction at that confidence level.
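For intuition, the following Python sketch computes the classical coupon-collector quantity with the same Markov-chain mechanics: the state is the number of distinct items observed so far, and iterating the chain for R steps yields the probability of complete observation. The paper's model is a tailored variant of this (combinatorial letters, non-uniform reads, error handling); this uniform-draws version is illustrative only.

def _step(dist: list[float], n: int) -> list[float]:
    """Advance the chain by one read: with prob seen/n the read repeats an
    already-seen item, with prob (n - seen)/n it reveals a new one."""
    new = [0.0] * (n + 1)
    for seen, p in enumerate(dist):
        if p:
            new[seen] += p * seen / n
            if seen < n:
                new[seen + 1] += p * (n - seen) / n
    return new

def prob_complete(n: int, R: int) -> float:
    """P(all n items observed after R uniform reads)."""
    dist = [1.0] + [0.0] * n
    for _ in range(R):
        dist = _step(dist, n)
    return dist[n]

def required_reads(n: int, delta: float) -> int:
    """Smallest R with prob_complete(n, R) >= 1 - delta."""
    dist, R = [1.0] + [0.0] * n, 0
    while dist[n] < 1 - delta:
        dist, R = _step(dist, n), R + 1
    return R

print(prob_complete(10, 50))      # ~0.95 for n = 10
print(required_reads(100, 0.01))  # ~921, roughly n*(ln n + ln(1/delta))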
Recent developments in personalized medicine are based on molecular measurement steps that guide personally adjusted medical decisions. A central approach to molecular profiling consists of measuring DNA, RNA, and/or proteins in tissue samples, most notably in and around tumors. This measurement yields molecular biomarkers that are potentially predictive of response and of tumor type. Current methods in cancer therapy mostly use tissue biopsy as the starting point of molecular profiling. Tissue biopsies involve a physical resection of a small tissue sample, leading to localized tissue injury, bleeding, inflammation and stress, as well as to an increased risk of metastasis. Here we developed a technology for harvesting biomolecules from tissues using electroporation. We show that tissue electroporation, achieved using a combination of high-voltage short pulses (50 pulses, 500 V cm⁻¹, 30 µs, 1 Hz) and low-voltage long pulses (50 pulses, 50 V cm⁻¹, 10 ms, delivered at 1 Hz), allows for tissue-specific extraction of RNA and proteins. We specifically tested RNA and protein extraction from excised kidney and liver samples and from excised HepG2 tumors in mice. Further in vivo development of extraction methods based on electroporation can drive novel approaches to the molecular profiling of tumors and of the tumor environment and to related diagnosis practices.
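For reference, the two-stage pulse protocol reported above can be captured in a small Python configuration structure; the field names are illustrative, while the values are exactly those stated in the text.

from dataclasses import dataclass

@dataclass(frozen=True)
class PulseTrain:
    """One electroporation pulse train (parameters as reported above)."""
    n_pulses: int
    field_v_per_cm: float
    pulse_duration_s: float
    frequency_hz: float

# High-voltage short pulses followed by low-voltage long pulses.
HV_SHORT = PulseTrain(n_pulses=50, field_v_per_cm=500, pulse_duration_s=30e-6, frequency_hz=1)
LV_LONG = PulseTrain(n_pulses=50, field_v_per_cm=50, pulse_duration_s=10e-3, frequency_hz=1)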
Visual-Near-Infra-Red (VIS/NIR) spectroscopy has led the revolution in high-throughput phenotyping methods used to determine chemical and structural elements of organic materials. In the current state of the art, spectrophotometers used for imaging techniques are either very expensive or too large to be used as a field-operable device. In this study we developed a Sparse NIR Optimization method (SNIRO) that selects a pre-determined number of wavelengths that enable quantification of analytes in a given sample using linear regression. We compared the computational complexity and accuracy of SNIRO with those of Marten's test, forward selection, and LASSO, all applied to the determination of protein content in corn flour and meat and of octane number in diesel, using publicly available datasets. In addition, for the first time, we determined the glucose content in the green seaweed Ulva sp., an important feedstock for marine biorefinery. The SNIRO approach can be used as a first step in designing a spectrophotometer that scans a small number of specific spectral regions, thus potentially decreasing production costs and scanner size and enabling the development of field-operable devices for content analysis of complex organic materials.
•A Sparse NIR Optimization method (SNIRO) for selecting a given number of significant wavelengths from spectra was developed.
•The computational complexity and accuracy of SNIRO were compared to Marten's test, forward selection, and LASSO.
•SNIRO was used to determine protein content in corn flour and meat, and octane number in diesel, using public NIR datasets.
•SNIRO was used to determine the glucose content in the green seaweed Ulva sp.
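Since SNIRO's optimization procedure is not detailed in this abstract, the following Python sketch implements plain greedy forward selection (one of the baselines mentioned above): it picks k wavelength columns that minimize the ordinary-least-squares residual. The function name and the synthetic data are illustrative assumptions.

import numpy as np

def greedy_wavelength_selection(X: np.ndarray, y: np.ndarray, k: int) -> list[int]:
    """Greedily pick k wavelength columns of the spectra matrix X that best
    predict the analyte concentrations y by ordinary least squares.

    X: (n_samples, n_wavelengths), y: (n_samples,). A forward-selection
    sketch, not SNIRO's actual optimization.
    """
    selected: list[int] = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = X[:, selected + [j]]
            # Least-squares fit on the candidate wavelength subset.
            coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
            err = np.linalg.norm(cols @ coef - y)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))        # 40 samples, 200 wavelengths
y = 2.0 * X[:, 17] - 0.5 * X[:, 93]   # signal lives at two wavelengths
print(greedy_wavelength_selection(X, y, k=2))  # expect [17, 93] (either order)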