Making error-free, custom DNA assemblies from potentially faulty building blocks is a fundamental challenge in synthetic biology. Here, we show how recursion can be used to address this challenge ...using a recursive procedure that constructs error-free DNA molecules and their libraries from error-prone synthetic oligonucleotides and naturally existing DNA. Specifically, we describe how divide and conquer (D&C), the quintessential recursive problem-solving technique, is applied in silico to divide target DNA sequences into overlapping, albeit error-prone, oligonucleotides, and how recursive construction is applied in vitro to combine them to form error-prone DNA molecules. To correct DNA sequence errors, error-free fragments of these molecules are then identified, extracted, and used as new, typically longer and more accurate, inputs to another iteration of the recursive construction procedure; the entire process repeats until an error-free target molecule is formed. The method allows combining synthetic and natural DNA fragments into error-free designer DNA libraries, thus providing a foundation for the design and construction of complex synthetic DNA assemblies.
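The in silico division step described above can be illustrated with a minimal sketch. It is not the authors' procedure: the function names `divide` and `reassemble`, the halving rule, and the parameters `max_len` and `overlap` are all assumptions chosen for illustration, and no error correction is modeled.

```python
def divide(seq, max_len=20, overlap=6):
    """Recursively split seq into overlapping fragments of length <= max_len.

    Hypothetical illustration of divide and conquer: the split point and the
    overlap size are assumptions, not values taken from the abstract.
    """
    if max_len <= 2 * overlap:
        # Guarantees every recursive call receives a strictly shorter string.
        raise ValueError("max_len must exceed twice the overlap")
    if len(seq) <= max_len:
        return [seq]
    mid = len(seq) // 2
    half = overlap // 2
    left = seq[: mid + half]                # extends past the midpoint
    right = seq[mid - (overlap - half):]    # starts before the midpoint
    return divide(left, max_len, overlap) + divide(right, max_len, overlap)


def reassemble(fragments, overlap=6):
    """Join fragments in order, dropping the shared overlap of each successor."""
    result = fragments[0]
    for frag in fragments[1:]:
        result += frag[overlap:]
    return result


if __name__ == "__main__":
    target = "ACGT" * 12 + "GA"
    pieces = divide(target)
    print(pieces)
    print(reassemble(pieces) == target)
```

Adjacent fragments share exactly `overlap` characters, so concatenating them while skipping that shared prefix recovers the target sequence.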
Entity Matching is an essential part of all real-world systems that take in structured and unstructured data coming from different sources. Typically, no common key is available for connecting ...records. Massive data cleaning and integration processes require completion before any data analytics or further processing can be performed. Although record linkage is frequently regarded as a somewhat tedious but necessary step, it reveals valuable insights, supports data visualization, and guides further analytic approaches to the data. Here, we focus on organization entity matching. We introduce CompanyName2Vec, a novel algorithm to solve company entity matching (CEM) using a neural network model to learn company name semantics from a job ad corpus, without relying on any information on the matched company besides its name. Based on real-world data, we show that CompanyName2Vec outperforms other evaluated methods and solves the CEM challenge with an average success rate of 89.3%.
This work considers the problem of reconstructing a phylogenetic tree from triplet-dissimilarities, which are dissimilarities defined over taxon-triplets. Triplet-dissimilarities are possibly the ...simplest generalization of pairwise dissimilarities, and were used for phylogenetic reconstructions in the past few years. We study the hardness of finding a tree best fitting a given triplet-dissimilarity table under the $\ell_\infty$ norm. We show that the corresponding decision problem is NP-hard and that the corresponding optimization problem cannot be approximated in polynomial time within a constant multiplicative factor smaller than 1.4. On the positive side, we present a polynomial time constant-rate approximation algorithm for this problem. We also address the issue of best-fit under maximal distortion, which corresponds to the largest ratio between matching entries in two triplet-dissimilarity tables. We show that it is NP-hard to approximate the corresponding optimization problem within any constant multiplicative factor.
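The two fit criteria named in this abstract can be made concrete with a small sketch. The table representation (a dict keyed by a `frozenset` of three taxa), the function names, and the symmetric form of the ratio are assumptions made for illustration; the paper's exact definition of maximal distortion may differ.

```python
def linf_distance(table_a, table_b):
    """Largest absolute difference between matching entries of two tables."""
    if table_a.keys() != table_b.keys():
        raise ValueError("tables must be defined over the same triplets")
    return max(abs(table_a[t] - table_b[t]) for t in table_a)


def max_ratio(table_a, table_b):
    """Largest ratio between matching entries, taken in either direction.

    Assumes all entries are strictly positive. Taking the larger of a/b and
    b/a is an illustrative choice, not necessarily the paper's definition.
    """
    if table_a.keys() != table_b.keys():
        raise ValueError("tables must be defined over the same triplets")
    return max(
        max(table_a[t] / table_b[t], table_b[t] / table_a[t]) for t in table_a
    )


if __name__ == "__main__":
    # Made-up triplet-dissimilarities over four taxa a, b, c, d.
    observed = {
        frozenset("abc"): 2.0,
        frozenset("abd"): 3.0,
        frozenset("acd"): 4.0,
        frozenset("bcd"): 5.0,
    }
    fitted = {
        frozenset("abc"): 2.5,
        frozenset("abd"): 3.0,
        frozenset("acd"): 2.0,
        frozenset("bcd"): 5.5,
    }
    print(linf_distance(observed, fitted))
    print(max_ratio(observed, fitted))
```

The first measure penalizes the worst additive error, the second the worst multiplicative error, which is why the two criteria can prefer different trees for the same input table.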
Reconstructing phylogenetic trees efficiently and accurately from distance estimates is an ongoing challenge in computational biology from both practical and theoretical considerations. We study ...algorithms which are based on a characterization of edge-weighted trees by distances to LCAs (Least Common Ancestors). This characterization enables a direct application of ultrametric reconstruction techniques to trees which are not necessarily ultrametric. A simple and natural neighbor joining criterion based on this observation is used to provide a family of efficient neighbor-joining algorithms. These algorithms are shown to reconstruct a refinement of the Buneman tree, which implies optimal robustness to noise under criteria defined by Atteson. In this sense, they outperform many popular algorithms such as Saitou and Nei's NJ. One member of this family is used to provide a new simple version of the 3-approximation algorithm for the closest additive metric under the $\ell_\infty$ norm. A byproduct of our work is a novel technique which yields a time-optimal $O(n^2)$ implementation of common clustering algorithms such as UPGMA.
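For readers unfamiliar with UPGMA, the following is a minimal sketch of the textbook agglomerative procedure. It is the naive version, which rescans all cluster pairs at every step, and it is not the $O(n^2)$ technique the abstract refers to. The function name `upgma`, the distance representation, and the nested-tuple output are assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Naive UPGMA clustering.

    names: list of taxon labels.
    dist:  dict mapping frozenset({x, y}) to the distance between x and y.
    Returns the tree topology as nested 2-tuples (branch lengths omitted).
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of taxa
    d = dict(dist)                       # working copy, extended on each merge
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


if __name__ == "__main__":
    # Made-up ultrametric distances over four taxa.
    taxa = ["A", "B", "C", "D"]
    distances = {
        frozenset(("A", "B")): 2.0,
        frozenset(("C", "D")): 4.0,
        frozenset(("A", "C")): 8.0,
        frozenset(("A", "D")): 8.0,
        frozenset(("B", "C")): 8.0,
        frozenset(("B", "D")): 8.0,
    }
    print(upgma(taxa, distances))
```

The repeated minimum search over all pairs is what makes this version cubic in the number of taxa; avoiding that rescan is the kind of saving a quadratic implementation has to achieve.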
To identify genetic changes underlying dog domestication and reconstruct their early evolutionary history, we generated high-quality genome sequences from three gray wolves, one from each of the ...three putative centers of dog domestication, two basal dog lineages (Basenji and Dingo) and a golden jackal as an outgroup. Analysis of these sequences supports a demographic model in which dogs and wolves diverged through a dynamic process involving population bottlenecks in both lineages and post-divergence gene flow. In dogs, the domestication bottleneck involved at least a 16-fold reduction in population size, a much more severe bottleneck than estimated previously. A sharp bottleneck in wolves occurred soon after their divergence from dogs, implying that the pool of diversity from which dogs arose was substantially larger than represented by modern wolf populations. We narrow the plausible range for the date of initial dog domestication to an interval spanning 11-16 thousand years ago, predating the rise of agriculture. In light of this finding, we expand upon previous work regarding the increase in copy number of the amylase gene (AMY2B) in dogs, which is believed to have aided digestion of starch in agricultural refuse. We find standing variation for amylase copy number variation in wolves and little or no copy number increase in the Dingo and Husky lineages. In conjunction with the estimated timing of dog origins, these results provide additional support to archaeological finds, suggesting the earliest dogs arose alongside hunter-gatherers rather than agriculturists. Regarding the geographic origin of dogs, we find that, surprisingly, none of the extant wolf lineages from putative domestication centers is more closely related to dogs, and, instead, the sampled wolves form a sister monophyletic clade. This result, in combination with dog-wolf admixture during the process of domestication, suggests that a re-evaluation of past hypotheses regarding dog origins is necessary.