Motivation: Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated ...splitting and merging of subalignments to obtain an improved final alignment. In general this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment. Results: We study six stages in the form-and-polish strategy for multiple alignment: parameter choice, distance estimation, merge-tree construction, sequence-pair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3-cuts. Experiments with a parameter-value oracle suggest large gains in quality may be possible through an input-dependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that on benchmark alignments matches the quality of the top tools, without employing alignment consistency or hydrophobic gap penalties. Availability: Opal, a multiple alignment tool that implements the best methods in our study, is freely available at http://opal.cs.arizona.edu Contact: twheeler@cs.arizona.edu
Cell signaling pathways, which are a series of reactions that start at receptors and end at transcription factors, are basic to systems biology. Properly modeling the reactions in such pathways ...requires directed hypergraphs, where an edge is now directed between two sets of vertices. Inferring a pathway by the most parsimonious series of reactions corresponds to finding a shortest hyperpath in a directed hypergraph, which is NP-complete. The current state-of-the-art for shortest hyperpaths in cell signaling hypergraphs solves a mixed-integer linear program to find an optimal hyperpath that is restricted to be acyclic, and offers no efficiency guarantees.
We present, for the first time, a heuristic for general shortest hyperpaths that properly handles cycles, and is guaranteed to be efficient. We show the heuristic finds provably optimal hyperpaths for the class of singleton-tail hypergraphs, and also give a practical algorithm for tractably generating all source-sink hyperpaths. The accuracy of the heuristic is demonstrated through comprehensive experiments on all source-sink instances from the standard NCI-PID and Reactome pathway databases, which show it finds a hyperpath that matches the state-of-the-art mixed-integer linear program on over 99% of all instances that are acyclic. On instances where only cyclic hyperpaths exist, the heuristic surpasses the state-of-the-art, which finds no solution; on every such cyclic instance, enumerating all source-sink hyperpaths shows the solution found by the heuristic was in fact optimal.
The new shortest hyperpath heuristic is both fast and accurate. This makes finding source-sink hyperpaths, which in general may contain cycles, now practical for real cell signaling networks.
Source code for the hyperpath heuristic in a new tool we call Hhugin (as well as for hyperpath enumeration, and all dataset instances) is available free for non-commercial use at http://hhugin.cs.arizona.edu.
The unprecedented volume and rate of transient events that will be discovered by the Large Synoptic Survey Telescope (LSST) demand that the astronomical community update its follow-up paradigm. ...Alert-brokers-automated software system to sift through, characterize, annotate, and prioritize events for follow-up-will be critical tools for managing alert streams in the LSST era. The Arizona-NOAO Temporal Analysis and Response to Events System (ANTARES) is one such broker. In this work, we develop a machine learning pipeline to characterize and classify variable and transient sources only using the available multiband optical photometry. We describe three illustrative stages of the pipeline, serving the three goals of early, intermediate, and retrospective classification of alerts. The first takes the form of variable versus transient categorization, the second a multiclass typing of the combined variable and transient data set, and the third a purity-driven subtyping of a transient class. Although several similar algorithms have proven themselves in simulations, we validate their performance on real observations for the first time. We quantitatively evaluate our pipeline on sparse, unevenly sampled, heteroskedastic data from various existing observational campaigns, and demonstrate very competitive classification performance. We describe our progress toward adapting the pipeline developed in this work into a real-time broker working on live alert streams from time-domain surveys.
Abstract
We present here the design, architecture, and first data release for the Solar System Notification Alert Processing System (SNAPS). SNAPS is a solar system broker that ingests alert data ...from all-sky surveys. At present, we ingest data from the Zwicky Transient Facility (ZTF) public survey, and we will ingest data from the forthcoming Legacy Survey of Space and Time (LSST) when it comes online. SNAPS is an official LSST downstream broker. In this paper we present the SNAPS design goals and requirements. We describe the details of our automatic pipeline processing in which the physical properties of asteroids are derived. We present SNAPShot1, our first data release, which contains 5,458,459 observations of 31,693 asteroids observed by ZTF from 2018 July to 2020 May. By comparing a number of derived properties for this ensemble to previously published results for overlapping objects we show that our automatic processing is highly reliable. We present a short list of science results, among many that will be enabled by our SNAPS catalog: (1) we demonstrate that there are no known asteroids with very short periods and high amplitudes, which clearly indicates that in general asteroids in the size range 0.3–20 km are strengthless; (2) we find no difference in the period distributions of Jupiter Trojan asteroids, implying that the L4 and L5 clouds have different shape distributions; and (3) we highlight several individual asteroids of interest. Finally, we describe future work for SNAPS and our ability to operate at LSST scale.
Abstract
We describe the Arizona-NOIRLab Temporal Analysis and Response to Events System (ANTARES), a software instrument designed to process large-scale streams of astronomical time-domain alerts. ...With the advent of large-format CCDs on wide-field imaging telescopes, time-domain surveys now routinely discover tens of thousands of new events each night, more than can be evaluated by astronomers alone. The ANTARES event broker will process alerts, annotating them with catalog associations and filtering them to distinguish customizable subsets of events. We describe the data model of the system, the overall architecture, annotation, implementation of filters, system outputs, provenance tracking, system performance, and the user interface.
Abstract
Motivation
A factory in a metabolic network specifies how to produce target molecules from source compounds through biochemical reactions, properly accounting for reaction stoichiometry to ...conserve or not deplete intermediate metabolites. While finding factories is a fundamental problem in systems biology, available methods do not consider the number of reactions used, nor address negative regulation.
Methods
We introduce the new problem of finding optimal factories that use the fewest reactions, for the first time incorporating both first- and second-order negative regulation. We model this problem with directed hypergraphs, prove it is NP-complete, solve it via mixed-integer linear programming, and accommodate second-order negative regulation by an iterative approach that generates next-best factories.
Results
This optimization-based approach is remarkably fast in practice, typically finding optimal factories in a few seconds, even for metabolic networks involving tens of thousands of reactions and metabolites, as demonstrated through comprehensive experiments across all instances from standard reaction databases.
Availability and implementation
Source code for an implementation of our new method for optimal factories with negative regulation in a new tool called Odinn, together with all datasets, is available free for non-commercial use at http://odinn.cs.arizona.edu.
Abstract
Motivation
Protein secondary structure prediction is a fundamental precursor to many bioinformatics tasks. Nearly all state-of-the-art tools when computing their secondary structure ...prediction do not explicitly leverage the vast number of proteins whose structure is known. Leveraging this additional information in a so-called template-based method has the potential to significantly boost prediction accuracy.
Method
We present a new hybrid approach to secondary structure prediction that gains the advantages of both template- and non-template-based methods. Our core template-based method is an algorithmic approach that uses metric-space nearest neighbor search over a template database of fixed-length amino acid words to determine estimated class-membership probabilities for each residue in the protein. These probabilities are then input to a dynamic programming algorithm that finds a physically valid maximum-likelihood prediction for the entire protein. Our hybrid approach exploits a novel accuracy estimator for our core method, which estimates the unknown true accuracy of its prediction, to discern when to switch between template- and non-template-based methods.
Results
On challenging CASP benchmarks, the resulting hybrid approach boosts the state-of-the-art Q8 accuracy by more than 2–10%, and Q3 accuracy by more than 1–3%, yielding the most accurate method currently available for both 3- and 8-state secondary structure prediction.
Availability and implementation
A preliminary implementation in a new tool we call Nnessy is available free for non-commercial use at http://nnessy.cs.arizona.edu.