SOAP2 is a significantly improved version of the short oligonucleotide alignment program that both reduces computer memory usage and increases alignment speed at an unprecedented rate. We used a Burrows Wheeler Transformation (BWT) compression index to substitute the seed strategy for indexing the reference sequence in the main memory. We tested it on the whole human genome and found that this new algorithm reduced memory usage from 14.7 to 5.4 GB and improved alignment speed by 20–30 times. SOAP2 is compatible with both single- and paired-end reads. Additionally, this tool now supports multiple text and compressed file formats. A consensus builder has also been developed for consensus assembly and SNP detection from alignment of short reads on a reference genome. Availability: http://soap.genomics.org.cn Contact: soap@genomics.org.cn
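To illustrate how a BWT-based index can replace seed lookup for exact matching, the following is a minimal Python sketch of backward search over a Burrows Wheeler transform. It is not SOAP2's implementation: the function names `build_index` and `find_exact` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, so it is only suitable for short toy sequences.

```python
def build_index(text):
    """Build a toy BWT index (hypothetical helper, not SOAP2 code)."""
    text += "$"  # sentinel, assumed absent from the input and smaller than any base
    # Naive suffix array: fine for a toy example, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for suffix 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(bwt))
    # first[c]: number of characters in the text that are strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def find_exact(index, pattern):
    """Return sorted start positions of exact matches of pattern."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):  # backward search: extend the match leftwards
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_index("ACGTACGA")
print(find_exact(index, "ACG"))
```

Each pattern character narrows the suffix-array range with two table lookups, so the search cost depends on the read length rather than on the size of the reference. Practical aligners additionally compress or sample the occurrence table and suffix array to keep memory low.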
With the emergence of artificial intelligence, deep learning techniques have been widely deployed in forecasting stock markets. However, existing deep-learning-based models for news-based forecasts of stock trends are mostly black-box and difficult to explain. How final predictions are made within these models remains unknown, making it hard to interpret why one prediction should be better than another. To provide explanations of predictions, this paper proposes to inject causal inference into model procedures and causally interpret predictions. We first generate a causal graph from financial news, and then integrate the information in the causal graph into a neural network model for stock trend prediction. Moreover, to better extract keywords from financial news, we introduce a novel keyword extraction method named Distinguishable Word Filtering by Kolmogorov–Smirnov Test (DWF-KST). The experimental results on five financial datasets demonstrate that our proposed model not only explicitly provides an interpretation of prediction results, but also outperforms the state-of-the-art methods. The achieved results boost predictions of S&P 500 2-category from 89.7% to 97.4%, 3-category from 77.4% to 82.5%, and 5-category from 61.5% to 71.6%. For the other two indexes, performance on the Dow index improves from 86.2% to 90.2% and on the Nasdaq index from 76.4% to 78.9%.
• We propose a method for stock trend prediction.
• We use causal inference based on unstructured data.
• We improve the interpretability of DL-based stock prediction models.
• We propose DWF-KST for keyword selection in financial news.
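The idea of filtering words with a Kolmogorov–Smirnov test can be sketched as follows. This is only an assumed reading of what "distinguishable word filtering" might involve, not the DWF-KST procedure itself: the function names, the use of per-document relative frequency as the compared quantity, the two-class split into "up" and "down" news, and the threshold value are all hypothetical choices made for illustration.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def distinguishable_words(docs_up, docs_down, threshold=0.75):
    """Keep words whose frequency distributions differ between the two classes.

    docs_up / docs_down are lists of non-empty token lists (assumed input format).
    The threshold on the raw statistic is an illustrative assumption; a p-value
    cutoff would be a more conventional choice for larger samples.
    """
    vocab = {w for doc in docs_up + docs_down for w in doc}
    kept = {}
    for w in vocab:
        freq_up = [doc.count(w) / len(doc) for doc in docs_up]
        freq_down = [doc.count(w) / len(doc) for doc in docs_down]
        d = ks_statistic(freq_up, freq_down)
        if d >= threshold:
            kept[w] = d
    return kept


up_news = [["profit", "rises", "shares"], ["profit", "beats", "shares"]]
down_news = [["loss", "widens", "shares"], ["loss", "falls", "shares"]]
print(sorted(distinguishable_words(up_news, down_news)))
```

In this toy corpus, "shares" appears equally often in both classes and is filtered out, while "profit" and "loss" each appear in only one class and are kept.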
Summary
The pentatricopeptide repeat (PPR) proteins form one of the largest protein families in land plants. They are characterised by tandem 30–40 amino acid motifs that form an extended binding surface capable of sequence‐specific recognition of RNA strands. Almost all of them are post‐translationally targeted to plastids and mitochondria, where they play important roles in post‐transcriptional processes including splicing, RNA editing and the initiation of translation. A code describing how PPR proteins recognise their RNA targets promises to accelerate research on these proteins, but making use of this code requires accurate definition and annotation of all of the various nucleotide‐binding motifs in each protein. We have used a structural modelling approach to define 10 different variants of the PPR motif found in plant proteins, in addition to the putative deaminase motif that is found at the C‐terminus of many RNA‐editing factors. We show that the super‐helical RNA‐binding surface of RNA‐editing factors is potentially longer than previously recognised. We used the redefined motifs to develop accurate and consistent annotations of PPR sequences from 109 genomes. We report a high error rate in PPR gene models in many public plant proteomes, due to gene fusions and insertions of spurious introns. These consistently annotated datasets across a wide range of species are valuable resources for future comparative genomics studies, and an essential pre‐requisite for accurate large‐scale computational predictions of PPR targets. We have created a web portal (http://www.plantppr.com) that provides open access to these resources for the community.
Significance Statement
Accurate prediction of the RNA targets of pentatricopeptide repeat (PPR) proteins requires accurate annotation of their RNA‐binding motifs. Here we used structural modelling to define 10 different variants of PPR motifs in plant proteins and used these redefined motifs to develop improved annotations of PPRs from 109 genomes. These consistently annotated datasets are valuable resources for comparative genomics studies and for large‐scale prediction of PPR function.
Deep learning has attracted a lot of attention and has been applied successfully in many areas such as bioinformatics, image processing, game playing and computer security. On the other hand, deep learning usually requires a lot of training data, which may not be available from a single owner. As the volume of data grows, it is common for users to store their data in a third-party cloud. To preserve confidentiality, the data are usually stored in encrypted form. To apply deep learning to such datasets owned by multiple data owners in the cloud, we need to tackle two challenges: (i) the data are encrypted with different keys, and all operations, including intermediate results, must be secure; and (ii) the computational cost and the communication cost of the data owner(s) should be kept minimal. In our work, we propose two schemes to solve the above problems. We first present a basic scheme based on multi-key fully homomorphic encryption (MK-FHE), then we propose an advanced scheme based on a hybrid structure that combines the double decryption mechanism and fully homomorphic encryption (FHE). We also prove that these two multi-key privacy-preserving deep learning schemes over encrypted data are secure.
• In the basic scheme, we use MK-FHE as our privacy-preserving technique. Only the decrypt operation requires interaction among data owners.
• In the advanced scheme, we propose a hybrid structure scheme by combining the double decryption mechanism and FHE.
• In the advanced scheme, only the encrypt and decrypt algorithms are performed by data providers.
• We prove that these two multi-key privacy-preserving deep learning schemes over encrypted data are secure.
Federated learning, a privacy-preserving collaborative machine learning paradigm, has led to the proposal of various incentive mechanisms to encourage active participation of data owners. However, most of the existing mechanisms focus on the monopsony market scenario, where only one server-side entity (buyer) is involved. In real-world scenarios, multiple server parties may express simultaneous interest in the data of a client (seller), leading to a non-monopoly market. This paper aims to bridge this gap by introducing the concept of incentivizing federated learning in a non-monopoly market and presents a non-monopoly federated learning incentive mechanism, coined NmFLI. NmFLI employs a double-auction mechanism to implement federated learning incentives and utilizes the Vickrey–Clarke–Groves (VCG) mechanism to ensure client trustworthiness. Additionally, NmFLI devises a method for measuring data quality by calculating the value of clients based on their historical performance, which effectively balances accuracy and computational complexity. We demonstrate that NmFLI possesses properties such as individual rationality and strategy-proofness. Experimental results indicate that NmFLI can effectively incentivize federated learning and achieve higher accuracy than baseline models across various scenarios. For example, when the objectives of various tasks overlap, NmFLI outperforms the best baseline by 3.09% with imbalanced client data while maintaining the same data size. Moreover, NmFLI surpasses the best baseline by 6.12% with different amounts of client data.
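To make the double-auction setting concrete, here is a minimal Python sketch of a classic trade-reduction double auction between servers (buyers) and clients (sellers). It is a stand-in for illustration only and is not the NmFLI mechanism: it does not use VCG payments or data-quality scores, and the function name, the single-unit-per-participant assumption, and the example bids are hypothetical.

```python
def trade_reduction(bids, asks):
    """Match buyers and sellers with a trade-reduction double auction.

    bids: {buyer_id: price the buyer offers}   (servers in this setting)
    asks: {seller_id: price the seller wants}  (clients in this setting)
    Returns (matches, buyer_price, seller_price); prices are None if no trade.
    """
    buyers = sorted(bids.items(), key=lambda kv: -kv[1])  # highest bid first
    sellers = sorted(asks.items(), key=lambda kv: kv[1])  # lowest ask first
    # k = number of pairs for which the bid still covers the ask.
    k = 0
    while k < min(len(buyers), len(sellers)) and buyers[k][1] >= sellers[k][1]:
        k += 1
    if k < 2:
        return [], None, None  # excluding the marginal pair leaves nothing to trade
    # The k-th (marginal) pair is excluded from trading and only sets the prices,
    # so no trading participant's own report determines the price it faces.
    buyer_price = buyers[k - 1][1]
    seller_price = sellers[k - 1][1]
    matches = [(buyers[i][0], sellers[i][0]) for i in range(k - 1)]
    return matches, buyer_price, seller_price


bids = {"s1": 10, "s2": 8, "s3": 4}
asks = {"c1": 3, "c2": 6, "c3": 9}
print(trade_reduction(bids, asks))
```

In this example every trading buyer pays no more than its bid and every trading seller receives no less than its ask, which is the individual-rationality property the abstract refers to. The price of that guarantee in this simple scheme is that the marginal, least profitable trade is given up.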
Identification of antibiotic resistance genes (ARGs) from environmental samples has been a critical sub-domain of gene discovery that is directly connected to human health. Antibiotic resistance has drawn extraordinary attention in recent years and is regarded as a severe threat to human health by many institutions around the world. To satisfy the need for efficient ARG discovery, a series of online antibiotic resistance gene databases have been published. This article conducts an in-depth analysis of CARD, one of the most widely used ARG databases.
The decision model of CARD is based on the alignment score with a single ARG type. We identify cases where the model is likely to make false predictions, and then propose an optimization method on top of the current CARD model. The optimization is expected to improve consistency with BLAST homology relationships and to increase confidence in the identification of ARGs using the database.
The absence of a publicly recognized benchmark makes it challenging to evaluate the performance of ARG identification. However, possible wrong predictions and methods for resolving them can be inferred by computational analysis of the identification method and the underlying reference sequences. We hope our work can offer insight into precise ARG type classification.
With the growing popularity of outsourcing data to cloud servers, data privacy has become a central consideration. Because encryption alone has been shown to be insecure owing to the leakage of access patterns, Oblivious RAM (ORAM) was proposed to hide where, when and how often a data block has been accessed. However, different types of ORAM implementations have different limitations, such as significant bandwidth cost or massive storage space, making them impractical for some applications like the Internet of Things (IoT).
In this paper, we present a practical ORAM with constant bandwidth, called HybridORAM, which can be applied across a wide range of applications. HybridORAM explores a new ORAM design that combines the advantages of layer and tree ORAMs; more specifically, it combines the frequently accessed small levels of the former to improve the response time, and the small shuffle of the latter to save storage capacity. Compared to typical schemes, HybridORAM has an efficient response time reduced by O(log k), a low bandwidth cost optimized from O(log N · B) to O(B), and small client storage, where k is the level size factor, B is the block size, and N is the number of real blocks in the ORAM. Experiments show that the response time of HybridORAM is 50.3% shorter than that of OnionORAM and 34.8% shorter than that of OS-PIR under practical parameters.
SOAP3 is the first short read alignment tool that leverages the multi-processors in a graphics processing unit (GPU) to achieve a drastic improvement in speed. We adapted the compressed full-text index (BWT) used by SOAP2 in view of the advantages and disadvantages of the GPU. When tested with millions of Illumina HiSeq 2000 length-100 bp reads, SOAP3 takes < 30 s to align a million read pairs onto the human reference genome and is at least 7.5 and 20 times faster than BWA and Bowtie, respectively. For aligning reads with up to four mismatches, SOAP3 aligns slightly more reads than BWA and Bowtie; this is because SOAP3, unlike BWA and Bowtie, is not heuristic-based and always reports all answers.
The adversarial vulnerability of convolutional neural networks (CNNs) refers to the performance degradation of CNNs under adversarial attacks, leading to incorrect decisions. However, the causes of adversarial vulnerability in CNNs remain unknown. To address this issue, we propose a unique cross-scale analytical approach from a statistical physics perspective. It reveals that the large number of nonlinear effects inherent in CNNs is the fundamental cause of the formation and evolution of system vulnerability. Vulnerability forms spontaneously at the macroscopic level after the symmetry of the system is broken through the nonlinear interaction between microscopic state order parameters. We develop a cascade failure algorithm that visualizes how micro perturbations of neurons' activations can cascade and influence macro decision paths. Our empirical results demonstrate the interplay between microlevel activation maps and macrolevel decision-making and provide a statistical physics perspective for understanding the causality behind CNN vulnerability. Our work will help subsequent research to improve the adversarial robustness of CNNs.