As next-generation sequencing (NGS) technologies produce hundreds of millions of reads every day, a tremendous computational challenge is to map NGS reads to a given reference genome efficiently. However, existing all-mappers, which aim to find all mapping locations of each read, are very time-consuming. The majority of existing all-mappers consist of two main parts, filtration and verification. This work significantly reduces verification time, which is the dominant part of the running time.
An efficient all-mapper, BitMapper, is developed based on a new vectorized bit-vector algorithm, which simultaneously calculates the edit distance of one read to multiple locations in a given reference genome. Experimental results on both simulated and real data sets show that BitMapper is several times to an order of magnitude faster than current state-of-the-art all-mappers, while achieving higher sensitivity, i.e., higher-quality solutions.
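For context on the verification step, the sketch below implements Myers' classic bit-vector algorithm, which computes edit distances with word-level bitwise operations and is the kind of primitive a vectorized verifier builds on. It is a minimal Python illustration, not BitMapper's actual C implementation, and all names are ours; BitMapper's contribution is to evaluate several candidate locations simultaneously by packing them into vector lanes, whereas this sketch scans a single text.

```python
def myers_distances(pattern, text):
    """For each position j in text, return the minimum edit distance between
    pattern and any substring of text ending at j (Myers' bit-vector algorithm)."""
    m = len(pattern)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    # peq[c]: bitmask with bit i set iff pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = mask, 0, m   # vertical +1/-1 delta vectors, current score
    out = []
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        out.append(score)
    return out

# Toy usage: locate where "ACGT" matches best within a longer sequence.
print(myers_distances("ACGT", "TTACGTTT"))  # minimum at the position ending the match
```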
We present BitMapper, which is designed to return all mapping locations of raw reads containing indels as well as mismatches. BitMapper is implemented in C under a GPL license. Binaries are freely available at http://home.ustc.edu.cn/%7Echhy.
Dynamic parallelism (DP) is a promising GPU feature that allows on-demand spawning of kernels on the GPU without any CPU intervention. However, this feature has two major drawbacks. First, the launching of GPU kernels can incur significant performance penalties. Second, dynamically generated kernels are not always able to efficiently utilize the GPU cores due to hardware limits. To address these two concerns cohesively, we propose SPAWN, a runtime framework that controls the dynamically generated kernels, thereby directly reducing the associated launch overheads and queuing latency. Moreover, it allows a better mix of dynamically generated and original (parent) kernels, enabling the scheduler to effectively hide the remaining overheads and improve the utilization of GPU resources. Our results show that, across 13 benchmarks, SPAWN achieves 69% and 57% speedups over the flat (non-DP) implementation and baseline DP, respectively.
For several emerging memory technologies, a natural formulation of memory arrays (cross-point) provides nearly symmetric access costs along multiple (e.g., both row and column) dimensions, in contrast to the row-oriented nature of most DRAM and SRAM implementations, producing a Multi-Dimensional-Access (MDA) memory. While MDA memories can directly support applications with both row and column preferences, most modern processors do not directly access either the rows or columns of memories: memory accesses proceed through a cache hierarchy that abstracts many of the physical features that supply the aforementioned symmetry. To reap the full benefits of MDA memories, a co-design approach must span the software memory layout, the mapping between the physical and logical organization of the memory arrays, and the cache hierarchy itself in order to efficiently express, convey, and exploit multidimensional access patterns.
In this paper, we describe a taxonomy of ways to connect row and column preferences at the application level to an MDA memory through an MDA cache hierarchy, and we explore specific implementations for the most plausible design points. We extend vectorization support at the compiler level to extract these preferences and provide compatible memory layouts, and we evaluate the tradeoffs among multiple cache designs for MDA memory systems. Our results indicate that logically 2-D caching using physically 1-D SRAM structures and on-chip physically 2-D caches can both provide significant performance improvements over a traditional cache system interfacing with an MDA memory, reducing execution time by 72% and 65%, respectively. We then explore the sensitivity of these benefits to the working-set-to-cache-capacity ratio as well as to MDA technology assumptions.
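As a minimal illustration of the row and column preferences the taxonomy targets (not the paper's compiler extension; the array sizes and kernels below are our own):

```python
import numpy as np

N = 4096
A = np.arange(N * N, dtype=np.float32).reshape(N, N)  # row-major (C-order) layout

# Row-preferring kernel: the inner reduction walks consecutive addresses
# (stride 1), which conventional caches and DRAM rows serve well.
row_sums = np.empty(N, dtype=np.float32)
for i in range(N):
    row_sums[i] = A[i, :].sum()

# Column-preferring kernel: each access jumps N elements (stride N),
# typically touching a new cache line per element on a row-oriented
# hierarchy. A cross-point MDA memory can serve this stride at the same
# array-level cost as the row-wise one, provided the cache hierarchy
# conveys the column preference instead of fetching row-shaped lines.
col_sums = np.empty(N, dtype=np.float32)
for j in range(N):
    col_sums[j] = A[:, j].sum()
```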
Modern-day drug discovery is extremely expensive and time consuming. Although computational approaches help accelerate and decrease the cost of drug discovery, existing computational software packages for docking-based drug discovery suffer from both low accuracy and high latency. A few recent machine learning-based approaches have been proposed for virtual screening by improving the ability to evaluate protein–ligand binding affinity, but such methods rely heavily on conventional docking software to sample docking poses, which results in excessive execution latencies. Here, we propose and evaluate a novel graph neural network (GNN)-based framework, MedusaGraph, which includes both pose-prediction (sampling) and pose-selection (scoring) models. Unlike previous machine learning-centric studies, MedusaGraph generates docking poses directly and achieves a 10x to 100x speedup over state-of-the-art approaches while attaining slightly better docking accuracy.
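To make the pose-prediction idea concrete, the toy PyTorch layer below performs one message-passing step that outputs a 3-D displacement per atom, so poses are generated by iterated coordinate updates rather than by an external docking sampler. This is a hypothetical sketch; MedusaGraph's actual architecture, features, and training procedure differ.

```python
import torch
import torch.nn as nn

class PoseUpdateLayer(nn.Module):
    """One message-passing step that nudges atom coordinates.
    Illustrative sketch only, not MedusaGraph's real model."""
    def __init__(self, feat_dim):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * feat_dim + 1, feat_dim), nn.ReLU())
        self.delta = nn.Linear(feat_dim, 3)  # per-atom 3-D displacement

    def forward(self, x, pos, edges):
        # x: (n_atoms, feat_dim) atom features; pos: (n_atoms, 3) coordinates
        # edges: (n_edges, 2) (src, dst) pairs for atoms within a distance cutoff
        src, dst = edges[:, 0], edges[:, 1]
        dist = (pos[src] - pos[dst]).norm(dim=1, keepdim=True)
        m = self.msg(torch.cat([x[src], x[dst], dist], dim=1))
        agg = torch.zeros_like(x).index_add_(0, dst, m)  # sum messages per atom
        return pos + self.delta(agg)  # predicted new coordinates

# Toy usage: 10 atoms, 16-dim features, random edges; iterate to refine a pose.
layer = PoseUpdateLayer(16)
x, pos = torch.randn(10, 16), torch.randn(10, 3)
edges = torch.randint(0, 10, (30, 2))
new_pos = layer(x, pos, edges)
```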
Virtual screening is a key enabler of computational drug discovery and requires accurate and efficient structure-based molecular docking. In this work, we develop algorithms and software building blocks for molecular docking that can take advantage of graphics processing units (GPUs). Specifically, we focus on MedusaDock, a flexible protein–small-molecule docking approach and platform. We accelerate the coarse docking phase of MedusaDock, as this step constitutes nearly 70% of the total running time in typical use cases. We perform a comprehensive evaluation of quality and performance with single-GPU and multi-GPU acceleration using a data set of 3875 protein–ligand complexes. The algorithmic ideas, data-structure design choices, and performance optimization techniques shed light on GPU acceleration of other structure-based molecular docking software tools.
High-performance computational techniques have brought significant benefits to drug discovery efforts in recent decades. One of the most challenging problems in drug discovery is protein–ligand binding pose prediction. To predict the most stable structure of the complex, the performance of conventional structure-based molecular docking methods depends heavily on the accuracy of the scoring or energy functions (as approximations of affinity) for each pose of the protein–ligand complex, which must effectively guide the search in an exponentially large solution space. However, due to the heterogeneity of molecular structures, existing scoring methods are either tailored to a particular data set or fail to exhibit high accuracy. In this paper, we propose a convolutional neural network (CNN)-based model that learns to predict the stability factor of a protein–ligand complex and demonstrates the ability of CNNs to improve existing docking software. Evaluation on the PDBbind data set indicates that our approach reduces the execution time of the traditional docking-based method while improving accuracy. Our code, experiment scripts, and pretrained models are available at https://github.com/j9650/MedusaNet.
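A minimal sketch of a CNN-based pose scorer of this general kind (voxelize the complex, convolve, regress a stability score) follows in PyTorch; the channel layout, grid sizes, and names are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class PoseScoreCNN(nn.Module):
    """Maps a voxelized protein-ligand pose to a scalar stability score.
    Illustrative only; the paper's exact architecture differs."""
    def __init__(self, in_channels=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(64, 1)  # scalar stability factor

    def forward(self, grid):
        # grid: (batch, channels, D, H, W); channels encode atom types of
        # protein and ligand atoms rasterized onto a 3-D grid.
        return self.head(self.features(grid).flatten(1))

# Toy usage: batch of 4 candidate poses on a 24^3 grid, 8 atom-type channels.
model = PoseScoreCNN()
scores = model(torch.randn(4, 8, 24, 24, 24))  # rank poses by predicted stability
```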
A wireless sensor network (WSN) provides barrier coverage over an area of interest if no intruder can enter the area without being detected by the WSN. Recently, the barrier-coverage model has received considerable attention. In reality, sensor nodes may fail to detect objects within their sensing ranges for many reasons, and thus such a barrier of sensors may have temporal loopholes. For a WSN in border surveillance applications, it is reasonable to assume that intruders are smart enough to identify such loopholes in the barrier and penetrate it. Once a loophole is found, other intruders are likely to keep using it until the known path turns out to be insecure due to increased security. In this paper, we investigate the potential of mobile sensor nodes, such as unmanned aerial vehicles and human patrols, to fortify the barrier-coverage quality of a WSN of cheap, static sensor nodes. For this purpose, we first use a single-variable first-order grey model, GM(1,1), based on the intruder detection history from the sensor nodes, to determine which parts of the barrier are more vulnerable. Then, we relocate the available mobile sensor nodes to the identified vulnerable parts of the barrier in a timely manner, and we prove that this relocation strategy is optimal. Through simulations, we evaluate the effectiveness of our algorithm.
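For concreteness, a textbook GM(1,1) accumulates the observation series, fits the grey parameters a and b by least squares over the background values, and forecasts by inverting the accumulation. The sketch below is a generic implementation of that standard model, not the authors' code; the per-segment detection history is a made-up example.

```python
import numpy as np

def gm11_forecast(x0, steps=1):
    """Single-variable first-order grey model GM(1,1).
    x0: 1-D sequence of non-negative observations (e.g., per-segment
    intruder detections); returns forecasts for the next `steps` values."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    x1 = np.cumsum(x0)                          # accumulated generating operation
    z1 = 0.5 * (x1[1:] + x1[:-1])               # background values
    B = np.column_stack([-z1, np.ones(n - 1)])  # design matrix for x0(k) + a*z1(k) = b
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]
    def x1_hat(k):  # fitted accumulated sequence, k = 0, 1, 2, ...
        return (x0[0] - b / a) * np.exp(-a * k) + b / a
    ks = np.arange(n, n + steps)
    return x1_hat(ks) - x1_hat(ks - 1)          # inverse accumulation -> forecasts

# Toy usage: forecast next-interval detections for one barrier segment.
history = [5, 7, 8, 11, 14]
print(gm11_forecast(history, steps=2))
```

Segments with the highest forecast detection counts would then be treated as the vulnerable parts toward which the mobile nodes are relocated.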
We propose a morphable convolution framework that can be applied to irregularly shaped regions of an input feature map. This framework reduces the computational footprint of a regular CNN operation in the context of biomedical semantic image segmentation. Traditional CNN-based approaches achieve high accuracy but suffer from high training and inference costs compared to conventional edge-detection-based approaches. In this work, we combine the concept of morphable convolution with edge detection algorithms, resulting in a hierarchical framework that first detects edges and then generates a layer-wise annotation map. The annotation map guides the convolution operation to run only on a small, useful fraction of pixels in the feature map. We evaluate our framework on three cell-tracking datasets, and the experimental results indicate that it saves ~30% and ~10% of execution time on CPU and GPU, respectively, without loss of accuracy, compared to baseline conventional CNN approaches.
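A minimal sketch of the guiding idea, convolving only at pixels flagged by an annotation map, appears below; the edge detector, mask threshold, and function names are our illustrative choices, not the framework's implementation.

```python
import numpy as np

def masked_conv2d(x, kernel, mask):
    """Evaluate a 2-D convolution only where mask is True; other outputs
    stay zero. Minimal sketch of annotation-guided convolution; a real
    kernel would vectorize or batch this inner loop."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    ys, xs = np.nonzero(mask)               # only the annotated pixels
    for i, j in zip(ys, xs):
        out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

# Toy usage: a crude edge map selects a small fraction of pixels, and the
# subsequent convolution runs only there.
x = np.random.rand(64, 64).astype(np.float32)
gy = np.abs(np.diff(x, axis=0, prepend=x[:1]))   # crude edge strength
mask = gy > np.quantile(gy, 0.9)                 # keep the top 10% of pixels
k = np.ones((3, 3), dtype=np.float32) / 9.0
y = masked_conv2d(x, k, mask)                    # ~90% of positions skipped
```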
The ever-growing complexity and popularity of machine learning and deep learning applications have created an urgent need for effective and efficient support for these applications on contemporary computing systems. In this paper, we thoroughly analyze several DNN algorithms on three widely used architectures (CPU, GPU, and Xeon Phi). The DNN algorithms we choose for evaluation include i) Unet, for biomedical image segmentation, based on a Convolutional Neural Network (CNN); ii) NMT, for neural machine translation, based on a Recurrent Neural Network (RNN); iii) ResNet-50; and iv) DenseNet, the latter two for image processing based on CNNs. The ultimate goal of this paper is to answer four fundamental questions: i) do different DNNs exhibit similar behavior on a given execution platform? ii) does a given DNN exhibit different behaviors across different platforms? iii) for the same execution platform and the same DNN, do different execution phases have different behaviors? and iv) are the current major general-purpose platforms tuned sufficiently well for different DNN algorithms? Motivated by these questions, we conduct an in-depth investigation of running DNN applications on modern systems. Specifically, we first identify the most time-consuming (hotspot) functions across different networks and platforms. Next, we characterize performance bottlenecks and discuss them in detail. Finally, we port selected hotspot functions to a cycle-accurate simulator and use the results to direct architectural optimizations that better support DNN applications.
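As a toy illustration of the hotspot-identification step (the study itself profiles full networks on CPU, GPU, and Xeon Phi with platform-level tools; the stand-in workload and names below are ours):

```python
import cProfile
import pstats

import torch
import torch.nn as nn

# Stand-in workload: one training step of a small CNN, profiled to rank
# the most time-consuming functions, as one would for a real network.
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(32 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step():
    x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()

cProfile.run("train_step()", "step.prof")
pstats.Stats("step.prof").sort_stats("cumulative").print_stats(10)  # top hotspots
```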
Recurrent Neural Networks (RNNs), more specifically their Long Short-Term Memory (LSTM) variants, have been widely used as a deep learning tool for tackling sequence-based learning tasks in text and speech. Training such LSTM applications is computationally intensive due to the recurrent nature of the hidden-state computation, which repeats at each time step. While sparsity in deep neural nets has been widely seen as an opportunity to reduce computation time in both the training and inference phases, the use of non-ReLU activations in LSTM RNNs leaves the dynamic sparsity associated with neuron activations and gradient values limited or non-existent. In this work, we identify dropout-induced sparsity as a suitable mode of computation reduction for LSTMs. Dropout is a widely used regularization mechanism that randomly drops computed neuron values during each iteration of training. We propose to structure the dropout patterns by dropping out the same set of physical neurons within a batch, resulting in column-level (row-level) hidden-state sparsity that is well amenable to run-time computation reduction on general-purpose SIMD hardware as well as systolic arrays. We conduct experiments on three representative NLP tasks: language modelling on the PTB dataset, OpenNMT-based machine translation on the IWSLT De-En and En-Vi datasets, and named-entity-recognition sequence labelling on the CoNLL-2003 shared task. We demonstrate that our proposed approach translates dropout-based computation reduction into reduced training time, with speedups ranging from 1.23x to 1.64x, without sacrificing the target metric.
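A minimal PyTorch sketch of the contrast follows: standard dropout draws an independent mask per element, while the structured variant broadcasts one hidden-unit mask across the batch, zeroing whole columns of the hidden-state matrix so the matching weight rows and columns can be skipped. Names and scaling here are illustrative, not the authors' implementation.

```python
import torch

def structured_dropout(h, p=0.3, training=True):
    """Drop the same hidden units for every sequence in the batch.
    h: (batch, hidden). A single (hidden,)-shaped mask is broadcast over
    the batch, zeroing whole columns of h, so the matching portions of
    the recurrent matrix products can be skipped at run time.
    Sketch of the structured-dropout idea; scaling follows inverted dropout."""
    if not training or p == 0.0:
        return h
    keep = (torch.rand(h.size(1), device=h.device) >= p).to(h.dtype)
    return h * keep / (1.0 - p)

# Standard dropout leaves no exploitable structure; here every batch row
# shares one column mask, so dropped columns are zero for the whole batch:
h = torch.randn(8, 6)
hd = structured_dropout(h, p=0.5)
print((hd == 0).all(dim=0))  # True exactly for the dropped columns
```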