Over the last several years, the field of natural language processing has been propelled forward by an explosion in the use of deep learning models. This article provides a brief introduction to the field and a quick overview of deep learning architectures and methods. It then sifts through the plethora of recent studies and summarizes a large assortment of relevant contributions. Analyzed research areas include several core linguistic processing issues in addition to many applications of computational linguistics. A discussion of the current state of the art is then provided along with recommendations for future research in the field.
Hyperspectral image (HSI) and multispectral image (MSI) fusion, which fuses a low-spatial-resolution HSI (LR-HSI) with a high-spatial-resolution MSI, has become a common scheme to obtain a high-resolution HSI (HR-HSI). This article presents a novel HSI and MSI fusion method (called CNN-Fus), which is based on subspace representation and a convolutional neural network (CNN) denoiser, i.e., a well-trained CNN for gray-image denoising. Our method only needs to train the CNN on the more accessible gray images and can be directly applied to any HSI and MSI datasets without retraining. First, to exploit the high correlations among the spectral bands, we approximate the desired HR-HSI as a low-dimensional subspace multiplied by coefficients, which not only speeds up the algorithm but also leads to more accurate recovery. Since the spectral information mainly resides in the LR-HSI, we learn the subspace from it via singular value decomposition. Due to the powerful learning performance and high speed of CNNs, we use the well-trained gray-image denoising CNN to regularize the estimation of the coefficients. Specifically, we plug the CNN denoiser into the alternating direction method of multipliers (ADMM) algorithm to estimate the coefficients. Experiments demonstrate that our method outperforms state-of-the-art fusion methods.
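The SVD subspace step described in the abstract can be illustrated in a few lines. This is a minimal sketch, not the authors' implementation: it learns a spectral basis from a toy rank-3 data cube (the 31-band shape and the synthetic cube are assumptions for illustration) and projects pixel spectra onto it.

```python
import numpy as np

def learn_subspace(lr_hsi, k):
    """Learn a k-dimensional spectral subspace from an LR-HSI via SVD.

    lr_hsi: array of shape (H, W, B) with B spectral bands.
    Returns E of shape (B, k) with orthonormal columns.
    """
    H, W, B = lr_hsi.shape
    X = lr_hsi.reshape(-1, B).T            # (B, H*W): each column is one pixel spectrum
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]                        # leading k left singular vectors

# Toy cube whose spectra truly live in a 3-dimensional subspace.
rng = np.random.default_rng(0)
basis = rng.standard_normal((31, 3))       # 31 bands, rank-3 spectral structure
coeffs = rng.standard_normal((3, 16 * 16))
cube = (basis @ coeffs).T.reshape(16, 16, 31)

E = learn_subspace(cube, k=3)
X = cube.reshape(-1, 31).T
reconstruction = E @ (E.T @ X)             # project spectra onto the learned subspace
err = np.linalg.norm(X - reconstruction) / np.linalg.norm(X)
```

Because the toy spectra have exact rank 3, projecting onto the learned 3-dimensional subspace reconstructs them essentially perfectly; in the method above, the fusion then only has to estimate the low-dimensional coefficients rather than the full cube.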
Indexing highly repetitive texts—such as genomic databases, software repositories, and versioned text collections—has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and locates them in optimal time as well, O(m + occ). By further raising the space by an O(w / log σ) factor, where σ is the alphabet size and w = Ω(log n) is the RAM machine word size in bits, we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in the almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time O(log(n/r)), and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1–2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1–2 orders of magnitude in space and/or 2–3 in time.
Hierarchical Clustering Cohen-Addad, Vincent; Kanade, Varun; Mallmann-Trenn, Frederik et al.
Journal of the ACM,
08/2019, Volume: 66, Issue: 4
Journal Article
Peer-reviewed
Open Access
Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, Dasgupta framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a “good” hierarchical clustering is one that minimizes a particular cost function [23]. He showed that this cost function has certain desirable properties: to achieve optimal cost, disconnected components (namely, dissimilar elements) must be separated at higher levels of the hierarchy, and when the similarity between data elements is identical, all clusterings achieve the same cost.
We take an axiomatic approach to defining “good” objective functions for both similarity- and dissimilarity-based hierarchical clustering. We characterize a set of
admissible
objective functions having the property that when the input admits a “natural” ground-truth hierarchical clustering, the ground-truth clustering has an optimal value. We show that this set includes the objective function introduced by Dasgupta.
Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better and faster algorithms for hierarchical clustering. We also initiate a beyond worst-case analysis of the complexity of the problem and design algorithms for this scenario.
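Dasgupta's objective can be made concrete with a small sketch. The nested-tuple tree encoding and the toy similarities below are illustrative assumptions; the cost charges each pair's similarity by the number of leaves under the pair's lowest common ancestor, so trees that split dissimilar groups early (keeping similar pairs together until small subtrees) pay less.

```python
from itertools import combinations

def dasgupta_cost(tree, sim):
    """Dasgupta's similarity-based cost of a hierarchical clustering.

    tree: binary tree as nested tuples, leaves are point indices,
          e.g. ((0, 1), (2, 3)).
    sim:  dict mapping frozenset({i, j}) -> similarity w(i, j).
    Cost = sum over pairs {i, j} of w(i, j) * |leaves(lca(i, j))|.
    """
    def leaves(t):
        return [t] if isinstance(t, int) else leaves(t[0]) + leaves(t[1])

    def cost(t):
        if isinstance(t, int):
            return 0.0
        left, right = leaves(t[0]), leaves(t[1])
        size = len(left) + len(right)
        # Pairs separated at this node have their lca here, at subtree size `size`.
        split = sum(sim[frozenset((i, j))] for i in left for j in right)
        return size * split + cost(t[0]) + cost(t[1])

    return cost(tree)

# Two tight pairs {0,1} and {2,3} with weak cross-similarity.
sim = {frozenset(p): (1.0 if set(p) in ({0, 1}, {2, 3}) else 0.1)
       for p in combinations(range(4), 2)}
good = dasgupta_cost(((0, 1), (2, 3)), sim)   # ground-truth split first -> 5.6
bad = dasgupta_cost(((0, 2), (1, 3)), sim)    # splits the tight pairs early -> 9.2
```

The ground-truth tree separates the weak cross-pairs at the root (where the multiplier 4 applies only to small similarities) and is cheaper, matching the admissibility property above: a “natural” ground-truth hierarchy achieves the optimal value.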
Deep Neural Networks and Tabular Data: A Survey Borisov, Vadim; Leemann, Tobias; Seßler, Kathrin et al.
IEEE Transactions on Neural Networks and Learning Systems,
06/2024, Volume: 35, Issue: 6
Journal Article
Open Access
Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous datasets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains highly challenging. To facilitate further progress in the field, this work provides an overview of state-of-the-art deep learning methods for tabular data. We categorize these methods into three groups: data transformations, specialized architectures, and regularization models. For each of these groups, our work offers a comprehensive overview of the main approaches. Moreover, we discuss deep learning approaches for generating tabular data and also provide an overview of strategies for explaining deep models on tabular data. Thus, our first contribution is to address the main research streams and existing methodologies in the mentioned areas while highlighting relevant challenges and open research questions. Our second contribution is to provide an empirical comparison of traditional machine learning methods with 11 deep learning approaches across five popular real-world tabular datasets of different sizes and with different learning objectives. Our results, which we have made publicly available as competitive benchmarks, indicate that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that research progress on competitive deep learning models for tabular data is stagnating. To the best of our knowledge, this is the first in-depth overview of deep learning approaches for tabular data; as such, this work can serve as a valuable starting point to guide researchers and practitioners interested in deep learning with tabular data.
Foundations for Smarter Cities Harrison, C.; Eckman, B.; Hamilton, R. et al.
IBM Journal of Research and Development,
7/2010, Volume: 54, Issue: 4
Journal Article