We study the problem of computing conjunctive queries over large databases on parallel architectures without shared storage. Using the structure of such a query q and the skew in the data, we study tradeoffs between the number of processors, the number of rounds of communication, and the per-processor load (the number of bits each processor can send or receive in a single round) that are required to compute q. Since each processor must store its received bits, the load is at most the number of bits of storage per processor.
When the data are free of skew, we obtain essentially tight upper and lower bounds for one-round algorithms, and we show how the bounds degrade when there is skew in the data. In the case of skewed data, we show how to improve the algorithms when approximate degrees of the (necessarily small number of) heavy-hitter elements are available, obtaining essentially optimal algorithms for queries such as skewed simple joins and skewed triangle join queries.
For queries that we identify as treelike, we also prove nearly matching upper and lower bounds for multi-round algorithms over a natural class of skew-free databases. One consequence of these latter lower bounds is that for any ε > 0, using p processors to compute the connected components of a graph, or to output the path, if any, between a specified pair of vertices of a graph with m edges, with per-processor load that is O(m/p^{1−ε}), requires Ω(log p) rounds of communication.
Our upper bounds are given by simple structured algorithms using MapReduce. Our one-round lower bounds are proved in a very general model, which we call the Massively Parallel Communication (MPC) model, that allows processors to communicate arbitrary bits. Our multi-round lower bounds apply in a restricted version of the MPC model, in which processors in rounds after the first are only allowed to send tuples.
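As a concrete illustration, a single round of the classic hash-partitioned join fits this model: each machine receives the tuples whose join key hashes to it, and on skew-free data the resulting load is O(m/p) tuples. A minimal sketch (the relation encodings and machine count are illustrative assumptions, not the paper's algorithm):

```python
from collections import defaultdict

def mpc_hash_join_round(R, S, p):
    """One communication round of a distributed hash join on R(x, y), S(y, z):
    every tuple is routed to machine hash(join_key) % p. With skew-free data,
    each machine receives roughly (|R| + |S|) / p tuples."""
    machines = [defaultdict(list) for _ in range(p)]
    for t in R:
        machines[hash(t[1]) % p]['R'].append(t)   # route R by its y attribute
    for t in S:
        machines[hash(t[0]) % p]['S'].append(t)   # route S by its y attribute
    load = max(len(m['R']) + len(m['S']) for m in machines)
    # each machine joins its own fragment locally, with no further communication
    result = []
    for m in machines:
        index = defaultdict(list)
        for (x, y) in m['R']:
            index[y].append(x)
        for (y, z) in m['S']:
            for x in index[y]:
                result.append((x, y, z))
    return result, load
```

On uniform data the reported load stays near m/p; a heavy-hitter join key would instead force one machine to receive nearly all matching tuples, which is exactly the skew regime the bounds above address.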
Integrity constraints such as functional dependencies (FD) and multi-valued
dependencies (MVD) are fundamental in database schema design. Likewise,
probabilistic conditional independences (CI) are crucial for reasoning about
multivariate probability distributions. The implication problem studies whether
a set of constraints (antecedents) implies another constraint (consequent), and
has been investigated in both the database and the AI literature, under the
assumption that all constraints hold exactly. However, many applications today
consider constraints that hold only approximately. In this paper we define an
approximate implication as a linear inequality between the degree of
satisfaction of the antecedents and consequent, and we study the relaxation
problem: when does an exact implication relax to an approximate implication? We
use information theory to define the degree of satisfaction, and prove several
results. First, we show that any implication from a set of data dependencies
(MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most
quadratic in the number of variables; when the consequent is an FD, the factor
can be reduced to 1. Second, we prove that there exists an implication between
CIs that does not admit any relaxation; however, we prove that every
implication between CIs relaxes "in the limit". Then, we show that the
implication problem for differential constraints in market basket analysis also
admits a relaxation with a factor equal to 1. Finally, we show how some of the
results in the paper can be derived using the I-measure theory, which relates
between information theoretic measures and set theory. Our results recover, and
sometimes extend, previously known results about the implication problem: the
implication of MVDs and FDs can be checked by considering only 2-tuple
relations.
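To make the setting concrete: an FD X → Y holds exactly in a relation when the conditional entropy h(Y | X) of its empirical distribution is zero, and a relaxation bounds the consequent's entropy term by a linear combination of the antecedents'. A schematic instance with factor 1 (symbols illustrative):

```latex
% Exact implication: (X \to Y) \wedge (Y \to Z) \;\Rightarrow\; (X \to Z)
% relaxes to the linear inequality
h(Z \mid X) \;\le\; h(Y \mid X) + h(Z \mid Y),
% so if both antecedents hold approximately (the right-hand terms are small),
% the consequent X \to Z also holds approximately.
```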
XPath is a language for navigating an XML document and selecting a set of element nodes. XPath expressions are used to query XML data, describe key constraints, express transformations, and reference elements in remote documents. This article studies the containment and equivalence problems for a fragment of the XPath query language, with applications in all these contexts. In particular, we study a class of XPath queries that contain branching and label wildcards and can express descendant relationships between nodes. Prior work has shown that languages combining any two of these three features have efficient containment algorithms. However, we show that for the combination of all three features, containment is coNP-complete. We provide a sound and complete algorithm for containment that runs in exponential time, and study parameterized PTIME special cases. While we identify one parameterized class of queries for which containment can be decided efficiently, we also show that even with some bounded parameters, containment remains coNP-complete. In response to these negative results, we describe a sound algorithm that is efficient for all queries but may return false negatives in some cases.
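The sound-but-incomplete direction can be sketched with the classical homomorphism test on tree patterns: finding a homomorphism from q into p certifies p ⊑ q, but its absence does not refute containment for this fragment. A minimal sketch (the pattern encoding, with child '/' and descendant '//' axes and '*' wildcards, is an assumption for illustration):

```python
class Pattern:
    """A tree-pattern node: a label ('*' = wildcard) plus edges to children,
    each edge marked '/' (child axis) or '//' (descendant axis)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)  # list of (axis, Pattern)

def _descendants(p):
    # every node below p in the pattern is a descendant in any embedding,
    # since '/' edges denote one step and '//' edges denote one or more
    for _, c in p.children:
        yield c
        yield from _descendants(c)

def _maps_to(q, p):
    # a concrete label in q must match exactly; '*' in q matches anything
    if q.label != '*' and q.label != p.label:
        return False
    for axis, qc in q.children:
        if axis == '/':
            cands = [c for a, c in p.children if a == '/']
        else:
            cands = list(_descendants(p))
        if not any(_maps_to(qc, c) for c in cands):
            return False
    return True

def contains(p, q):
    """Sound test for p ⊑ q: True means every document matching p matches q;
    False may be a false negative (the test is incomplete in general)."""
    return _maps_to(q, p)
```

For example, the pattern a/b is contained in a//* (any document with an a-rooted b child has some descendant under a), and the homomorphism test certifies this.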
We present a constant-round algorithm in the massively parallel computation
(MPC) model for evaluating a natural join where every input relation has two
attributes. Our algorithm achieves a load of $\tilde{O}(m/p^{1/\rho})$ where
$m$ is the total size of the input relations, $p$ is the number of machines,
$\rho$ is the join's fractional edge covering number, and $\tilde{O}(.)$ hides
a polylogarithmic factor. The load matches a known lower bound up to a
polylogarithmic factor. At the core of the proposed algorithm is a new theorem
(which we name the "isolated cartesian product theorem") that provides fresh
insight into the problem's mathematical structure. Our result implies that the
subgraph enumeration problem, where the goal is to report all the occurrences
of a constant-sized subgraph pattern, can be settled optimally (up to a
polylogarithmic factor) in the MPC model.
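For intuition, ρ is the optimum of a small linear program, and for binary relations that LP has a half-integral optimum, so it can be brute-forced exactly on small queries; the triangle query, for instance, has ρ = 3/2, giving load Õ(m/p^{2/3}). An illustrative sketch:

```python
from itertools import product

def fractional_edge_cover_number(vertices, edges):
    """Brute-force the fractional edge covering number rho of a query whose
    relations are the given binary edges. For binary relations the LP optimum
    is half-integral, so searching weights in {0, 0.5, 1} is exact."""
    best = None
    for w in product((0.0, 0.5, 1.0), repeat=len(edges)):
        # feasibility: every variable must be covered with total weight >= 1
        if all(sum(wi for wi, e in zip(w, edges) if v in e) >= 1
               for v in vertices):
            total = sum(w)
            best = total if best is None else min(best, total)
    return best

def mpc_load(m, p, rho):
    """Per-machine load of the constant-round algorithm, up to polylog factors."""
    return m / p ** (1 / rho)
```

With m = 10^6 tuples and p = 64 machines, the triangle query's load evaluates to m/64^{2/3} = 62,500 tuples per machine.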
We study the complexity of computing a query on a probabilistic database. We consider unions of conjunctive queries (UCQ), which are equivalent to positive, existential first-order logic sentences, and also to nonrecursive datalog programs. The tuples in the database are independent random events. We prove the following dichotomy theorem: for every UCQ query, either its probability can be computed in polynomial time in the size of the database, or it is #P-hard. Our result also has applications to the problem of computing the probability of positive Boolean expressions, and establishes a dichotomy for such classes based on their structure. For the tractable case, we give a very simple algorithm that alternates between two steps: applying the inclusion/exclusion formula, and removing one existential variable. A key and novel feature of this algorithm is that it avoids computing terms that cancel out in the inclusion/exclusion formula; in other words, it only computes those terms whose Möbius function in an appropriate lattice is nonzero. We show that this simple feature is a key ingredient needed to ensure completeness. For the hardness proof, we give a reduction from the counting problem for positive, partitioned 2CNF, which is known to be #P-complete. The hardness proof is nontrivial, and combines techniques from logic, classical algebra, and analysis.
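To make the tractable case concrete, here is the variable-elimination computation for the simple safe query q() = R(x), S(x, y): eliminate y, then x, multiplying complement probabilities thanks to tuple independence. A sketch (the input encoding is an assumption, and this is a special case rather than the paper's general algorithm):

```python
from collections import defaultdict

def prob_R_join_S(R, S):
    """P(exists x, y: R(x) and S(x, y)) on a tuple-independent database.
    R: dict a -> P(R(a)); S: dict (a, b) -> P(S(a, b))."""
    s_by_a = defaultdict(list)
    for (a, b), pr in S.items():
        s_by_a[a].append(pr)
    p_none = 1.0
    for a, pa in R.items():
        # eliminate y: P(exists y: S(a, y)) by independence over the b values
        p_some_s = 1.0
        for pr in s_by_a[a]:
            p_some_s *= (1.0 - pr)
        p_some_s = 1.0 - p_some_s
        # eliminate x: independence across the values of a
        p_none *= (1.0 - pa * p_some_s)
    return 1.0 - p_none
```

With P(R(1)) = P(S(1, b)) = 0.5, the answer is 1 − (1 − 0.25) = 0.25, computed in time linear in the database rather than exponential in it.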
We describe a framework for supporting arbitrarily complex SQL queries with 'uncertain' predicates. The query semantics is based on a probabilistic model, and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algorithm that can efficiently compute most queries. We show, however, that the data complexity of some queries is #P-complete, which implies that these queries do not admit any efficient evaluation method. For these queries we describe both an approximation algorithm and a Monte-Carlo simulation algorithm.
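The Monte-Carlo approach can be sketched as sampling possible worlds from the tuple-independent distribution and averaging the query's truth value (a generic sketch; the encoding of tuples and queries is an assumption):

```python
import random

def monte_carlo_prob(tuples, holds, n_samples, seed=0):
    """Estimate P(q) over a tuple-independent database by sampling worlds.
    tuples: dict tuple_id -> marginal probability of that tuple being present.
    holds: function from the set of present tuple_ids to bool (does q hold?)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # draw one possible world: include each tuple independently
        world = {t for t, p in tuples.items() if rng.random() < p}
        if holds(world):
            hits += 1
    return hits / n_samples
```

The estimate converges at the usual 1/sqrt(n) Monte-Carlo rate, which is why it is reserved for the #P-hard queries that the optimization algorithm cannot handle exactly.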
We perform a theoretical study of the following query-view security problem: given a view V to be published, does V logically disclose information about a confidential query S? The problem is motivated by the need to manage the risk of unintended information disclosure in today's world of universal data exchange. We present a novel information-theoretic standard for query-view security. This criterion can be used to provide a precise analysis of information disclosure for a host of data exchange scenarios, including multi-party collusion and the use of outside knowledge by an adversary trying to learn privileged facts about the database. We prove a number of theoretical results for deciding security according to this standard. We also generalize our security criterion to account for prior knowledge a user or adversary may possess, and introduce techniques for measuring the magnitude of partial disclosures. We believe these results can be a foundation for practical efforts to secure data exchange frameworks, and also illuminate a nice interaction between logic and probability theory.
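The criterion can be illustrated on a toy two-tuple domain with a uniform prior over database instances: a view is secure for a secret query exactly when every possible view answer leaves the secret's posterior probability equal to its prior. A minimal sketch (the uniform prior and subsets-of-domain instances are simplifying assumptions):

```python
from itertools import chain, combinations

def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def is_secure(domain, view, secret):
    """Query-view security under a uniform prior over all database instances
    (subsets of `domain`): the view is secure for the secret iff, for every
    possible view answer, P(secret | view answer) equals the prior P(secret)."""
    worlds = [frozenset(w) for w in powerset(domain)]
    prior = sum(1 for w in worlds if secret(w)) / len(worlds)
    by_answer = {}
    for w in worlds:
        by_answer.setdefault(view(w), []).append(w)
    for ws in by_answer.values():
        posterior = sum(1 for w in ws if secret(w)) / len(ws)
        if abs(posterior - prior) > 1e-12:
            return False
    return True
```

Publishing whether t1 is in the database discloses nothing about whether t2 is (the tuples are independent under the uniform prior), but it fully discloses the secret "is t1 present?".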
The paper describes several applications of information inequalities to problems in database theory. The problems discussed include: upper bounds on the size of a query's output, worst-case optimal join algorithms, the query domination problem, and the implication problem for approximate integrity constraints. The paper is self-contained: all required concepts and results from information inequalities are introduced here, gradually, and motivated by database problems.
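A representative example of such an application: the worst-case output bound for the triangle query Q(x, y, z) = R(x, y), S(y, z), T(z, x) follows from a Shearer-type entropy inequality:

```latex
% Let (X, Y, Z) be uniform over the output of Q, so that h(X, Y, Z) = \log |Q|.
% Shearer's inequality with the fractional edge cover \{XY, YZ, ZX\},
% each with weight 1/2, gives
h(X, Y, Z) \;\le\; \tfrac{1}{2}\bigl(h(X, Y) + h(Y, Z) + h(Z, X)\bigr)
           \;\le\; \tfrac{1}{2}\bigl(\log|R| + \log|S| + \log|T|\bigr),
% hence |Q| \le \sqrt{|R|\,|S|\,|T|}: the AGM bound for the triangle query.
```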
We propose quasi-stable coloring, an approximate version of stable coloring. Stable coloring, also called color refinement, is a well-studied technique in graph theory for classifying vertices, which can be used to build compact, lossless representations of graphs. However, its usefulness is limited due to its reliance on strict symmetries; real data compresses very poorly using color refinement. We propose the first, to our knowledge, approximate color refinement scheme, which we call quasi-stable coloring. By using approximation, we alleviate the need for strict symmetry and allow for a tradeoff between the degree of compression and the accuracy of the representation. We study three applications: Linear Programming, Max-Flow, and Betweenness Centrality, and provide theoretical evidence in each case that a quasi-stable coloring can lead to good approximations on the reduced graph. Next, we consider how to compute a maximal quasi-stable coloring: we prove that, in general, this problem is NP-hard, and propose a simple yet effective algorithm based on heuristics. Finally, we evaluate the quasi-stable coloring technique experimentally on several real graphs and applications, comparing with prior approximation techniques.
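A possible greedy sketch of the idea (an illustrative heuristic, not the paper's algorithm): repeatedly split any color class whose members disagree by more than a tolerance q on the number of neighbors they have in some class; q = 0 recovers exact color refinement, while larger q trades accuracy for fewer classes.

```python
def quasi_stable_coloring(adj, q, max_iter=100):
    """Refine a vertex coloring until every class is q-quasi-stable: within a
    class, per-class neighbor counts differ by at most q. adj: vertex -> list
    of neighbors. Returns vertex -> color id."""
    colors = {v: 0 for v in adj}
    for _ in range(max_iter):
        split_done = False
        classes = {}
        for v, c in colors.items():
            classes.setdefault(c, []).append(v)
        for members in classes.values():
            for target in classes:
                # how many neighbors each member has in the target class
                deg = {v: sum(1 for u in adj[v] if colors[u] == target)
                       for v in members}
                if max(deg.values()) - min(deg.values()) > q:
                    # split the class at the midpoint degree (simple heuristic)
                    cut = (max(deg.values()) + min(deg.values())) / 2
                    new_c = max(colors.values()) + 1
                    for v in members:
                        if deg[v] > cut:
                            colors[v] = new_c
                    split_done = True
                    break
            if split_done:
                break
        if not split_done:
            break
    return colors
```

On a star graph, q = 0 separates the center from the leaves (two classes), while q large enough to absorb the degree gap keeps everything in one class, illustrating the compression/accuracy tradeoff.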
On the Tractability of SHAP Explanations. Van den Broeck, Guy; Lykov, Anton; Schleich, Maximilian. The Journal of Artificial Intelligence Research, 01/2022, Volume 74. Journal article, peer-reviewed, open access.
SHAP explanations are a popular feature-attribution mechanism for explainable AI. They use game-theoretic notions to measure the influence of individual features on the prediction of a machine learning model. Despite a lot of recent interest from both academia and industry, it is not known whether SHAP explanations of common machine learning models can be computed efficiently. In this paper, we establish the complexity of computing the SHAP explanation in three important settings. First, we consider fully-factorized data distributions, and show that the complexity of computing the SHAP explanation is the same as the complexity of computing the expected value of the model. This fully-factorized setting is often used to simplify the SHAP computation, yet our results show that the computation can be intractable for commonly used models such as logistic regression. Going beyond fully-factorized distributions, we show that computing SHAP explanations is already intractable for a very simple setting: computing SHAP explanations of trivial classifiers over naive Bayes distributions. Finally, we show that even computing SHAP over the empirical distribution is #P-hard.
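For intuition, under a fully-factorized distribution the SHAP value of feature i averages the marginal change in the conditional expectation v(S) = E[f(X) | X_S = x_S] over all subsets S of the other features, which is exact but exponential-time. A brute-force sketch for toy models (the model and marginal encodings are assumptions for illustration):

```python
from itertools import combinations
from math import factorial

def shap_values(model, x, marginals):
    """Exact SHAP values by brute force under a fully-factorized (independent
    feature) distribution. model: function on complete assignments (dicts);
    x: the instance to explain; marginals: feature -> list of (value, prob).
    Exponential in the number of features, for illustration only."""
    feats = list(x)
    n = len(feats)

    def v(S):
        # v(S) = E[model(X) | X_S = x_S], expanding the free features
        free = [f for f in feats if f not in S]
        total = 0.0
        def rec(i, assign, p):
            nonlocal total
            if i == len(free):
                total += p * model({**{f: x[f] for f in S}, **assign})
                return
            for val, pv in marginals[free[i]]:
                rec(i + 1, {**assign, free[i]: val}, p * pv)
        rec(0, {}, 1.0)
        return total

    phi = {}
    for f in feats:
        others = [g for g in feats if g != f]
        val = 0.0
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                val += w * (v(set(S) | {f}) - v(set(S)))
        phi[f] = val
    return phi
```

For a linear model with independent features this recovers phi_i = w_i (x_i − E[X_i]), and the values always satisfy the efficiency property: they sum to f(x) − E[f(X)].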