For interactive data exploration, approximate query processing (AQP) is a useful approach that typically uses samples to provide timely responses to queries by trading off query accuracy. Existing AQP systems often materialize samples in memory and reuse them to speed up query processing. How to tune the samples according to the workload is one of the key problems in AQP. However, because data exploration workloads are too complex to be predicted accurately, existing sample tuning approaches cannot adapt well to changing workloads. To address this problem, this paper proposes RL-STuner, a deep reinforcement learning-based sample tuner. When tuning samples, RL-STuner considers workload changes from a global perspective and uses a Deep Q-learning Network (DQN) model to select an optimal sample set with maximum utility for the current workload. In addition, this paper proposes a set of optimization mechanisms to reduce the sample tuning cost. Experimental results on both real-world and synthetic datasets show that RL-STuner outperforms existing sample tuning approaches, achieving 1.6×-5.2× improvements in query accuracy at a low tuning cost.
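The abstract does not spell out RL-STuner's DQN model, so the following is only a hedged, minimal single-state Q-learning sketch of the underlying idea: reinforcement learning gradually converging on the sample set with the highest utility for the current workload. The candidate sample sets and their utility values are invented for the example.

```python
import random

# Hypothetical setup: choose among three candidate sample sets (actions)
# for a fixed workload (a single state, for simplicity). The observed
# reward is the sample set's utility; the values here are illustrative.
UTILITY = {"uniform": 0.4, "stratified": 0.9, "reservoir": 0.6}
ACTIONS = list(UTILITY)

q = {a: 0.0 for a in ACTIONS}    # Q-value per candidate sample set
alpha, epsilon = 0.5, 0.2        # learning rate, exploration rate

random.seed(0)
for _ in range(300):
    # epsilon-greedy action selection
    if random.random() < epsilon:
        a = random.choice(ACTIONS)
    else:
        a = max(q, key=q.get)
    reward = UTILITY[a]              # observed utility acts as the reward
    q[a] += alpha * (reward - q[a])  # one-state Q-learning update

best = max(q, key=q.get)
print(best)  # the learner settles on the highest-utility sample set
```

A real DQN replaces the table `q` with a neural network over workload features, but the feedback loop — act, observe utility, update the value estimate — is the same.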
The rise of the Big Data era calls for more efficient and effective data exploration and analysis tools. In this respect, the need to support advanced analytics on Big Data is driving data scientists' interest toward massively parallel distributed systems and software platforms, such as Map-Reduce and Spark, that make their scalable use possible. However, when complex data mining algorithms are required, fully scalable deployment on such platforms faces a number of technical challenges that grow with the complexity of the algorithms involved. Thus, algorithms that were originally designed to be sequential must often be redesigned to use distributed computational resources effectively. In this paper, we explore these problems and then propose a solution that has proven very effective on the complex hierarchical clustering algorithm CLUBS+. Through four stages of successive refinement, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a totally unsupervised fashion. Experimental results confirm the accuracy and scalability of CLUBS+ on platforms tailored for Big Data management.
General-purpose graphical interfaces for data exploration are typically based on manual visualization and interaction specifications. While manual specification can be very expressive, it demands high effort to make effective decisions, reducing exploratory speed. In contrast, principled automated designs can increase exploratory speed, decrease learning effort, help avoid ineffective decisions, and therefore better support data analytics novices. Toward these goals, we present Keshif, a new systematic design for tabular data exploration. To summarize a given dataset, Keshif aggregates records by value within attribute summaries and visualizes aggregate characteristics using a consistent design based on data types. To reveal data distribution details, Keshif features three complementary linked selections: highlighting, filtering, and comparison. Keshif further increases expressiveness through aggregate metrics, absolute/part-of scale modes, calculated attributes, and saved selections, all working in synchrony. Its automated design approach also simplifies authoring dashboards composed of summaries and individual records from raw data using fluid interaction. We show examples selected from 160+ datasets from diverse domains. Our study with novices shows that after exploring raw data for 15 minutes, participants reached close to 30 data insights on average, comparable to other studies in which skilled users used more complex tools.
A contiguous area cartogram is a geographic map in which the area of each region is proportional to numerical data (e.g., population size) while keeping neighboring regions connected. In this study, we investigated whether value-to-area legends (square symbols next to the values represented by the squares' areas) and grid lines aid map readers in making better area judgments. We conducted an experiment to determine the accuracy, speed, and confidence with which readers infer numerical data values for the mapped regions. We found that, when only informed about the total numerical value represented by the whole cartogram without any legend, the distribution of estimates for individual regions was centered near the true value with substantial spread. Legends with grid lines significantly reduced the spread but led to a tendency to underestimate the values. Comparing differences between regions or between cartograms revealed that legends and grid lines slowed the estimation without improving accuracy. However, participants were more likely to complete the tasks when legends and grid lines were present, particularly when the area units represented by these features could be interactively selected. We recommend considering the cartogram's use case and purpose before deciding whether to include grid lines or an interactive legend.
Summary
1. While teaching statistics to ecologists, the lead authors of this paper have noticed common statistical problems. If a random sample of their work (including scientific papers) produced before doing these courses were selected, half would probably contain violations of the underlying assumptions of the statistical techniques employed.
2. Some violations have little impact on the results or ecological conclusions; yet others increase type I or type II errors, potentially resulting in wrong ecological conclusions. Most of these violations can be avoided by applying better data exploration. These problems are especially troublesome in applied ecology, where management and policy decisions are often at stake.
3. Here, we provide a protocol for data exploration; discuss current tools to detect outliers, heterogeneity of variance, collinearity, dependence of observations, problems with interactions, double zeros in multivariate analysis, zero inflation in generalized linear modelling, and the correct type of relationships between dependent and independent variables; and provide advice on how to address these problems when they arise. We also address misconceptions about normality, and provide advice on data transformations.
4. Data exploration avoids type I and type II errors, among other problems, thereby reducing the chance of making wrong ecological conclusions and poor recommendations. It is therefore essential for good quality management and policy based on statistical analyses.
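The protocol above bundles several concrete checks. As a minimal illustration of one of them, outlier detection, the sketch below applies Tukey's 1.5×IQR rule; the data and the choice of rule are illustrative and not the authors' specific tools.

```python
import statistics

def iqr_outliers(xs):
    """Flag points outside 1.5*IQR of the quartiles (Tukey's rule),
    one common screen in the outlier-detection step of data exploration."""
    qs = statistics.quantiles(xs, n=4)   # [Q1, median, Q3]
    q1, q3 = qs[0], qs[2]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

counts = [2, 3, 3, 4, 4, 5, 5, 6, 48]  # e.g. species counts with one extreme site
print(iqr_outliers(counts))            # -> [48]
```

Whether such a point is an error to remove or a real observation requiring a different model (e.g. one robust to zero inflation or heterogeneity) is exactly the judgment the protocol is meant to support.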
Dataflow visualization systems enable flexible visual data exploration by allowing the user to construct a dataflow diagram that composes query and visualization modules to specify system functionality. However, learning to use dataflow diagrams presents overhead that often discourages users. In this work, we design FlowSense, a natural language interface for dataflow visualization systems that utilizes state-of-the-art natural language processing techniques to assist dataflow diagram construction. FlowSense employs a semantic parser with special utterance tagging and special utterance placeholders to generalize to different datasets and dataflow diagrams. It explicitly presents recognized dataset and diagram special utterances to the user for dataflow context awareness. With FlowSense, the user can expand and adjust dataflow diagrams more conveniently via plain English. We apply FlowSense to the VisFlow subset-flow visualization system to enhance its usability. We evaluate FlowSense through a case study with domain experts on a real-world data analysis problem and a formal user study.
Visualizing large-scale trajectory datasets is a core subroutine for many applications. However, rendering all trajectories can result in severe visual clutter and long visualization delays due to the large data volume. Naively sampling the trajectories reduces visualization time but usually harms visual quality, i.e., the generated visualizations may look substantially different from the exact ones produced without sampling. In this paper, we propose CheetahTraj, a principled sampling framework that achieves both high visualization quality and low visualization latency. We first define the visual quality function measuring the similarity between two visualizations, based on which we formulate the quality-optimal sampling problem (QOSP). To solve QOSP, we design Visual Quality Guaranteed Sampling algorithms, which reduce visual clutter while guaranteeing visual quality by considering both trajectory data distribution and human perception properties. We also develop a quad-tree-based index (InvQuad) that allows trajectory samples computed offline to be used for interactive online visualization. Extensive experiments, including case, user, and quantitative studies, are conducted on three real-world trajectory datasets, and the results show that CheetahTraj consistently provides higher visual quality and better efficiency than baseline methods. Compared with visualizing all trajectories, CheetahTraj reduces visualization latency by up to three orders of magnitude while avoiding visual clutter.
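The paper's actual visual quality function is not given in the abstract; the sketch below only illustrates the general idea that a sample can be "visually lossless" if its rendering covers the same screen area as the full rendering. The cell-occupancy Jaccard measure and the toy data are hypothetical.

```python
def occupancy(points, cell=1.0):
    """Set of grid cells a point set covers: a crude proxy for what a
    rendered visualization shows at a given resolution."""
    return {(int(x // cell), int(y // cell)) for x, y in points}

def visual_quality(full, sampled, cell=1.0):
    """Jaccard similarity of covered cells: 1.0 means the sampled
    rendering covers exactly the same cells as the full one."""
    a, b = occupancy(full, cell), occupancy(sampled, cell)
    return len(a & b) / len(a | b)

full = [(0.2, 0.3), (0.6, 0.7), (2.1, 2.2), (2.4, 2.8), (5.0, 5.1)]
sampled = [(0.2, 0.3), (2.1, 2.2), (5.0, 5.1)]  # one point per covered cell
print(visual_quality(full, sampled))  # -> 1.0
```

This also shows why naive sampling hurts quality: a sample that drops every point in some cell leaves that cell uncovered, making the two renderings visibly differ, which is what quality-guaranteed sampling is designed to prevent.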
In this paper, we examine the possibility of employing the idea of progressive-inductive (π) aggregation in the k-means algorithm. We base our work on the interactive visualization framework Skydive, a tightly coupled system that combines data aggregation and visual aggregation for smooth user interaction. It is built on an aggregate pyramid data structure (π-cube) that allows a visualization client to quickly filter the data. Skydive was designed for interactive data visualization in terms of expressiveness and efficiency; its underlying data structure therefore enables access to data at the desired levels of granularity. In this paper, we utilize this concept to speed up clustering, proposing π-means, a modified version of the k-means algorithm. This allows us to explore further possible applications of the π-cube for interactive data exploration.
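The π-cube structure itself is not reproduced here; the sketch below only illustrates the general idea of running k-means over pre-aggregated cells, where each cell carries a weight (its record count) so that clustering the aggregate approximates clustering the raw data at much lower cost. The function, data, and 1-D setting are all hypothetical simplifications.

```python
import random

def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """K-means over pre-aggregated 1-D cells: each point stands for a
    cell of the pyramid and carries the count of raw records in it."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assign each cell to its nearest center
        groups = {i: [] for i in range(k)}
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: abs(p - centers[c]))
            groups[j].append((p, w))
        # recompute each center as the weighted mean of its cells
        for i, cells in groups.items():
            total = sum(w for _, w in cells)
            if total:
                centers[i] = sum(p * w for p, w in cells) / total
    return sorted(centers)

# Two well-separated groups of aggregated cells
pts = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
wts = [5, 20, 5, 5, 20, 5]
print(weighted_kmeans(pts, wts, k=2))  # -> [2.0, 11.0]
```

Because only the cells are iterated over, the cost depends on the chosen granularity level of the aggregate rather than on the raw data size, which is what makes this attractive for interactive exploration.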
This open access book presents the results of three years of collaboration between earth scientists and data scientists in developing and applying data science methods for scientific discovery. The book will be highly beneficial to other researchers at the senior and graduate levels who are interested in applying visual data exploration, computational approaches, and scientific workflows.
In interactive data exploration, approximate query processing (AQP) can be used to quickly return query results at the cost of accuracy. For online AQP, the sampler can be treated as an operator in the query plan. During query optimization for AQP, heuristic rules are usually used to guide sampler push-down. However, due to the complexity and changes of data distribution, heuristic rule-based optimization methods cannot meet users' query accuracy requirements. In this paper, we propose a learning-based query optimization method for online AQP. We first introduce the concept of weak equivalence and propose a series of push-down rules to guide sampler push-down during query optimization. Then, to enable more queries to meet users' query accuracy requirements, we propose a deep learning model that further optimizes the query plan. By using this model during each push-down step, we try to avoid the negative effect of inappropriate sampler push-down on query accuracy, especially when the underlying and intermediate data distributions are inconsistent. Extensive experiments show that the proposed method outperforms the state-of-the-art online sampling-based AQP method by 1.2×-7.9× in query accuracy.
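The paper's push-down rules cannot be reconstructed from the abstract; the sketch below only illustrates the basic sampling-operator idea behind online AQP: keep each row with probability `rate` and scale the partial aggregate by `1/rate` to form an unbiased estimate of the exact answer. The function name and dataset are hypothetical.

```python
import random

def approx_sum(values, rate, seed=0):
    """A uniform Bernoulli sampler viewed as a query-plan operator:
    each row survives with probability `rate`, and the partial SUM is
    scaled by 1/rate, giving an unbiased estimate of the exact SUM."""
    rng = random.Random(seed)
    kept = [v for v in values if rng.random() < rate]
    return sum(kept) / rate

values = list(range(1, 1001))          # exact SUM = 500500
estimate = approx_sum(values, rate=0.1)
rel_err = abs(estimate - 500500) / 500500
print(f"estimate={estimate:.0f}, relative error={rel_err:.1%}")
```

Where this operator sits in the plan matters: pushing it below a selective join changes the distribution the aggregate sees, which is the kind of accuracy loss the paper's weak-equivalence rules and learned model are meant to avoid.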