Existing visualization recommendation systems commonly rely on a single snapshot of a dataset to suggest visualizations to users. However, exploratory data analysis involves a series of related interactions with a dataset over time rather than one‐off analytical steps. We present Solas, a tool that tracks the history of a user's data analysis, models their interest in each column, and uses this information to provide visualization recommendations, all within the user's native analytical environment. Recommending with analysis history improves visualizations in three primary ways: task‐specific visualizations use the provenance of data to provide sensible encodings for common analysis functions, aggregated history is used to rank visualizations by our model of a user's interest in each column, and column data types are inferred based on applied operations. We present a usage scenario and a user evaluation demonstrating how leveraging analysis history improves in situ visualization recommendations on real‐world analysis tasks.
Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that “bolts on” versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database “for free.” We develop and evaluate multiple data models for representing versioned data, as well as a lightweight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average 10³× faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to 20× relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by 10× on average.
Better tools are needed to enable researchers to quickly identify and explore effective and interpretable feature-based explanations for discriminating multi-class genomic datasets, e.g., healthy versus diseased samples. We develop an interactive exploration tool, GENVISAGE, which rapidly discovers the most discriminative feature pairs that separate two classes of genomic objects and then displays the corresponding visualizations. Since quickly finding top feature pairs is computationally challenging, especially for large numbers of objects and features, we propose a suite of optimizations to make GENVISAGE responsive at scale and demonstrate that our optimizations lead to a 400× speedup over competitive baselines for multiple biological datasets. We apply our rapid and interpretable tool to identify literature-supported pairs of genes whose transcriptomic responses significantly discriminate several chemotherapy drug treatments. With its generalizable optimizations and framework, GENVISAGE opens up real-time feature-based explanation generation to data from massive sequencing efforts, as well as many other scientific domains.
• Finding feature pairs that separate object classes in genomic datasets is important
• Our interactive GENVISAGE tool rapidly identifies and visualizes these feature pairs
• Several optimizations make GENVISAGE up to 400× faster than baseline approaches
• GENVISAGE finds supported gene pairs that discriminate between drug treatments
A fundamental task in the analysis of genomics datasets is identifying features that can explain the difference between two groups of biological samples. As studies and data repositories that enable simultaneous analysis of thousands of samples become widespread, it is imperative that feature identification tools return interpretable and significant results rapidly, allowing researchers to interactively generate and explore hypotheses on these massive datasets. Our tool, GENVISAGE, is built around a framework that identifies pairs of features that strongly separate samples of different classes. An extensive suite of optimization techniques enables us to extract literature-supported feature pairs with accompanying interpretable visualizations from exceptionally large genomic datasets in real time. The GENVISAGE optimizations and webserver instance provide a blueprint for future online tools providing interactive feature exploration in massive datasets from genomics and other domains.
Identifying features that most strongly separate samples from two biological classes is fundamental in the analysis of genomic datasets. This task is typically addressed by finding (1) single features using univariate statistical methods or (2) multi-feature combinations from time-intensive machine learning. Here we present GENVISAGE, a tool that enables researchers to interactively identify visually interpretable and significant feature pairs that separate the classes. With this highly optimized tool, researchers can instantaneously generate and explore hypotheses on massive genomic datasets.
Exploratory data analysis is a crucial part of data-driven scientific discovery. Yet discovering insights from visualizations can be a manual and painstaking process. This article discusses some of the lessons we learned from working with scientists in designing visual data exploration systems, along with design considerations for future tools.
Interactive Data Exploration with Smart Drill-Down
Joglekar, Manas; Garcia-Molina, Hector; Parameswaran, Aditya
IEEE Transactions on Knowledge and Data Engineering, 2019-01-01, Volume 31, Issue 1
Journal Article, Peer reviewed, Open access
We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize "interesting" groups of tuples. Each group of tuples is described by a rule. For instance, the rule (a, b, ⋆, 1000) tells us that there are 1,000 tuples with value a in the first column and b in the second column (and any value in the third column). Smart drill-down presents an analyst with a list of rules that together describe interesting aspects of the table. The analyst can tailor the definition of interesting, and can interactively apply smart drill-down on an existing rule to explore that part of the table. We demonstrate that the underlying optimization problems are NP-hard, and describe an algorithm for finding an approximately optimal list of rules to display when the user applies smart drill-down, as well as a dynamic sampling scheme for efficiently interacting with large tables. Finally, we perform experiments on real datasets with our experimental prototype to demonstrate the usefulness of smart drill-down and study the performance of our algorithms.
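The rule notation in this abstract can be illustrated with a minimal Python sketch. It only shows how a rule with a wildcard summarizes a group of tuples by its count; it is not the paper's algorithm for choosing which rules to display. The names `STAR`, `matches`, `rule_count`, and `as_rule`, and the toy table, are assumptions made for illustration.

```python
from typing import Sequence, Tuple

STAR = "*"  # hypothetical wildcard standing in for the star in the rule notation


def matches(row: Sequence[str], pattern: Sequence[str]) -> bool:
    """A row matches when every non-wildcard position of the pattern is equal."""
    return all(p == STAR or p == v for p, v in zip(pattern, row))


def rule_count(table: Sequence[Sequence[str]], pattern: Sequence[str]) -> int:
    """Number of tuples in the table covered by the pattern."""
    return sum(1 for row in table if matches(row, pattern))


def as_rule(table: Sequence[Sequence[str]], pattern: Sequence[str]) -> Tuple:
    """Append the count to the pattern, mirroring the (a, b, star, 1000) notation."""
    return (*pattern, rule_count(table, pattern))


# Toy three-column table (made-up data).
table = [
    ("a", "b", "x"),
    ("a", "b", "y"),
    ("a", "c", "x"),
    ("d", "b", "x"),
]

print(as_rule(table, ("a", "b", STAR)))  # ('a', 'b', '*', 2)
```

Drilling down on a rule would, under this sketch, amount to restricting the table to the rows that match the rule and computing more specific rules on that subset.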
Challenges in Data Crowdsourcing
Garcia-Molina, Hector; Joglekar, Manas; Marcus, Adam ...
IEEE Transactions on Knowledge and Data Engineering, 2016-04-01, Volume 28, Issue 4
Journal Article, Peer reviewed
Crowdsourcing refers to solving large problems by involving human workers that solve component sub-problems or tasks. In data crowdsourcing, the problem involves data acquisition, management, and analysis. In this paper, we provide an overview of data crowdsourcing, giving examples of problems that the authors have tackled, and presenting the key design steps involved in implementing a crowdsourced solution. We also discuss some of the open challenges that remain to be solved.
Visualization recommendation (VisRec) systems provide users with suggestions for potentially interesting and useful next steps during exploratory data analysis. These recommendations are typically organized into categories based on their analytical actions, i.e., operations employed to transition from the current exploration state to a recommended visualization. However, despite the emergence of a plethora of VisRec systems in recent work, the utility of the categories employed by these systems in analytical workflows has not been systematically investigated. Our article explores the efficacy of recommendation categories by formalizing a taxonomy of common categories and developing a system, Frontier, that implements these categories. Using Frontier, we evaluate workflow strategies adopted by users and how categories influence those strategies. Participants found recommendations that add attributes to enhance the current visualization and recommendations that filter to sub-populations to be comparatively most useful during data exploration. Our findings pave the way for next-generation VisRec systems that are adaptive and personalized via carefully chosen, effective recommendation categories.