I led this project under the mentorship of the Data Science group director at Nokia Bell Labs.
Researched, designed and developed a prototype for data exploration in an AutoML tool.
June 2022 - May 2023
Exploratory data analysis is a critical process in data science, as it lays the foundation for insights and models to be developed. However, this process can be time-consuming and cognitively overwhelming, with most data scientists spending more than half of their time studying datasets with 100+ columns.
While the past few years have seen a large rise in the number of MLOps and AutoML tools that look to automate the ML pipeline, data exploration continues to be a painstakingly time-consuming and manual process. Despite the increasing automation of other parts of the DS lifecycle, and the growing usage of AutoML tools such as Vertex AI and AzureML, this phase demands so much creativity that tools must be carefully designed to place control in the hands of the human. At what points can automation be introduced to augment human creativity and curiosity?
Understand how data scientists would explore an unfamiliar tabular dataset for a classification task, and their experience with existing AutoML tools.
My research goals for Round 1 were to understand how users would explore an unfamiliar tabular dataset for a classification task, and their experience with existing AutoML tools. To explore this, I conducted 45-minute semi-structured interviews with 6 data scientists within a research team.
Some users relied heavily on modeling to understand the data, and were looking for tools that would give them a stronger grasp of the data itself.
Users talked about feeling lost while studying large datasets, not knowing what to do with the data, and the danger of overlooking patterns.
Users talked about existing AutoML tools being a “no-brainer” and “black boxes”.
Concept 1: Quality Fingerprint
Understanding the quality of the data was crucial to users in the initial steps. The system could reduce the load placed on users by automatically scoring quality against commonly used criteria and visualizing its varying levels in a single view.
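To make this concrete, here is a minimal sketch of how such per-column quality scores might be computed. The criteria (completeness, uniqueness, validity), the outlier threshold, and the equal weighting are illustrative assumptions on my part, not the prototype's actual scoring.

```python
import numpy as np
import pandas as pd

def quality_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Score each column on a few common quality criteria (0 = poor, 1 = good)."""
    rows = []
    for col in df.columns:
        s = df[col]
        completeness = 1.0 - s.isna().mean()                  # share of non-missing values
        uniqueness = s.nunique(dropna=True) / max(len(s), 1)  # constant columns score low
        if pd.api.types.is_numeric_dtype(s):
            z = (s - s.mean()) / (s.std() or 1.0)             # z-scores; guard zero std
            validity = 1.0 - (z.abs() > 3).mean()             # share of extreme outliers
        else:
            validity = 1.0
        rows.append({"column": col, "completeness": completeness,
                     "uniqueness": uniqueness, "validity": validity})
    out = pd.DataFrame(rows).set_index("column")
    out["overall"] = out.mean(axis=1)  # one value per column to color the fingerprint
    return out
```

Coloring one cell per column by `overall` then yields the single-view fingerprint.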
Concept 2: Quality Fingerprint with columns
While the previous concept prioritized visualizing the entire dataset in a single view, this concept focused on giving users contextual information about the columns through column names.
It also looked to provide users with a more natural transition into the second level of granularity by substituting colors with values when zooming in.
Once users were familiar with the data, the goal of levels 2 & 3 was to pass control to the human and minimize system interventions.
Guiding users using automated insights
At this level, my goal was to help users navigate the table using automated insights. The insights were designed to highlight oddities in a dataset, the kind that tap into the curiosity of data scientists during exploration.
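As a hedged illustration of this level, the sketch below generates such rule-based insights; the rules, thresholds, and phrasings are hypothetical stand-ins for the prototype's actual insight engine.

```python
import pandas as pd

def automated_insights(df: pd.DataFrame, missing_threshold: float = 0.2,
                       skew_threshold: float = 2.0) -> list[str]:
    """Flag 'strange' columns with short, human-readable insights."""
    insights = []
    for col in df.columns:
        s = df[col]
        missing = s.isna().mean()
        if missing > missing_threshold:
            insights.append(f"'{col}' is missing {missing:.0%} of its values.")
        if s.nunique(dropna=True) == 1:
            insights.append(f"'{col}' holds a single constant value.")
        if pd.api.types.is_numeric_dtype(s) and abs(s.skew()) > skew_threshold:
            insights.append(f"'{col}' is heavily skewed (skew = {s.skew():.1f}).")
    return insights
```

Exposing the thresholds as parameters turned out to matter: as the usability tests described below showed, users wanted to tune them.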
Visualizing key relationships
Data scientists at the Bell Labs research group have created algorithms to extract relationships from the data. This section houses them together, conveying important statistical features through network visualization diagrams.
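Those relationship-extraction algorithms are internal to Bell Labs, so the sketch below substitutes plain Pearson correlations to show the general shape of such a network view; the threshold and the toy data are made up.

```python
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd

def correlation_network(df: pd.DataFrame, threshold: float = 0.5) -> nx.Graph:
    """Connect columns whose absolute pairwise correlation exceeds a threshold."""
    corr = df.corr(numeric_only=True)
    g = nx.Graph()
    g.add_nodes_from(corr.columns)
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if abs(corr.loc[a, b]) > threshold:
                g.add_edge(a, b, weight=abs(corr.loc[a, b]))
    return g

# Toy dataset: 'a' and 'b' are strongly related, 'c' is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"a": x, "b": x + rng.normal(scale=0.3, size=200),
                   "c": rng.normal(size=200)})
nx.draw_networkx(correlation_network(df), node_color="lightsteelblue")
plt.show()
```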
Automated interactive visualizations
The final level of granularity helps data scientists inspect columns individually by surfacing each column's metadata and the distribution of its values.
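A small sketch of such a column inspector, assuming ordinary pandas/matplotlib plumbing rather than the prototype's actual interactive implementation:

```python
import matplotlib.pyplot as plt
import pandas as pd

def inspect_column(df: pd.DataFrame, col: str) -> None:
    """Show a column's metadata, then plot its distribution of values."""
    s = df[col]
    print(f"dtype: {s.dtype} | missing: {s.isna().mean():.1%} | unique: {s.nunique()}")
    if pd.api.types.is_numeric_dtype(s):
        s.plot.hist(bins=30, title=col)                 # numeric: histogram
    else:
        s.value_counts().head(20).plot.bar(title=col)   # categorical: top-20 bar chart
    plt.show()
```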
Understand how quickly users were able to navigate through the dataset at different levels of detail and how well the system supported their data exploration workflow.
The tasks given to users were:
T1, Level 1: Gather an overview of the data using the visualization and insights in Level 1.
T2, Level 2: Use the automated insights in Level 2 to study specific columns.
T3, Level 2: Understand relationships between columns through automated visualizations.
T4, Level 2: Filter to generate automated insights for a subset of columns.
T5, Level 3: Study columns individually and filter values using the interactive visualization.
I also created journey maps for the key user scenarios.
Key Feedback
Considering the high cognitive load placed on users while studying tabular data, designing comprehensive visualizations to ease users into a dataset was one of my primary goals. To benchmark design concepts, I created a list of principles: 3 based on E. Tufte's visualization guidelines and 3 drawn from user feedback.
Initial concept: Encoding quality in an interactive heatmap
Final Concept: Connecting views using a brushing & linking interaction
An example of brushing & linking using D3.js
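The demo itself was built with D3.js; to keep these sketches in one language, here is a minimal Altair (Python) version of the same brushing & linking pattern, with made-up data and field names. Dragging a brush over the scatter plot filters the linked bar chart.

```python
import altair as alt
import numpy as np
import pandas as pd

# Hypothetical data standing in for two linked views of a dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=100), "y": rng.normal(size=100),
                   "group": rng.choice(["A", "B", "C"], size=100)})

brush = alt.selection_interval()  # the region the user drags out ("brushing")

points = (alt.Chart(df).mark_point()
          .encode(x="x:Q", y="y:Q",
                  color=alt.condition(brush, "group:N", alt.value("lightgray")))
          .add_params(brush))

bars = (alt.Chart(df).mark_bar()           # the brushed selection filters ("links")
        .encode(y="group:N", x="count()")  # this count of rows per group
        .transform_filter(brush))

(points & bars).save("linked_views.html")  # open in a browser to interact
```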
I held short 20-minute interview sessions with 2 visualization experts at Georgia Tech and 2 senior data scientists.
While I received positive feedback on the concepts, the visualization experts suggested using less space for the main visualization and thinking about how users would decipher the column names when 50+ data points are involved.
Having laid the foundation of the tool in the previous design phase, my goal in this phase was to further refine the system based on feedback from usability tests.
Users did not find the column-level visualizations in the previous design very helpful. Rather than viewing the distribution of individual columns, they would prefer to plot relationships between various columns.
Similarly, users mentioned they would prefer to see views that describe the dataset before assessing its quality.
A key finding from usability tests was that users were not willing to trust the visualization without seeing quantitative information.
Will this design change be enough for users to trust the overview visualization?
2 users voiced confusion at their first impression of the chart. However, once they spent a few minutes understanding the view, they were able to grasp the visualization.
3 out of the 4 users seemed skeptical about the visualization. One user even counted the dots in the visualization to make sure all columns were being represented correctly!
4/4 users in the usability tests immediately wanted to know more about how the insights were being calculated.
2/4 users in the usability tests wanted to be able to play around with the thresholds. This also aligned with a key finding from the exploratory interviews, where 2 users mentioned that their trust in the system tracked how much control they had.
4/4 data scientists in the usability tests responded positively to histogram visualizations of each column. Allowing them to preview each column on this screen to validate errors flagged by the system should make data scientists more comfortable with the scores.
Users in this round of usability tests followed a similar thought process while validating automated insights, and their trust in the system increased progressively across the 3 stages of validation. Unlike in the previous rounds of usability tests, users were comfortable with the insights generated by the system.
Unlike in the previous rounds of usability testing, where it took a while for users to grasp what the visualization was conveying, users in this round were immediately able to get an overview of the data. All users further interacted with the visualization by clicking on the bars to test their understanding.
Across previous usability tests, certain users were skeptical about the insights because they were unfamiliar with the data visualization. In this round, however, 4/4 users were comfortable with the bar-graph representation of the columns, and they also gave the system higher trust ratings.