
Accelerating data exploration through automation

Team & Role

I led this project under the mentorship of the Data Science group director at Nokia Bell Labs.

What I did

Researched, designed and developed a prototype for data exploration in an autoML tool.

What I delivered

  • High fidelity wireframes
  • User feedback from 10 user interviews, 13 usability tests
  • Market Research
  • 3 Journey maps
  • Framework built on React.js and R
  • Product Branding Guide

Project Duration

June 2022 - May 2023

Note: Brand naming and design were done as part of a class project.

Problem space

About 50-80% of a data scientist's time is spent making sense of large datasets

Exploratory data analysis is a critical process in data science, as it lays the foundation for insights and models to be developed. However, this process can be time-consuming and cognitively overwhelming, with most data scientists spending more than half of their time looking for interesting patterns in datasets with 100+ columns.

Proposed solution

The past few years have seen a large rise in DS/ML tools that look to automate the ML pipeline. However, these tools fail to accelerate data analysis because they rely on completely automating the end-to-end pipeline, leaving humans out of the loop.

The human and curiosity-driven nature of knowledge discovery makes it difficult to automate. This means that 70-80% of a data science project continues to be manual and cognitively overwhelming, leaving it up to the users to connect the dots in extremely large datasets.

Research Scientists at Bell Labs recognized this gap and integrated statistics and machine learning techniques in their autoML solution to augment human judgement during data exploration.
How can a tool support and augment the user while still ensuring they are in control?

Exploratory analysis can be overwhelming when faced with a large, unfamiliar dataset. How can its UX help ease users into a dataset, and tell a story?

People are skeptical when any automation is involved. At what points can we improve transparency to boost user trust?

Design challenge

However, the interface of the existing autoML tool limited the user and prevented them from taking control of the exploration process. How can I better integrate automated techniques to support knowledge discovery? 

Process

This was a long project, with 2 major design iterations across a period of one year.
Detailed Timeline

Solution

I designed a layered data exploration framework to reduce the cognitive load on users during analysis.

Problem space

About 50-80% of a data scientist's time is spent making sense of large datasets

Despite the increasing trend of automating different parts of the DS lifecycle, and the growing usage of autoML tools such as VertexAI and AzureML, data exploration continues to be a painstakingly time-consuming and manual process. Due to the amount of creativity required in this phase, tools must be carefully designed to keep control in the hands of the human. At what points can automation be introduced to augment human creativity and curiosity?

User Research I

Research goals

Understand how data scientists would explore an unfamiliar tabular dataset for a classification task and their experience with existing autoML tools.

To explore these questions, I conducted 45-minute semi-structured interviews with 6 data scientists within a research team.

Key Finding #1

Users had different perspectives on data exploration

Some users relied heavily on modeling to understand the data, and were looking for tools that would give them a stronger grasp of their datasets.


Key Finding #2

Users faced a high cognitive load while studying datasets

Users talked about feeling lost while studying large datasets, not knowing what to do with the data, and the danger of overlooking patterns.


Key Finding #3

Users felt a lack of control in existing autoML tools

Users talked about existing autoML tools being a “no-brainer” and “blackboxes”.

Finding #1

Users faced a high amount of cognitive load while studying data.

Design implication: Support users' exploratory workflows through better structure and guidance.
Finding #2

This led to some users brute-forcing models to better understand their data.

Design implication: Encourage users to be curious and explore by highlighting strange things in a dataset.
Finding #3

All users agreed that data quality was key to their sensemaking.

Design implication: Visualize the quality of a dataset to accelerate data analysis and sensemaking.
Finding #4

Users were reluctant to shift to newer tools.

Design implication: Reduce the learning curve involved by designing flows that feel natural and familiar to users.

Design Goals

Design: Cognitive Load

EDA is deeply layered and involves inspecting the data at different granularities. What insights can we automate to help accelerate this process? 
My solution was to build a user experience around zooming in and zooming out of a dataset.
I structured the data sensemaking process and insights into 3 levels of granularity.
My goal was to design a layout that allowed users to move between different granularities.

Concept 1: Quality Fingerprint

Understanding the quality of data was crucial to users in the initial steps. The system could reduce the load placed on users by automatically calculating quality based on commonly used criteria, and visualizing its varying levels in a single view.
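The exact quality criteria the system used aren't detailed in this case study, but a minimal sketch of the idea, assuming two common criteria (missing values and type mismatches) weighted equally, could look like this:

```javascript
// Sketch: score a column between 0 (poor) and 1 (clean).
// The criteria and equal weights are illustrative assumptions,
// not the actual rules used by the Bell Labs system.
function columnQuality(values) {
  const n = values.length;
  const present = values.filter((v) => v !== null && v !== undefined && v !== "");
  const missingRate = 1 - present.length / n;

  // Infer the majority type, then count values that disagree with it.
  const types = present.map((v) => (Number.isNaN(Number(v)) ? "string" : "number"));
  const numeric = types.filter((t) => t === "number").length;
  const majority = numeric >= present.length / 2 ? "number" : "string";
  const mismatchRate =
    present.length > 0 ? types.filter((t) => t !== majority).length / present.length : 0;

  return 1 - 0.5 * missingRate - 0.5 * mismatchRate;
}

// A mostly numeric column with one missing entry and one stray string:
console.log(columnQuality([3, 7, null, "12", "n/a", 5])); // ≈ 0.82
```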

Concept 2: Quality Fingerprint with columns

While the previous concept prioritized visualizing the entire dataset in a single view, this concept focused on giving users contextual information about the columns through column names.

It also looked to provide users with a more natural transition into the second level of granularity by substituting colors with values when zooming in.

The previous two concepts used a lot of vertical space that did not convey any information. While this seemed closer to how columns are typically viewed, it was important to prioritize efficiency over aesthetics considering the large number of columns in a dataset.
A heatmap would allow users to view column names while accommodating more columns in a single view.
Users talked about the dangers of overlooking patterns while studying data. My solution was to provide users with a fingerprint of their dataset.
To ease the initial cognitive load placed on users, I proposed using Level 1 of exploration to communicate global views of the dataset. By highlighting "strange things" in a dataset, these views would also serve as accessible entry points that encourage curiosity among users and guide deeper exploration.

As an example, I chose to visualize the quality of a dataset for my prototype. Green indicated columns with minimal errors, while red highlighted columns showing various types of errors.
I used an interactive heatmap to help users visualize the quality of their data.
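As a rough sketch of how such a heatmap could be built with D3.js (the visualization library referenced later in this project); the grid layout, color scale, and the { name, score } input shape are all assumptions, not the prototype's actual code:

```javascript
import * as d3 from "d3";

// Sketch: one cell per column, red (errors) through green (clean).
// `columns` is assumed to be an array of { name, score } with score in [0, 1].
function renderFingerprint(columns, cellSize = 24, perRow = 20) {
  const color = d3.scaleSequential(d3.interpolateRdYlGn).domain([0, 1]);

  const svg = d3
    .select("#fingerprint") // assumed container element
    .append("svg")
    .attr("width", perRow * cellSize)
    .attr("height", Math.ceil(columns.length / perRow) * cellSize);

  svg
    .selectAll("rect")
    .data(columns)
    .join("rect")
    .attr("x", (_, i) => (i % perRow) * cellSize)
    .attr("y", (_, i) => Math.floor(i / perRow) * cellSize)
    .attr("width", cellSize - 2)
    .attr("height", cellSize - 2)
    .attr("fill", (d) => color(d.score))
    .append("title") // hovering a cell reveals the column name
    .text((d) => d.name);
}
```

Because the fingerprint packs every column into one screen, a user can spot red regions at a glance before zooming in.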
Once users had a bird's-eye view of their dataset, the subsequent granularity levels shifted control from the system to the user.


At the second granularity level, my goal was to allow users to feel in control of the exploration process and further inspect various patterns.

Guiding users using automated insights

At this level, my goal was to help users navigate through the table using automated insights. The insights were designed to highlight strange things in a dataset that often tap into the curiosity of data scientists during exploration.

Visualizing key relationships

Data scientists at the Bell Labs research group have created algorithms to extract relationships from the data. This section houses them together, conveying important statistical features through network visualization diagrams.
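The relationship-extraction algorithms themselves are Bell Labs internals, so the sketch below covers only the presentation side: a D3.js force-directed layout, assuming nodes of shape { name } and links of shape { source, target, strength }:

```javascript
import * as d3 from "d3";

// Sketch: columns as nodes, extracted relationships as weighted links.
function renderRelationships(nodes, links, width = 600, height = 400) {
  const svg = d3.select("#relationships") // assumed container element
    .append("svg").attr("width", width).attr("height", height);

  const link = svg.selectAll("line").data(links).join("line")
    .attr("stroke", "#999")
    .attr("stroke-width", (d) => 4 * d.strength); // thicker = stronger relationship

  const node = svg.selectAll("circle").data(nodes).join("circle")
    .attr("r", 6).attr("fill", "#4a90d9");

  d3.forceSimulation(nodes)
    .force("link", d3.forceLink(links).id((d) => d.name).distance(80))
    .force("charge", d3.forceManyBody().strength(-150))
    .force("center", d3.forceCenter(width / 2, height / 2))
    .on("tick", () => {
      link.attr("x1", (d) => d.source.x).attr("y1", (d) => d.source.y)
          .attr("x2", (d) => d.target.x).attr("y2", (d) => d.target.y);
      node.attr("cx", (d) => d.x).attr("cy", (d) => d.y);
    });
}
```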

I designed minimal system interventions that guided the user's attention towards interesting columns.
The final granularity level helped users study individual columns in isolation.

User Research II

I tested my prototype with users and conducted another round of interviews to dive deeper into key research areas.

Research goals

Understand how quickly users were able to navigate through the dataset at different levels of detail and how well the system supported their data exploration workflow.


The tasks given to users were: 

T1, Level 1: Gather an overview of the data using the visualization and insights in Level 1.
T2, Level 2: Use the automated insights in Level 2 to study specific columns.
T3, Level 2: Understand relationships between columns through automated visualizations.
T4, Level 2: Filter to generate automated insights for a subset of columns.
T5, Level 3: Study columns individually and filter values using the interactive visualization.

Research goals

  1. Understand the influence of control on trust while working with automated tools
  2. Understand the points at which users would want to interact with the system
  3. Understand how cognitive load can be reduced while working with large datasets

Key Findings

Finding #1

Users appreciated the tool for providing useful starting points within datasets.

Finding #2

Users had difficulty trusting the automated insights.

80% of users wanted to see more quantitative evidence included in the automated insights.

Design implication: Integrate quantitative insights and eliminate qualitative language to boost user trust.
Finding #3

All users believed that their trust in the system overlapped with transparency & control.

Design implication: Find the right balance between control and automation.
Finding #4

Some users wanted to see a different first view of the dataset.

Design implication: Give users a more descriptive preview of the dataset before presenting information about its quality. Redesign the visualization to be scalable to 100+ columns.
Finding #5

Users saw the potential for more ways of narrowing down on columns.

While the filter feature in Level 2 was well received by all users, some found it unintuitive. Users seemed more comfortable filtering columns at the overview level.
Design implication: Integrate a filter feature at the overview level, to further reduce the cognitive load placed on users during analysis.
Finding #6

Studying column relationships enhances sensemaking more than analyzing individual values.

Users did not find Level 3 very helpful to their analysis.

Design implication: Redesign the final level to include more relationship exploration.


Key Feedback

Design goals

Creating journey maps helped me identify key patterns and interaction points during data analysis

The scenarios I created journey maps for are: 

  1. Manual analysis of data: Data scientist uses Python to build a machine learning model that predicts how capable each applicant is of paying back a loan. 
  2. Analysis of data using Dataiku: Data scientist uses Dataiku to build a model for revenue forecasting. Dataiku is a leading competitor in the space of human-centered autoML that I identified during my market research.
  3. Analysis of data using the augmented data science (ADS) tool developed by data scientists at Bell Labs: Data scientist uses ADS tool to build a machine learning model that predicts how capable each applicant is of paying back a loan.

Design II: Cognitive Load, Interactivity & Usability

While studying large amounts of data, some users preferred minimal visual cues while others wanted to visualize all columns without having to scroll. How could I redesign the overview screen to address these user needs?
I combined visualization principles with user goals to further evaluate concepts for a global view.

Considering the high amount of cognitive load placed on users while studying tabular data, designing comprehensive visualizations to ease users into a dataset was one of my primary goals. To benchmark design concepts, I created a list of six criteria: three principles based on E. Tufte's guidelines and three drawn from user feedback.

After creating a list of benchmarks, I realized that the existing concept could be further optimized to improve usage of color, space and interactions.

Initial concept: Encoding quality in an interactive heatmap

1. Encourage the eye to compare different pieces of data
2. Reveal the data at different levels of details
3. Maximize the data-ink ratio
4. Eliminate or reduce scrolling
5. Place higher importance on visualizing key distributions and column relationships
6. Allow users to filter, select, and drill down on subsets of columns
I iterated through different visualizations with the goal of maximizing the data-ink ratio:
The final concept allowed users to interact with a minimal overview visualization to open connected views using a brushing & linking interaction.


Final Concept: Connecting views using a brushing & linking interaction

The goal of the interaction was to allow users to intuitively scan through large datasets and visualize interesting data points in further granularity.

An example of brushing & linking using D3.js

By creating 3 responsive views, the interaction adds another layer of visualization before users dive deeper into specific columns.
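For reference, a minimal sketch of how a brushing & linking interaction is typically wired up in D3.js; `columns` and `updateDetailView` here are assumed stand-ins for the prototype's overview data and its linked view:

```javascript
import * as d3 from "d3";

const width = 800, height = 60;

// Overview strip: one slot per column along a shared x-scale.
const x = d3.scaleLinear().domain([0, columns.length]).range([0, width]);

const overview = d3.select("#overview") // assumed container element
  .append("svg").attr("width", width).attr("height", height);

const brush = d3.brushX()
  .extent([[0, 0], [width, height]])
  .on("brush end", ({ selection }) => {
    if (!selection) return; // brush was cleared
    const [x0, x1] = selection.map((px) => x.invert(px));
    // Linking: only the brushed columns appear in the connected detail view.
    updateDetailView(columns.slice(Math.floor(x0), Math.ceil(x1)));
  });

overview.append("g").call(brush);
```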

I sketched a storyboard and collected feedback from 4 users.

I held short, 20-minute interview sessions with 2 visualization experts at Georgia Tech and 2 senior data scientists.

While I received positive feedback on the concepts, the visualization experts suggested using less space for the main visualization and thinking about how users would decipher column names when 50+ data points are involved.

Based on user feedback, I designed my high-fidelity prototype to incorporate an edge case of 50+ columns.

Design II

I brainstormed design updates through the lens of control, trust and cognitive load.

Having laid down the foundation of the tool in the previous design phase, the goal of this design phase was to further refine the system based on feedback from usability tests and interviews.

Redesigning the layout to optimize it for interactivity, transparency and space:
Refining the granularity levels to better match user expectations:

Users did not find the column-level visualizations very helpful in the previous design. Rather than viewing the distribution of individual columns, they preferred to plot relationships between various columns.

Similarly, users mentioned they would prefer to see data that is more descriptive of the dataset before understanding its quality.

Allowing users alternative ways to explore data:

A key finding from usability tests was that users were not willing to trust the visualization without seeing quantitative information. Will this design change be enough for users to trust the overview visualization?


Redesigning the final level to help users explore relationships between columns:

During the usability tests I discovered that visualizing relationships between columns was more critical to users' sensemaking process than studying a single column in isolation. For the final granularity level, my goal was to allow users to visualize relationships between columns without having to write code.
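As a sketch of what this could look like under the hood, assume the user picks two columns from the interface and the system renders the plot for them; the function and input shapes below are illustrative, not the prototype's actual implementation:

```javascript
import * as d3 from "d3";

// Sketch: scatter plot of two user-selected columns, with no code
// required from the user. `data` is assumed to be an array of row objects.
function plotRelationship(data, colX, colY, size = 400) {
  const svg = d3.select("#plot").html("") // assumed container; clear any previous plot
    .append("svg").attr("width", size).attr("height", size);

  const x = d3.scaleLinear().domain(d3.extent(data, (d) => d[colX])).range([40, size - 10]);
  const y = d3.scaleLinear().domain(d3.extent(data, (d) => d[colY])).range([size - 30, 10]);

  svg.append("g").attr("transform", `translate(0,${size - 30})`).call(d3.axisBottom(x));
  svg.append("g").attr("transform", "translate(40,0)").call(d3.axisLeft(y));

  svg.selectAll("circle").data(data).join("circle")
    .attr("cx", (d) => x(d[colX]))
    .attr("cy", (d) => y(d[colY]))
    .attr("r", 3);
}

// e.g. plotRelationship(rows, "income", "loan_amount"); // hypothetical column names
```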

Testing

Research goals

  1. Evaluate user trust in system insights
  2. Identify the points at which users demand more control
  3. Understand how users explore the dataset with the visualization
I replicated flows from the journey maps to evaluate for control, trust and cognitive load.
T1: Preview the data
T2: Explore the layout of the system
T3: Analyze the quality of the columns
T4: Dive deeper into the poorer quality columns
T5: Create a subset of the columns
T6: Review automated insights
T7: Handle missing values in a column
T8: Review changes made by the system

Key Findings:

While users' trust levels increased, they were still hesitant to completely trust the automated aggregate.

50% of the users found the overview visualization ambiguous.

2 users voiced confusion on their first impression of the chart. However, once they spent a few minutes understanding the view, they were able to grasp the visualization.

Users' comfort levels with the visualization possibly impacted their trust.

3 out of the 4 users seemed skeptical about the visualization. One user even counted the dots in the visualization to make sure all columns were correctly being represented!

Design III: Control & Trust

I updated the overview visualization to be more familiar to users.
Building upon user feedback, I designed a 3-step flow for validating system-generated insights.


Step 1: Explainability

How is the system generating a score? 4/4 users in the usability tests immediately wanted to know more about how the insights were being calculated.

Step 2: Verification

4/4 data scientists in the usability tests responded positively to histogram visualizations of each column. Allowing them to preview each column on this screen to validate errors noticed by the system should make data scientists more comfortable with the scores.

Step 3: Experimentation

Can I control the way in which the system generates insights? 2/4 users in the usability tests wanted to be able to play around with the thresholds. This also aligned with a key finding from the exploratory interviews, where 2/4 users mentioned that their trust in the system overlapped with how much control they had.
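Technically, the Experimentation step boils down to re-scoring columns whenever the user moves a threshold control. A sketch of that loop, where the 20% default, the single missing-value criterion, and the `columns` / `redrawOverview` hooks are all assumptions for illustration:

```javascript
// Sketch: flag columns against a user-adjustable threshold.
function scoreWithThreshold(column, maxMissingPct = 20) {
  const missing = column.values.filter((v) => v == null).length;
  const missingPct = (100 * missing) / column.values.length;
  return { name: column.name, missingPct, flagged: missingPct > maxMissingPct };
}

// Re-run the scoring whenever the user drags the threshold slider,
// keeping them in control of how insights are generated.
document.querySelector("#missing-threshold").addEventListener("input", (event) => {
  const results = columns.map((c) => scoreWithThreshold(c, Number(event.target.value)));
  redrawOverview(results); // assumed hook that repaints the overview visualization
});
```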

Testing

Research goals

  1. Re-evaluate user trust in system insights
  2. Evaluate whether users were satisfied with the amount of control
  3. Evaluate how users would prefer to interact with the visualization for sensemaking
Trust scores in automated insights increased by 31%.

Users in this round of usability tests followed a similar thought process while validating automated insights. Their trust in the system increased progressively across the 3 stages of validation. Unlike in the previous rounds of usability tests, users were comfortable with the insights generated by the system.

All users were comfortable with the overview visualization.

Unlike in the previous rounds of usability testing, where it took a while for users to grasp what the visualization was conveying, users in this round were immediately able to get an overview of the data. All users further interacted with the visualization by clicking on the bars to test their understanding.

There could be a possible overlap between visualization type and user trust.

In previous usability tests, certain users were skeptical about the insights because they were unfamiliar with the data visualization. However, 4/4 users in this round were comfortable with columns represented as bar graphs and, coincidentally, had higher trust ratings in the system.

Final Prototype

I designed a UX to ease users into a tabular dataset using the martini-glass storytelling approach.
  • An information architecture that lets users explore data at different granularities, with relevant automated insights at each level.
  • Overview visualizations that give users potential entry points into large, complex datasets, and interactions that help them narrow down on interesting columns.
  • Transparent and explainable automated insights that help users understand system actions under the hood.
  • 3 stages of verification to boost user trust in automation.
  • A final layer that lets users conduct their own exploration and generate visualizations for starred columns.

Branding

Reflection

What I learned from this experience

Out of the various stages in an ML pipeline, accelerating exploratory data analysis continues to be a challenging problem space. This is largely due to the numerous ways in which a data scientist can approach a problem and choose to conduct their analysis. Many modern tools have tried to automate this analysis and, in the process, left humans out of the loop.

At the end of this project, I propose a human-driven framework that presents automated insights dynamically as the user drills down into a dataset. The structured organization of the system allows users to ease into a dataset without getting overwhelmed by the large number of columns. By designing overview visualizations that give users quick ideas about the dataset, the proposed framework gives direction to their exploration and provides various entry points into the data.

There remains more work to be done before modern automation-based tools can be introduced into the workflows of experts. Throughout my research, I learned about the skepticism users feel when relying on results from processes they did not design. Simply being transparent about the methods used was not enough to gain their trust. As evidenced by the 3-step flow I designed, systems must accommodate multiple ways for users to investigate the accuracy of insights.