Background

Observability tools act as safety nets for engineers, providing essential support for maintaining system stability and resolving issues swiftly. These tools play a critical role in incident management, a process that includes monitoring, alerting, triaging, investigating, and remediating.

My team specifically focuses on the investigating stage, where effective analysis helps engineers pinpoint the root cause of problems and ensures quick, efficient incident resolution.

The stages of incident management. Our team focused on the investigating stage.

Problems

🧶 Fragmented Correlation Workflows

Our product's multiple correlation tools, including RCA and Change Intelligence, lacked a unified workflow, creating a fragmented user experience.

🚧 Incomplete Product Features

Each correlation tool supported different telemetry types, creating limitations for users trying to investigate issues comprehensively. This inconsistency restricted users' ability to leverage correlations effectively across various data sets.

😵 Unintuitive UI

User feedback indicated that the UI, particularly for Change Intelligence, was complex and difficult to navigate. Tight deadlines and technical constraints contributed to a steep learning curve, reducing the feature’s overall value.

🕰️ The history of our correlations feature at a glance

  • ~2017: Tracing correlations feature “RCA” and the correlation panel
    Our product started as a tracing-only product, so our first correlation feature was tracing-only as well: it analyzed the rate of traces.
  • 2021: Metrics correlations feature “Change Intelligence”
    Later on, we added metrics and launched our metrics correlations feature, Change Intelligence, which bridges metric data with trace rate data to analyze metric spikes.

Goal: A unified correlations feature with a facelift

To streamline our workflows and make our correlations feature more usable and useful, we needed to enhance and upgrade our latest feature, Change Intelligence, while phasing out the older correlation features, RCA and the correlation panel.

From this 😬

Before: Three correlations features scattered in different places of the app.

To this

After: A single correlations feature across the app.

Challenges

  • The abundance of ambitious ideas made it challenging to establish a clear project path.
  • Existing technical limitations added complexity to our plans.
  • The success of the project heavily relied on the quality of customer data.
  • Continually balancing diverse customer expectations was a persistent challenge throughout the project.

Research & Findings

To uncover the key pain points and opportunities for improvement, I conducted extensive research through user interviews, session analysis, competitive benchmarking, and collaborative discussions with stakeholders.

1. Current Feature Testing and Analysis

To understand how users interacted with the existing correlation features, I reviewed approximately 200 FullStory sessions and conducted interviews with 12 users (4 internal and 8 external) who regularly used the product.

User feedback we got for Change Intelligence
User feedback we got for Root Cause Analysis

💡 Findings

  • Quick access to traces was the most critical need for users when investigating issues and validating assumptions, but the product did not make this process intuitive or efficient.
  • Numbers were presented without sufficient context or explanation, making it challenging for users to interpret critical details.
  • The design of the correlations feature failed to clearly communicate its functionality, leading to confusion and rendering it ineffective for many users.
  • Tables lacked clarity and organization, failing to provide concise information or indicate trends effectively.
  • The product relied heavily on charts for correlation validation, but their usability and integration into workflows needed significant improvement.

2. Competitive Analysis

To better understand where our product stood in the market, I analyzed Honeycomb’s BubbleUp feature, which users frequently compared to ours, with input from our GTM experts.

Workflow breakdown of the two features

💡 Findings

  • While our product offered more accurate and data-rich results, the UI failed to present this advantage in a user-friendly and intuitive way, giving BubbleUp an edge in usability.

3. Internal Alignment Working Sessions

To ensure our research findings translated into actionable strategies, I facilitated collaborative sessions with stakeholders, engineers, and designers to prioritize our goals and roadmap.

💡 Findings

  • With a wealth of data and insights generated by our backend, it became critical to design an experience that presents just the right information to users at the right time, without overwhelming them. Effectively utilizing this backend data and research findings was essential to creating a meaningful and actionable user experience.

To understand how research findings influenced design decisions, please continue reading the motivations in the iterative implementation section ⬇️.

Iterative implementation

Given the magnitude and complexity of the problem, we planned out four milestones. These milestones were designed to solve the problem incrementally, giving each part thorough consideration. This kept us organized and on track toward our main goal.

We broke down the work into four milestones, including a listening phase.

M1: Correlations for all telemetry types

M2: More intuitive correlation view

M3: In-context Change Intelligence

M4: Adding latency and error correlations

Listening phase and final iteration

Design highlights

Correlations results view

  • Time series charts replaced tables based on research showing users understand trends fastest through this visualization format
  • Direct trace access added to charts, providing immediate evidence for investigations
  • Flattened hierarchy improves scannability and digestibility
  • Streamlined filtering prevents initial overwhelm while maintaining analysis power
Before
After

Result

Following implementation and enhancement of the feature, daily active users increased by 130%. User engagement metrics showed higher average session duration as people explored the tool's expanded capabilities. Success stories from users confirmed the tool effectively met their specific needs, validating its improved functionality and user experience.

Lessons learned

🖊️ UI copy matters (a lot)

Initially, we used “What caused this change?” as the CTA, but it misled users into expecting answers rather than clues, eroding trust. In the second iteration, we switched to “Analyze deviation,” which was less misleading but still unclear. Finally, we refined the CTA to “View correlations,” which clarified expectations and received positive feedback from both users and our go-to-market team, effectively addressing the issue.

👯 Build a good relationship with your GTM (go-to-market) team and lean on their expertise

Our GTM team was instrumental in shaping our correlation feature, gathering customer feedback, and strategizing its promotion. Their insights and collaboration provided valuable perspectives that led to a better product for our customers.

Data quality is also a part of the UX

A key challenge was designing for poor data quality, as our correlation feature depends on well-formatted data, which most customers lacked, leading to inaccurate results. I advocated against quick fixes like tooltips and pushed for refining our algorithm to handle varied data configurations. This approach, though requiring more effort, has been embraced and will make the product more adaptable and user-friendly in the long term.

Future: auto-detection and analysis

As of now, the feature still requires manual input: users need to click on a spike to launch it. In the future, we aim to have this feature always running in the background. This would allow us to identify abnormal spikes and correlations for our users, reducing their MTTR (Mean Time To Recover) and making investigations more effortless.
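
To make the idea of background spike detection concrete, here is a minimal sketch. It is not how the product works or will work: the rolling z-score approach, the function name find_spikes, the window and threshold parameters, and the sample latency values are all assumptions chosen purely for illustration.

```python
from statistics import mean, stdev


def find_spikes(values, window=5, threshold=3.0):
    """Return indices of points that sit far above their trailing baseline.

    Hypothetical helper: a point is flagged when it exceeds the mean of the
    previous `window` points by more than `threshold` standard deviations.
    """
    spikes = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: flag any value above it.
            if values[i] > mu:
                spikes.append(i)
            continue
        if (values[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Made-up latency samples (ms) with one obvious spike at index 6.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 99]
print(find_spikes(latency_ms))  # prints [6]
```

In a setup like this, each flagged index would stand in for the click a user makes today: the spike becomes the starting point for a correlation analysis without anyone having to spot it first.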