Background

πŸ› οΈ Observability tools

Observability tools are like a safety net for engineers: they offer peace of mind when everything's running smoothly and a jump start on fixing things when it's not. Engineers might not be glued to these tools all day, but they're essential for keeping systems and apps in a steady state.

πŸ”Ž Incident management

The process of incident management can be distilled into five key stages: monitoring, alerting, triaging, investigating, and remediating. Our team zeroed in on the investigating stage. That's when engineers are juggling loads of browser tabs, diving into the details, comparing charts, and unraveling the root cause of the problem.

The stages of incident management. Our team focused on the investigating stage.

Problems

🧢 Inconsistent workflows

Over the last two years, we've seen our product evolve toward a "kitchen sink": a variety of features, but no coherent workflow. To rectify this, we had to phase out outdated features and grow newer ones to unify workflows without compromising the product experience.

🀼 Three correlations features

Our product began as a tracing-only tool, with our correlations features, RCA and correlation panel, designed to analyze trace rates. We later added metrics and launched a metric correlation feature called Change Intelligence. This feature combines metric data with trace rate data to analyze metric spikes. However, this addition increased the product's learning curve as users now had to understand three correlation features.

😡 Unintuitive UI

Over the years, we have received direct feedback from our users that our UI for correlation features, including the latest one, Change Intelligence, was unintuitive and required a steep learning curve. It was built under many assumptions and technical constraints, with very limited time. Users struggled to navigate the feature and understand its functionality, thereby diminishing its value.

πŸ•°οΈ The history of our correlations feature at a glance

  • ~2017: Tracing correlations feature β€œRCA” and the correlation panel
    Our product started as a tracing-only tool, so our first correlation feature was tracing-only as well, analyzing the rate of traces.
  • 2021: Metrics correlations feature β€œChange Intelligence”
    Later, we added metrics and launched our metrics correlations feature, Change Intelligence, which bridges metric data with trace rate data to analyze metric spikes.

Solution: A unified correlations feature with a facelift

To streamline our workflows and increase the usability and usefulness of our correlations feature, we needed to enhance and upgrade our latest feature, Change Intelligence, while phasing out the older correlation features, RCA and the correlation panel.

From this 😬

Before: Three correlations features scattered in different places of the app.

To this ☺️

After: A single correlations feature across the app.

Challenges

  • The abundance of ambitious ideas made it challenging to establish a clear project path.
  • Existing technical limitations added complexity to our plans.
  • The success of the project heavily relied on the quality of customer data.
  • Continually balancing diverse customer expectations was a persistent challenge throughout the project.

Research

  • Throughout the project, I interviewed power users and newly onboarded users to identify issues, validate assumptions, and evaluate designs.
  • I led working sessions with stakeholders, the engineering team, and designers to align, synthesize, and co-design.
  • I also conducted competitive analysis to understand industry standards and identify potential opportunities for improvement.
  • I regularly reviewed FullStory sessions to understand user behavior, identify usage improvements, and detect usability problems.

To understand how research findings influenced design decisions, please continue reading the motivations in the iterative implementation section ⬇️ .

Iterative implementation

Given the magnitude and complexity of the problem, we planned four milestones designed to address the problem incrementally, ensuring each part got thorough consideration and attention. This kept us organized and on track toward our main goal.

We broke down the work into four milestones, including a listening phase.

M1: Correlations for all telemetry types

🧐 Motivation

  • Eliminate the additional learning curve: Before, users had to be on a specific metric chart to access the correlation features, which often led to unnecessary overhead and frustration.
  • Improve feature discoverability: The original features were hidden behind a click on a chart, which compromised discoverability and daily usage.

πŸ‘©πŸ»β€πŸ’» User outcome

  • Users can always access the same correlations feature on both metric and trace charts anywhere in the product.
  • Users can easily discover and understand the feature on their own.

πŸ“ Scope

  • Added correlations for trace charts for a more consistent user experience.
  • Introduced clearer Calls-to-Action (CTA) for better discovery and clarity.
  • Revamped language and layout to reduce confusion and improve overall clarity.

πŸ“Š Result

  • Saw a significant increase in daily usage and user awareness of the feature.

M2: More intuitive correlation view

🧐 Motivation

  • Increase the user-friendliness of the page: During discovery, we learned that while the correlations we provide can be very helpful, the technical complexity and data-heavy presentation made it hard for users to understand the why and the how.

πŸ‘©πŸ»β€πŸ’» User outcome

  • Users can easily grasp the underlying concepts and get value out of the feature regardless of their experience level.

πŸ“ Scope

  • Redesigned the correlation view to incorporate newer query capabilities that better explain correlations to users.
  • Introduced intuitive visualizations and interactive elements to simplify data presentation.
  • Added clear and concise explanations and examples throughout the page.

πŸ“Š Result

  • Received positive feedback on the improvement in overall user-friendliness.

M3: In-context Change Intelligence

🧐 Motivation

  • Minimize disruption and lower commitment: Previously, users were taken to a different page to see the insights we provide, which could feel cumbersome and disruptive. By watching real user sessions in FullStory and talking to users, we learned that some hesitated to use the feature because they didn't want to commit to a different page and lose the context of their work.
  • Make query iteration easier: We know that digging into issues isn't a straight line; it often means going back and tweaking your queries a few times to really get to the bottom of things. Our aim with in-context correlations was to make this a lot smoother by putting useful insights right where you're working, so you can grab them easily and keep going with your investigation without missing a beat.

πŸ‘©πŸ»β€πŸ’» User outcome

  • Users can launch the feature and advance their investigation without disrupting their current workflow.

πŸ“ Scope

  • Implemented an in-context side panel, offering a low-commitment avenue for users to derive value without initiating a new workflow.
  • Implemented the capability to easily update queries using correlations.

πŸ“Š Result

  • Saw an increase in the average session length, indicating that users are spending more time utilizing the feature and gradually building trust.

M4: Adding latency and error correlations

🧐 Motivation

  • More accurate correlations that describe the complete picture: Our correlations feature, which analyzes patterns in the rate of traces, provides very helpful insights. However, it can sometimes paint an incomplete picture: the rate might look normal while latency or the error rate is skyrocketing. Looking at all three signals together reveals the complete picture.

πŸ‘©πŸ»β€πŸ’» User outcome

  • Users can easily see rate, latency, and error correlations to understand the entirety of the situation.

πŸ“ Scope

  • Added the ability to analyze latency and error data.

πŸ“Š Result

  • Received positive feedback on the improved accuracy of the correlation results.

Listening phase and final iteration

After the multi-milestone project wrapped up, we took some time to gather feedback from both internal and external users. We then did a final round of feedback synthesis and updates, including revising some copy, rearranging some charts, and fixing minor bugs.

Result

This round of improvements not only boosted discoverability but also delivered a more user-friendly and comprehensive experience that works across different telemetry. Beyond usage data, these enhancements played a pivotal role in securing new contracts and upselling our product, establishing it as a distinctive offering in the market.

Lessons learned

πŸ–ŠοΈ UI copy matters (a lot)

In the initial version of this feature, we wanted to engage users by using the question "What caused this change?" as the UI copy for the call-to-action (CTA) button. However, this approach unintentionally led users to believe we possessed the answer, resulting in an expectation gap, as we were, in fact, providing clues. This gap eroded trust with our users. Therefore, in the second iteration, we shifted to the more action-driven phrase "analyze deviation" as the CTA. While users found it less misleading, there remained a lack of clarity regarding what they should expect. In the final iteration, we refined the CTA copy to "View correlations," a modification that not only enhanced clarity but also elicited positive reactions from both our go-to-market team and users. This adjustment more accurately aligned user expectations with the functionality of the feature, addressing the initial challenge effectively.

πŸ‘― Build a good relationship with your GTM (go-to-market) team and lean on their expertise

Our GTM team played a crucial role in developing our correlation feature and enhancing our workflows. They gathered customer feedback and devised strategies to showcase and sell our product, amassing considerable knowledge and opinions about the feature's future direction. By building a strong relationship with them and seeking their insights, we acquired new perspectives. These insights ultimately led to the design of a better product for our customers.

πŸ† Data quality is also a part of the UX

A significant challenge in this project was designing around poor data quality. Our correlation feature relies heavily on the quality of our customers' data. The rule is straightforward: if you send your data in the exact format we request, you'll get accurate results with ease. However, it was rare for our customers to have perfectly formatted data, leading to inaccurate and sometimes misleading results. Throughout this project, I strongly advocated against merely applying quick fixes such as adding tooltips or complicating the user interface. Instead, I suggested we invest in refining our algorithm to adapt to various data configurations. While this requires more effort, I'm pleased that the organization has recognized the importance of investing in the underlying technology. This will make our product much more adaptable and user-friendly in the long term.

Future: auto-detection and analysis

As of now, the feature still requires manual input: users need to click on a spike to launch it. In the future, we aim to have this feature always running in the background. This would allow us to surface abnormal spikes and correlations for our users proactively, reducing their MTTR (Mean Time To Recovery) and making the process more effortless.