Observability tools are like a safety net for engineers. They give engineers peace of mind when everything's running smoothly, and a jump start on fixing things when it's not. Engineers might not be glued to these tools all day, but they're super important for keeping systems and apps in a steady state.
The process of incident management can be distilled into five key stages: monitoring, alerting, triaging, investigating, and remediating. Our team zeroes in on the investigation stage. That's when engineers are busy with loads of browser tabs, diving into the details, comparing charts, and unraveling the root cause of the problem.
Over the last two years, we've seen our product evolve towards a 'kitchen sink', with a variety of features but no coherent workflow. To rectify this, we have to phase out outdated features and grow newer ones to unify workflows without compromising the product experience.
Our product began as a tracing-only tool with two correlation features, RCA and the correlation panel, both designed to analyze trace rates. We later added metrics and launched a metric correlation feature called Change Intelligence, which combines metric data with trace rate data to analyze metric spikes. However, this addition steepened the product's learning curve, as users now had to understand three correlation features.
Over the years, we have received direct feedback from our users that the UI for our correlation features, including the latest one, Change Intelligence, was unintuitive and had a steep learning curve. Change Intelligence was built under many assumptions and technical constraints, with very limited time. Users struggled to navigate the feature and understand what it does, which diminished its value.
To streamline our workflows and make our correlation features more usable and useful, we need to enhance and upgrade our latest feature, Change Intelligence, while phasing out the older correlation features, RCA and the correlation panel.
To understand how research findings influenced design decisions, please continue reading the motivations in the iterative implementation section below.
Given the magnitude and complexity of the problem, we planned out four milestones. The milestones were designed to solve the problem incrementally, giving each part thorough consideration and attention. This kept us organized and on track towards our main goal.
After the multi-milestone project was done, we took some time to let the feedback from both our internal and external users bake. We then did a final round of feedback synthesis and updates, including updating some copy, rearranging some charts, and fixing minor bugs.
This round of improvements not only boosted discoverability but also delivered a more user-friendly and comprehensive experience that works across different types of telemetry. Beyond usage data, these enhancements played a pivotal role in securing new contracts and upselling our product, establishing it as a distinctive offering in the market.