How the Jira design team quantified the impact of our design changes at scale and how we built confidence in our designs through task-based usability testing.
As the Atlassian design team matured in recent years, there was a view that we needed to get better at measuring and understanding the impact of our design changes. We did a reasonable job at measuring the success of individual features we shipped, but we struggled to measure the impact that the design team was having on Jira at a macro level.
Across the company we were capturing Net Promoter Score (NPS), which served as a useful indication of customer loyalty and referrals, but it was also problematic for design. The design team was analysing NPS comments to assess the system-wide usability of a product, using a metric that was really only designed to measure customer loyalty. While some of the NPS comments were useful to us, the way NPS is calculated is problematic (as others have pointed out) when used to assess the user experience: both a 0/10 and a 6/10 response count as a detractor, and treating those two scores as equivalent makes little sense when testing an experience.
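To make the 0/10 versus 6/10 problem concrete, here is a minimal sketch of the standard NPS calculation (percentage of promoters scoring 9 to 10, minus percentage of detractors scoring 0 to 6). The function name and the sample ratings are illustrative assumptions, not data from Jira.

```python
def nps(scores):
    """Net Promoter Score from 0-10 ratings.

    Promoters score 9-10, detractors score 0-6, passives (7-8) only
    count towards the total number of responses.
    """
    if not scores:
        raise ValueError("need at least one score")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical samples: the only difference is 0s versus 6s.
harsh = [0, 0, 0, 9, 10]
mild = [6, 6, 6, 9, 10]

print(nps(harsh), nps(mild))  # both samples produce the same score
```

Because every rating from 0 to 6 lands in the same bucket, the two samples are indistinguishable once reduced to a single NPS figure, even though they describe very different experiences.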
We needed a better way to measure the impact of design at a systems level. Prompted by the introduction of massive changes to the visual design language of Jira, we started investigating how best to assess the impact of these changes on perceived usability. We looked at the System Usability Scale (SUS) and the Usability Metric for User Experience (UMUX). Ultimately, we needed to trade off the specificity and reliability of a lengthy survey mechanism, like SUS, against something more succinct that could be used to survey users in-product with a high response rate.
We decided to experiment with UMUX-Lite, a simplified version of UMUX that studies had shown to be highly correlated with SUS scores and NPS. In a nutshell, the UMUX-Lite questionnaire plainly asks users to rate their agreement with two statements on a seven-point scale:
We knew from our research and feedback that some users found the feature set of Jira to be powerful, but that some users also found Jira difficult to use. The UMUX-Lite questionnaire seemed to be the perfect format to balance these two concerns that we were trying to address, and it could be delivered in a quick and unobtrusive way to our users, at scale.
We instrumented the survey in-product and started measuring and collecting feedback. We immediately saw an increase in the quality and usefulness of the responses in UMUX-Lite comments when compared to NPS. Our instrumentation also allowed us to segment scores for users who engaged with specific features or behaved in certain ways. This was incredibly useful for measuring the perceived usability impact of the new issue design, compared with the old design, as we rolled it out.
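As a rough illustration of scoring and segmenting responses, the sketch below uses the commonly published UMUX-Lite formula for two seven-point items (each item minus 1, summed, divided by 12 and scaled to 0 to 100). This is an assumption about the scoring, not a description of our actual instrumentation; the segment names and ratings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def umux_lite(capabilities, ease_of_use):
    """Score one response (two 1-7 agreement ratings) on a 0-100 scale."""
    for rating in (capabilities, ease_of_use):
        if not 1 <= rating <= 7:
            raise ValueError("ratings must be between 1 and 7")
    return ((capabilities - 1) + (ease_of_use - 1)) / 12 * 100


def mean_by_segment(responses):
    """Average the score per segment from (segment, capabilities, ease) tuples."""
    scores = defaultdict(list)
    for segment, capabilities, ease_of_use in responses:
        scores[segment].append(umux_lite(capabilities, ease_of_use))
    return {segment: mean(values) for segment, values in scores.items()}


# Hypothetical responses, tagged by which issue design the user saw.
responses = [
    ("old_issue_view", 4, 4),
    ("old_issue_view", 5, 3),
    ("new_issue_view", 6, 6),
    ("new_issue_view", 7, 5),
]

for segment, score in mean_by_segment(responses).items():
    print(f"{segment}: {score:.1f}")
```

Keeping the segment as a plain label on each response means the same scoring function can be reused for any cohort the instrumentation can identify.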
In an effort to correlate the impact of our designs at a feature level to a systems level, we experimented with using UMUX-Lite in task-based user testing. The Jira team was regularly running moderated and unmoderated usability tests of our designs. At the end of these tests, we would use the UMUX-Lite questionnaire to help assess whether a test was successful. However, there were a few problems that we discovered along the way with using UMUX-Lite in task-based testing.
The original paper intended UMUX-Lite to be a way to measure the usability of an entire system or product. We found that modifying the questionnaire for a specific feature, like “The capabilities of search meet my requirements”, and applying it in a test where you have set a user a specific task becomes very contrived. We observed that some participants would simply use the statement as a proxy for validating their task as a success or a failure. Similarly, some participants would evaluate their requirements based on information unrelated to the test, like their occupation and role.
As the UMUX-Lite score is calculated from only two statements, the improper application of this statement made the scores we reported inherently flawed. Incidentally, the same is true for using the NPS questionnaire in a task-based test. We also realised that we were getting no benefit from using the shorter UMUX-Lite questionnaire in task-based testing over other multi-question surveys like SUS, as we were not concerned with the time it took our incentivised participants to complete a questionnaire.
Upon reflection we came to understand these mistakes and moved to adopt a more thoughtful questionnaire that would have users rate their agreement with statements on a five-point scale using:
To judge the success of a test, we would consider the perceived usability reported in the questionnaire responses alongside outcome-based metrics like the success rate (the share of users who successfully completed a task) and the time on task (the average amount of time it took for a task to be completed).
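The two outcome-based metrics can be sketched as follows. The session data is made up, and averaging time on task over completed attempts only is an assumption; some teams also report time for failed attempts separately.

```python
from statistics import mean


def task_metrics(sessions):
    """Summarise (completed, seconds) pairs from a task-based test.

    Returns the success rate as a fraction and the mean time on task
    for completed attempts (None if nobody completed the task).
    """
    if not sessions:
        raise ValueError("need at least one session")
    completed_times = [seconds for completed, seconds in sessions if completed]
    success_rate = len(completed_times) / len(sessions)
    time_on_task = mean(completed_times) if completed_times else None
    return success_rate, time_on_task


# Hypothetical results for one task across four participants.
sessions = [(True, 42.0), (True, 58.0), (False, 120.0), (True, 50.0)]
rate, avg_seconds = task_metrics(sessions)
print(f"success rate: {rate:.0%}, time on task: {avg_seconds:.0f}s")
```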
When we experimented with using UMUX-Lite in task-based user testing, it did cause some misunderstanding within the company. Many had naively assumed that because a particular design we tested saw an increase in UMUX-Lite, once this design shipped we would naturally see that increase reflected in the scores we were collecting in our instrumented surveys. By employing a staged rollout strategy that used the principles of A/B testing, we were able to measure the impact of larger design changes on UMUX-Lite, like when we released a major update to Jira’s visual design language. However, products are of course made up of many features, and the measurable perceived usability impact of one small feature within the sum of many features in an entire system can be low.
The combination of SUS and design qualities in task-based testing proved to be a useful way for designers to get a signal on whether a particular design direction was better than another. It helped us get further confidence and validation that a design was sound, and build trust in our designs with product management and engineering.
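For reference, the standard published SUS scoring for its ten five-point items can be sketched like this. It covers only the SUS half of the evaluation; how scores were weighed against design qualities is not shown, and the sample responses are hypothetical.

```python
def sus_score(responses):
    """Score ten 1-5 SUS ratings on a 0-100 scale.

    Odd-numbered items are positively worded (contribution = rating - 1),
    even-numbered items are negatively worded (contribution = 5 - rating).
    The summed contributions (0-40) are multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    total = 0
    for item, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be between 1 and 5")
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5


# Hypothetical participant who answered neutrally to every item.
print(sus_score([3] * 10))
```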
Reflecting on the journey, I don’t have a dogmatic belief about a particular methodology, process or metric. I think it really comes down to deciding what is appropriate for the product or service. For a service where loyalty is important, NPS might be the perfect metric. For an ecommerce website, SUPR-Q might give you a more holistic evaluation. For Jira, we found that UMUX-Lite was the ideal lightweight questionnaire that could be instrumented at scale and would allow us to see the impact of our design changes over time. And in our task-based user testing, SUS in combination with an evaluation against our design qualities gave us the confidence we needed to move forward with a proposed design.