Getting Started with Statistics for UX

UX designers have a variety of methods for gathering feedback and iterating on their designs, such as contextual interviews, persona creation, customer journey maps, and storyboards. Some of these methods are widespread and intuitive; they reveal a lot about user needs and stories.

Qualitative methods are driven by the urge to understand the users and empathize with them to create better design solutions. Nevertheless, qualitative methods aren’t always the right methods. Sometimes we need to step back and take a different perspective to understand the “who” and “what” of user behavior. That’s when quantitative data and statistics could help us with collecting “data” rather than just insight.

Recording data about the number of user errors and unfinished tasks during a usability test, and then using graphs to present the frequency and severity of usability issues is quite useful. However, once you go beyond graphs, averages, and percentages, you will need to delve into analysis and estimations with the help of inferential statistics.

As the name suggests, inferential statistics are used to draw conclusions and make generalizations about a population based on a representative sample taken from it.

Applying inferential statistics can help you find out how much more confident you can be in your resulting UI design choice or prioritization.

Even if you’re only doing qualitative usability testing (concerned with what the usability problems are, not how frequent they are), understanding the basic principles of inferential statistics is useful. For instance, it can help you explain to stakeholders why even 1 in 5 users failing a task is justification for changing a UI intended for many thousands of users.
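To make the 1-in-5 argument concrete, here is a minimal sketch that puts a confidence interval around an observed failure rate. It assumes the Wilson score interval, which is one common choice for small samples and is not a method named in this article; the function name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% confidence)."""
    p_hat = failures / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# 1 of 5 test participants failed the task
low, high = wilson_interval(failures=1, n=5)
print(f"Plausible failure rate in the population: {low:.1%} to {high:.1%}")
```

With 1 failure out of 5, the interval runs from roughly 4% to 62%. Even the low end, applied to many thousands of users, means hundreds of people may hit the same problem.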

Statistical analysis and inferential thinking are applied in a myriad of UX research methods, including usability tests, card sorting, surveys, and physiological testing such as eye tracking. Don’t panic, though: in this article we will only discuss the use case scenarios of two frequently used statistical analyses and simplify their explanations as much as possible.

What statistics to know?

For interpreting research results such as surveys and usability tests, some inferential statistical procedures can cover most quantitative research methodologies. Statistical tools that begin with basic correlation and T-test (this will be discussed later) are fairly easy to access. In fact, if you have Excel or access to an online calculator, you probably have most of these tools already at your fingertips!

Nowadays many UX professionals come to the profession without formal statistical training. As a matter of fact, usability studies, contextual inquiries, surveys, and other UX research methods are sometimes performed on an ad hoc basis by designers, information architects, product managers, and front-end coders who have had no formal training in these methods, let alone training in the statistical tools used in the analysis of the data collected through such methods.

As mentioned, in this article, we will go through two frequently used and easy-to-understand statistical methods employed by UX practitioners and researchers and their use case scenarios in UX.

What type of data to look for?

As the nature of usability testing is qualitative, I would argue that you need very little statistical knowledge to understand and interpret usability measurements. However, some descriptive statistics such as the following could help you find the red flags:

Completion Rate: the number of users who complete a task
Number of Errors: the number of mistakes users make
Task Duration: the time it takes to complete a task

Descriptive statistics are used to summarize data, for instance, measures of frequency. Quantifying the aforementioned three factors should tell you if a user is capable of successfully completing a task with ease.

Severe issues are flagged when users cannot complete a task, make many errors while completing it, or take a long time to complete it. These metrics help you understand if what was designed is indeed usable.
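As a rough illustration, the three metrics can be computed from raw session records in a few lines. The data below is entirely hypothetical, and the tuple layout (completed, errors, seconds) is an assumption made for this sketch.

```python
from statistics import mean

# Hypothetical results for one task: (completed?, number of errors, seconds taken)
sessions = [
    (True, 0, 42.0),
    (True, 2, 75.5),
    (False, 4, 120.0),
    (True, 1, 58.0),
    (True, 0, 39.5),
]

# Share of participants who finished the task (True counts as 1)
completion_rate = sum(done for done, _, _ in sessions) / len(sessions)
# Average number of mistakes per participant
mean_errors = mean(errors for _, errors, _ in sessions)
# Average time on task, in seconds
mean_duration = mean(secs for _, _, secs in sessions)

print(f"Completion rate: {completion_rate:.0%}")
print(f"Mean errors: {mean_errors:.1f}")
print(f"Mean task duration: {mean_duration:.1f} s")
```

A low completion rate, a high error count, or a long duration on a given task is the kind of red flag worth a closer look.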

It is worth mentioning that I am only going to go through each statistical method and its use case scenario rather than the technical details of performing it. All of these methods can be calculated easily in Excel, with free online calculators, or in SPSS, a software package for statistical analysis.

Now let’s dive in and see what all the fuss is about with statistics in UX:

Independent sample T-tests

As the UXer on the team, you are probably frequently asked to determine which version of a design or feature is better, in essence, more usable. You may also be asked to determine which design is preferable on a variety of attributes, such as sophistication, trust, emotional connection, aesthetic appeal, and of course, its commercial potential.

“Design” could be a web page, the ever-important shopping-cart check-out process, a prototype, or a single important feature. In fact, this kind of comparison test might be one of the most common types of jobs of UX professionals.

So what should you do when the final design needs to be chosen? That choice isn’t always easy without further, and perhaps more sophisticated, analysis.

Here is an example of such situations:
Suppose you are comparing the perceived aesthetic appeal of 2 images for the homepage of your bike e-commerce website. You’ve recruited 20 subjects: 10 view and rate image A, and another 10 view and rate image B. In other words, you are surveying the design with two independent samples of participants.

Photos by Murillo de Paula and Carl Nenzen Loven from Unsplash

A sample is a small group of individuals chosen from the larger population.

For instance, if you have an e-commerce website for clothes and you have more than 10,000 customers, you are not going to survey all of them. Time and finances stop you from doing so, so instead you choose 20 random people from the customer base. These 20 people are your sample.
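Drawing such a sample can be sketched as follows. The customer IDs are hypothetical placeholders, and the fixed seed is only there so the draw can be repeated.

```python
import random

# Hypothetical customer base: IDs 1 through 10,000
customer_ids = list(range(1, 10_001))

# Seeded generator so the same sample can be reproduced later
rng = random.Random(42)

# Simple random sample of 20 customers, drawn without replacement
sample = rng.sample(customer_ids, k=20)
print(sorted(sample))
```

Because every customer has the same chance of being picked, the 20 people are more likely to be representative than, say, the first 20 who answer an email.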

At this point, you might ask why not have just a sample of 10 people rating both designs? Well, sometimes you have to eliminate any potential user bias originating from viewing one design after another. The first design affects the person, so they cannot give an objective evaluation of the second design.

Now back to your design: you have already obtained the perceived aesthetic appeal ratings of 10 subjects per design. What you have to do next is compare the means of your two independent samples. In statistics, mean is another name for average. By comparing the means, you want to determine which image has the higher perceived aesthetic appeal.

Null and alternative hypothesis

To do so, we will create two hypotheses in an effort to decide between the two:

Null hypothesis and,
Alternative hypothesis

The null hypothesis is the default position. To put it simply, in our example the default position would be the assumption that there is no difference in the aesthetic appeal of the two images.

Statistically speaking, the default position is called the null hypothesis and its symbol is H0. The symbol for a population mean is μ (mu); the hypotheses are statements about the populations that the samples were drawn from.

So in our design example, this is how we can show it:

H0: μ1 = μ2

This is how we can read it: the null hypothesis (H0) expects equal means for design 1 (μ1) and design 2 (μ2). In other words, the two designs do not differ with respect to the aesthetic appeal of the two images.

In statistics, besides the null hypothesis, we also have the alternative hypothesis.

The alternative hypothesis is the one that we want to prove. Its symbol is H1 and it directly contradicts the null hypothesis, H0.

In our design case, the alternative hypothesis or H1 would indicate that there is a significant difference between the mean aesthetic appeal of the two images:

H1: μ1 ≠ μ2 (The two designs do indeed differ with respect to mean aesthetic appeal of the two images)

Significance means that the result is unlikely to be due to chance alone.

Photo from xkcd.com, used with permission

Means and the significance of their difference can be calculated easily with a T-test in Excel or SPSS. A T-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related to certain features.

In our case, the means we are comparing are the perceived aesthetic appeal of image A versus image B.

If the T-test lets us reject the null hypothesis in favor of the alternative hypothesis (H1), which by convention happens when the p-value falls below a chosen threshold such as 0.05, we can conclude that the mean aesthetic appeal of the two images differs. The design with the higher mean rating is then chosen.

What we just covered is called an independent sample T-test. Having two different groups of people is the reason this approach is called “independent samples”. Basically, no one person is in both groups, and the mean score for one group of people is totally independent from the mean score for the other group.
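For readers who want to see the arithmetic behind the spreadsheet, here is a minimal sketch of an independent sample T-test using only Python’s standard library. The ratings are invented for illustration, the function name `pooled_t_statistic` is hypothetical, and the sketch assumes the equal-variance (pooled) form of the test with a two-tailed 5% significance level.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples, assuming equal variances."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical aesthetic appeal ratings from two different groups of 10 people
image_a = [6, 7, 5, 8, 7, 6, 7, 8, 6, 7]
image_b = [5, 4, 6, 5, 5, 4, 6, 5, 4, 5]

t_independent = pooled_t_statistic(image_a, image_b)

# Two-tailed critical value for 18 degrees of freedom at the 5% level (from a t table)
T_CRITICAL_DF18 = 2.101
reject_h0_independent = abs(t_independent) > T_CRITICAL_DF18

print(f"t = {t_independent:.3f}, reject H0: {reject_h0_independent}")
```

With these made-up numbers the t statistic is well above the critical value, so H0 would be rejected. If SciPy is available, `scipy.stats.ttest_ind(image_a, image_b)` returns the same statistic together with a p-value.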

Of course, at times it is appropriate, and may even be a better choice, for the same person to evaluate both designs. This is what I am covering next.

Paired sample T-tests

What we did in the previous scenario was launch a survey with two different designs, one home page with picture A and another with picture B, show them to two different groups, and sit back to see which one would win.

The reality is that we often don’t have the luxury of obtaining even moderately small sample sizes. Why?

Because (1) it is commonly argued that, in usability testing, larger samples reveal few additional problems, and (2) conducting studies with large samples is both time-consuming and expensive.

Sometimes you have to compare 2 designs in less time and that’s when you opt to compare Design A and Design B, one after another and with one sample.

Of course, you’ll counterbalance to reduce bias. With counterbalancing, the participant sample is divided in half, with one half seeing picture A and then picture B, and the other half seeing picture B and then picture A. Counterbalancing is a technique used to deal with order effects when we have only one sample for evaluating two variations.
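A counterbalanced assignment can be sketched like this. The participant labels and the function name `counterbalance` are hypothetical, and the sketch assumes an even number of participants and exactly two designs.

```python
import random

def counterbalance(participants, seed=None):
    """Randomly split participants in half and give each half a viewing order."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # First half sees A then B, second half sees B then A
    orders = {p: ("A", "B") for p in shuffled[:half]}
    orders.update({p: ("B", "A") for p in shuffled[half:]})
    return orders

participants = [f"P{i}" for i in range(1, 11)]  # 10 hypothetical participants
orders = counterbalance(participants, seed=7)

for person in participants:
    print(person, "->", " then ".join(orders[person]))
```

Shuffling before splitting keeps the viewing order from lining up with recruitment order, which could otherwise introduce a bias of its own.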

So this is one group of users and two designs: welcome to the world of paired-samples T-tests.

As opposed to the scenario with independent samples, the fundamental characteristic here is that each person evaluates both designs. So, in essence, the aesthetic appeal evaluations are “paired.”

One person provides two data points, and we know which two came from a given person. As mentioned, you reduce order bias by having, for instance, 5 subjects view design A first and then design B, and another 5 subjects view design B first and then design A.

When we are comparing two means and the data has the same people providing measurements of both alternatives (the aesthetic appeal rating of two designs, or the time for the performance of some task for two designs), we have what is referred to as “paired data” and hence, the statistical method is called paired sample T-tests.

The hypotheses are the same as in the independent sample T-test:

H0: μ1 = μ2 (The two designs do not differ with respect to mean aesthetic appeal of the two images)

H1: μ1 ≠ μ2 (The two designs do, indeed, differ with respect to mean aesthetic appeal of the two images)
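The paired version works on the difference between each person’s two ratings. Below is a minimal standard-library sketch with invented ratings, again assuming a two-tailed 5% significance level; the variable names are hypothetical.

```python
import math
from statistics import mean, stdev

# Hypothetical ratings: position i in both lists belongs to the same participant
ratings_a = [6, 7, 5, 8, 7, 6, 7, 8, 6, 7]
ratings_b = [5, 6, 5, 6, 6, 5, 7, 6, 5, 6]

# One difference per participant (design A minus design B)
differences = [a - b for a, b in zip(ratings_a, ratings_b)]
n = len(differences)

# Paired t statistic: mean difference divided by its standard error
t_paired = mean(differences) / (stdev(differences) / math.sqrt(n))

# Two-tailed critical value for n - 1 = 9 degrees of freedom at the 5% level (from a t table)
T_CRITICAL_DF9 = 2.262
reject_h0_paired = abs(t_paired) > T_CRITICAL_DF9

print(f"t = {t_paired:.3f}, reject H0: {reject_h0_paired}")
```

Because each person acts as their own baseline, differences between people drop out of the comparison. If SciPy is available, `scipy.stats.ttest_rel(ratings_a, ratings_b)` gives the same statistic along with a p-value.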

Conclusion

By and large, detailed exploration of the underlying behavior of users has been left to qualitative techniques and analysis. However, quantitative and qualitative methods both have their place in UX validation. Setting tasks and collecting quantifiable responses can help you test hypotheses and generalize a finding from a representative sample to the wider population. Alongside qualitative research methods and data from analytics, bringing some statistics into our everyday design research could help us with decision-making and prediction.

Quantitative techniques could be widely used in user experience research, especially taking into account the rich array of statistical techniques and available tools such as free online calculators and Excel.

