What to do if SUS scores contradict qualitative feedback?

TL;DR: the qualitative data collected in a usability experiment seem to contradict the quantitative results of the SUS questionnaire. How can this discrepancy be reconciled?

The following experiment is conducted to evaluate the usability of a web interface:

  1. Observe participants as they think aloud while using the interface to accomplish 8 tasks (the task order is randomized; this takes around 30 minutes)
  2. Give them a SUS form to fill out
  3. After they complete the survey, ask several follow-up questions to get more feedback (another 30 minutes)

So far, the experiment has been conducted with 5 participants, after which the UI was adjusted to address the issues found. A second round of 5 participants was then invited to go through the same steps.

Another round with at least 5 participants is planned (to obtain a sufficiently large sample). The current results are summarized below:

[Chart: SUS survey results for v1 and v2, with 95% confidence intervals]

You can see that the v2 score is lower than the v1 score.
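
For reference, the scores and intervals were computed along these lines. This is a minimal Python sketch of standard SUS scoring (odd items contribute response − 1, even items 5 − response, sum scaled by 2.5) plus a t-based 95% confidence interval; the raw responses below are made up for illustration, not the actual data.

```python
import numpy as np
from scipy import stats

def sus_score(responses):
    """Standard SUS scoring for one participant's 10 responses (1-5 scale)."""
    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10
    return (odd + even) * 2.5                   # scale to 0-100

def mean_with_ci(scores, confidence=0.95):
    """Mean and t-distribution confidence interval for a small sample."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(n)
    margin = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return mean, mean - margin, mean + margin

# Hypothetical questionnaires for one version (5 participants x 10 items)
v1_responses = [
    [4, 2, 4, 1, 5, 2, 4, 2, 4, 2],
    [3, 2, 4, 2, 4, 1, 5, 2, 4, 3],
    [5, 1, 4, 2, 4, 2, 3, 2, 4, 2],
    [4, 2, 3, 2, 4, 3, 4, 2, 3, 2],
    [4, 1, 5, 2, 4, 2, 4, 1, 4, 2],
]
v1_scores = [sus_score(r) for r in v1_responses]
print(mean_with_ci(v1_scores))  # (mean, lower bound, upper bound)
```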

These findings are puzzling, because:

  • the qualitative feedback I got from participants was more positive in v2
  • the changes between v1 and v2 were not ground-breaking, e.g.:

    • added tooltips to widgets
    • increased the contrast to make the active tab more prominent
    • changed wording to avoid technical jargon
    • shortened text
  • nevertheless, these tweaks did polish the "rough edges" of v1: the observations made it clear that there was less friction as participants used the site

In other words, the changes were small incremental steps that should have yielded small improvements. The qualitative results match the expectations, while the quantitative data do not.

Since the overall average of 69 falls in line with the average SUS score of 68, it seems that nothing unusual has happened and we're testing "just an average interface". However, I am not sure how to reconcile the fact that the numbers contradict the human feedback.

Nielsen says that qualitative feedback is more valuable and that numbers can lead you astray. On the other hand, Sauro says that they do report SUS scores based on samples of 5 users (he also reviews the history of sample sizes and concludes that a minimum of 5 is reasonable).

At the same time, a t-test says that the difference between the v1 and v2 scores is not statistically significant.
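
For completeness, a minimal sketch of how such a comparison can be run (Welch's variant of the t-test, which does not assume equal variances); the per-participant scores below are placeholders, not the real data:

```python
from scipy import stats

# Placeholder per-participant SUS scores; substitute the actual v1/v2 values.
v1_scores = [72.5, 65.0, 70.0, 77.5, 67.5]
v2_scores = [65.0, 70.0, 62.5, 72.5, 60.0]

# Two-sided Welch's t-test comparing the two independent samples.
t_stat, p_value = stats.ttest_ind(v1_scores, v2_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With n = 5 per group the test has very little power, so a non-significant result mostly reflects the small sample rather than evidence of "no difference".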

How could one make sense of these results?


Thank you all for your comments, answers, and time. Although there is only one accepted answer, all the input is helpful. It enabled me to take a sober look at the data and dial down the "jump-to-conclusions" factor.

A note for future archaeologists: the question was edited to include details and statistics mentioned in the comments. It might help to look at the edit history to see the starting point and understand how it ended up like this.