Every Machine Learning Team Needs a UX Researcher

On the surface, the artsy, human-centered world of design seems so distant from the math-heavy technical world of ML that the two hardly seem to fit together. It wasn’t until I listened to a talk on the ethics of ML that it clicked: just like any standard software development project, having a UX researcher integrated on an ML team will help ensure a successful application.

One of the big problems today is a general lack of understanding of ML fundamentals. ML is treated like some obscure magic that everyone talks about but few actually understand. This lack of knowledge makes it difficult for both senior management and product teams to recognize the benefit of having a UX researcher involved in an ML project.

Failing to Consider the Human Factor in Personalization

[Image: Example of facial recognition software]

Imagine working for a large bank that spent a lot of time and money developing a facial recognition system to identify customers and create tailored experiences for them as they walk through the door. After launching, the bank quickly discovers that customers are disturbed when employees greet them by name and know details about their accounts before they have said anything or shown identification. Customers who frequent the branch might not notice, but those who rarely or never visit might find it unsettling.

The bank is in a difficult spot. Does it double down and launch an advertising campaign to make customers aware of what it is doing and why it benefits them? This might work, but without research, it’s difficult to know. The other option is to scrap the entire program. Having a UX researcher on the team could help; UX researchers are always eager to talk to users before the development team starts. In this example, with only five to ten customer interviews, a UX researcher could have identified several issues with the bank’s plan, allowing the team to address them before moving forward or deciding to scrap the project.

Critical Points in the ML Workflow Where UX Should be Involved

There are three critical points in the ML workflow where UX should be involved:

  1. At the initial idea phase of the project: a UX researcher can help validate that the project makes sense to users and help identify any potential issues before the ML team starts collecting data or building a model.
  2. At the data collection stage, utilizing the initial interviews and research findings, a UX researcher can help the ML team understand gaps or holes in the data.
  3. After the model has been trained, a UX researcher can evaluate whether the model meets users’ needs and expectations.

In the next section, we will take a more in-depth look at these three areas.

At the Idea Phase of the Project

To gather and validate as much information as possible about the project’s goals, user needs, and expectations, a UX researcher can help the team answer the questions below.

  • What problem or opportunity is the ML application trying to solve?
  • Does the team have access to end-users? Having the UX researcher conduct five to ten thirty-minute interviews with end-users is essential to understanding if the project’s strategy makes sense or if a pivot is necessary.
  • Are there any existing applications, systems, or models that the team can leverage?
  • Where is the model being deployed?
    Will the end-user be sitting at a computer with a high-speed connection, on their phone in the middle of a farmer’s field scanning plants for diseases, or in a factory setting using embedded hardware?
  • How explainable does the model need to be?

Suppose a deep neural network is being used to recommend type 2 diabetes medication based on a patient’s clinical values. Will the doctor accept the model’s recommendation with no supporting insight, or do they require a clear explanation of how the model arrived at it? If the model is not explainable, it is difficult to know whether it is safe.
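
To make “explainable” concrete, here is a minimal sketch of one common, model-agnostic technique, permutation feature importance from scikit-learn, applied to a hypothetical classifier. The feature names, data, and model are all stand-ins for illustration, not the bank’s or hospital’s actual system:

```python
# Sketch: rank which clinical features drive a model's predictions using
# permutation importance. The features and data below are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = ["hba1c", "fasting_glucose", "bmi", "egfr", "age"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the test score drops;
# larger drops mean the model leans on that feature more heavily.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, importance in sorted(zip(feature_names, result.importances_mean),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.3f}")
```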

How accurate do users expect the model to be?

If an ML model designed to predict a car’s price based on its features is accurate within a few hundred dollars, that’s likely acceptable to most users. However, an ML model used for medical imaging should be extremely accurate.
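
One way a UX finding such as “accurate within a few hundred dollars” becomes actionable is by translating it into a measurable target. A small sketch, using mean absolute error, made-up prices, and an assumed $500 threshold:

```python
# Sketch: compare model error (in dollars) against a user-derived threshold.
# Prices, predictions, and the $500 threshold are illustrative assumptions.
from sklearn.metrics import mean_absolute_error

actual_prices = [21500, 18200, 34999, 12750, 27300]
predicted_prices = [21900, 17800, 35600, 12300, 27050]

mae = mean_absolute_error(actual_prices, predicted_prices)
acceptable_error_dollars = 500  # agreed on with users during UX research

print(f"Mean absolute error: ${mae:.0f}")
print("Meets user expectation" if mae <= acceptable_error_dollars
      else "Needs improvement")
```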

How fast does the model need to make predictions?

Is it okay if the model takes a few seconds, or does it need to be instantaneous? Imagine having to wait several seconds before your smart doorbell realizes that a person is standing at your doorstep. High accuracy often comes at the cost of speed, and vice versa; there is usually a balance to strike between the two.
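
A quick latency check against a budget derived from UX research can settle this early. A minimal sketch, assuming a stand-in scikit-learn model and a hypothetical one-second budget:

```python
# Sketch: measure single-prediction latency against a UX-derived budget.
# The model, data, and one-second budget are stand-ins for illustration.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=0).fit(
    np.random.rand(1000, 20), np.random.randint(0, 2, size=1000))

latency_budget_seconds = 1.0  # assumption: agreed on with users
sample = np.random.rand(1, 20)

timings = []
for _ in range(100):
    start = time.perf_counter()
    model.predict(sample)
    timings.append(time.perf_counter() - start)

p95 = sorted(timings)[94]  # approximate 95th-percentile latency
print(f"95th percentile latency: {p95 * 1000:.1f} ms")
print("Within budget" if p95 <= latency_budget_seconds else "Too slow")
```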

What is the impact when the ML model makes an incorrect prediction?

Since no model is perfect, it’s crucial to think about what could happen when the model makes a wrong prediction. In a test by the ACLU of Massachusetts, one in six images of famous athletes from the Boston Bruins, Boston Celtics, Boston Red Sox, and New England Patriots was falsely matched to an individual in a database of 20,000 public arrest photos [2]. Imagine if this same facial recognition model were deployed for use at traffic stops by police departments.

After conducting end-user interviews and thinking critically through the questions above, the team should have a clear understanding of how to move forward. If there is any ambiguity or disagreement, the team should pause and do more UX research to address any issues before building a dataset.

The ML Dataset

All ML models need data, and depending on the type of ML used, a substantial amount of data may be necessary. But how can a UX researcher help with creating a dataset? This might seem like the ML or data engineers’ job, and they will do most of the heavy lifting, but there are vital UX considerations when preparing or evaluating a dataset.

Is the dataset accurate?

IBM Watson for Oncology came under heavy criticism for recommending unsafe or inaccurate treatments [7]. It was discovered that IBM trained the model on synthetic patient data created by doctors at Memorial Sloan Kettering (MSK) Cancer Center rather than on real patient data. Paired with the explainability struggles of ML models, it’s easy to see why clinicians would be reluctant to accept ML technology into their daily workflows.

Is the dataset equitable?

This might seem like an easy question to answer, but let’s take a more in-depth look. If a city road commission wants to identify where all of the potholes in the city are, it might create a phone app that lets users report potholes and send the GPS location back to the city. This sounds like a good idea; however, smartphone users tend to be younger and wealthier [3]. That leaves areas of the city with older or lower-income populations underrepresented. It is essential to think not only about who is included in the dataset but also about who is excluded or underrepresented.
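
A simple comparison of who appears in the dataset against who lives in the city can surface this kind of gap. A small sketch with hypothetical report counts and census figures:

```python
# Sketch: compare each neighborhood's share of pothole reports to its share
# of residents. All names and numbers below are made-up placeholders.
import pandas as pd

reports_by_neighborhood = pd.Series(
    {"downtown": 420, "university_district": 310, "riverside": 35, "northside": 20})
population_by_neighborhood = pd.Series(
    {"downtown": 30000, "university_district": 25000, "riverside": 28000, "northside": 22000})

report_share = reports_by_neighborhood / reports_by_neighborhood.sum()
population_share = population_by_neighborhood / population_by_neighborhood.sum()

# Neighborhoods whose share of reports falls far below their share of
# residents are likely under-represented in the dataset.
gap = (population_share - report_share).sort_values(ascending=False)
print(gap)
```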

What is the likelihood that the ML model’s input data will shift at some point in the future?

If the data does change, how easy will it be to detect the difference and its impact on end-users? Imagine a real estate company in the Midwest created a linear regression model that helped its agents determine a selling price for houses based on their features. The company was later acquired by a national real estate company that hoped to use the model across the country. But house prices in the Midwest are very different from those in California, so the model was no longer accurate in certain areas and needed to be retrained with additional data. Thankfully, the resulting price inaccuracy was easy to spot; that may not always be the case.
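
Detecting this kind of shift can also be automated. A minimal sketch, assuming the feature of interest is sale price and using a two-sample Kolmogorov–Smirnov test on made-up Midwest and California prices:

```python
# Sketch: flag distribution shift between training data and new production
# data with a two-sample KS test. Prices and thresholds are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
midwest_training_prices = rng.normal(loc=250_000, scale=60_000, size=5_000)
california_production_prices = rng.normal(loc=750_000, scale=200_000, size=5_000)

statistic, p_value = ks_2samp(midwest_training_prices, california_production_prices)
if p_value < 0.01:
    print(f"Distribution shift detected (KS statistic = {statistic:.2f}); "
          "the model likely needs retraining with local data.")
```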

Are there any legal or contractual limitations that prevent an existing set of data from being used in an ML model?

In our bank example, imagine that the bank took a picture of each client when they opened an account for security purposes. Is the bank able to use these images in its facial recognition model? Recently, Google and Ascension came under fire when the Wall Street Journal reported that Ascension had been sending its clinical data, including patient names, birthdates, labs, diagnoses, and more, to Google without the consent of Ascension’s doctors or patients. Although this is legal under the Health Insurance Portability and Accountability Act of 1996, it’s essential to stop and think about how patients feel knowing that their health data is being sent, without their consent, to another company that hopes to benefit from it financially [1].

Does the dataset contain data like age, race, or gender?

Using this data in an ML model could violate US discrimination and civil rights laws, depending on the application [5]. Additionally, using data that is not protected but strongly correlates with protected-class data could land a company in hot water. Take first names: many are popular for a time and then fall out of favor, so it is often possible to infer someone’s age from their name alone [6].
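
One way to audit for such proxies is to check how well a supposedly neutral feature predicts the protected attribute on its own. A small sketch with a tiny, fabricated example:

```python
# Sketch: if first name alone predicts an age bracket with high accuracy,
# the "neutral" feature is acting as a proxy. All data here is fabricated.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.DataFrame({
    "first_name": ["Mildred", "Dorothy", "Gertrude", "Ashley", "Brittany", "Madison"] * 20,
    "over_50":    [1, 1, 1, 0, 0, 0] * 20,
})

# One-hot encode the name and measure how well it alone predicts the
# protected attribute.
X = pd.get_dummies(data[["first_name"]])
scores = cross_val_score(LogisticRegression(), X, data["over_50"], cv=5)
print(f"Predicting 'over 50' from first name alone: {scores.mean():.0%} accuracy")
```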

Is the data subjective?

While designing an ML dataset to filter through job applicants, would HR employees agree on what features make a successful employee? Is a college degree more or less important than work experience? What if that college degree is in a completely different field than the one the applicant is applying for? The reasonable answer is that it depends. In situations like this, extra care must be taken; perhaps a more generalized model is built, and humans make the final decisions from there.

Does the dataset need to be labeled?

If yes, who is going to do that work, and are there any concerns? Labeling data might seem straightforward, but even experts can disagree on how to classify a piece of data. Imagine two doctors disagreeing on their interpretation of images of a patient’s eye used to diagnose diabetic retinopathy, and the downstream impact that disagreement could have on model accuracy.
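
Inter-annotator agreement metrics such as Cohen’s kappa can quantify that disagreement before it silently degrades the labels. A minimal sketch with hypothetical gradings from two doctors:

```python
# Sketch: measure agreement between two annotators with Cohen's kappa.
# The retinopathy gradings below are hypothetical.
from sklearn.metrics import cohen_kappa_score

doctor_a = ["no_dr", "mild", "severe",   "mild",  "no_dr", "moderate", "severe", "mild"]
doctor_b = ["no_dr", "mild", "moderate", "no_dr", "no_dr", "moderate", "severe", "moderate"]

kappa = cohen_kappa_score(doctor_a, doctor_b)
print(f"Cohen's kappa: {kappa:.2f}")  # ~0.52 here: only moderate agreement,
                                      # worth resolving before training
```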

It cannot be overstated how important it is that the dataset is reviewed. This analysis really could be the difference between a successful product and making the Wall Street Journal for discrimination. Utilizing the UX work done at the beginning of the project, a UX researcher can help the team understand if the data aligns with the users’ needs and help identify any quality or ethical issues.

Once all of the data issues have been resolved, the ML engineers design and train the model. When the trained model is ready, it’s time for UX to usability test it. The idea of usability testing an ML model might seem odd, but it has many similarities to standard usability testing and evaluation.

In a perfect world, every ML model would be 100% accurate at making predictions and lightning-fast. In reality, especially when deploying on embedded hardware, there is a trade-off between accuracy and speed.

  • Accuracy: Having an accurate model is essential but can sometimes come at the cost of speed. The accuracy of an ML model used to diagnose eye issues in retina scans must be high.
  • Speed: How fast does the model return a prediction? Imagine that every time someone walked up to your smart doorbell, it took a minute before the model predicted that a person was standing at your door. This prediction needs to happen within a second or two.
  • Accuracy vs. speed: There can be a give and take between accuracy and speed, especially when ML is deployed to smaller embedded hardware. Sometimes trade-offs have to be made, and having done the UX research will give the team a good sense of where to start.
  • Errors: Even if an ML model is 99.9% accurate, there is no guarantee that it won’t make an erroneous prediction. Researchers found that COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a tool used in state court systems during sentencing to help judges understand a defendant’s risk of recidivism (the tendency of a convicted criminal to reoffend), was at best accurate 61% of the time, and only 21% of the time for violent recidivism [4]. Additionally, the analysis determined that the tool disproportionately misclassified black defendants as higher risk while misclassifying white defendants as lower risk. In cases like these, the errors produced by a model can have a significant impact on people’s lives; a sketch of the kind of per-group error check that surfaces such disparities follows this list.
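
As referenced in the Errors point above, a basic per-group error check can surface this kind of disparity. A small sketch with fabricated predictions and group labels:

```python
# Sketch: compare false-positive rates across two groups, the kind of
# disparity the COMPAS analysis surfaced. All values are fabricated.
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0])
group  = np.array(["a"] * 6 + ["b"] * 6)

for g in np.unique(group):
    negatives = (group == g) & (y_true == 0)  # truly-negative cases in this group
    fpr = y_pred[negatives].mean()            # share wrongly flagged as positive
    print(f"Group {g}: false positive rate = {fpr:.0%}")
# Here group "a" is wrongly flagged 75% of the time versus 25% for group "b".
```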

The UX evaluation of the model will help determine the next steps and set expectations for performance and acceptance. If the model falls short of end-user needs or expectations, the ML team must improve it; this could mean refining the dataset or the model itself. The team should repeat this process until the model passes a usability evaluation.

The world of ML is complicated and expanding every day. It’s easy to get caught up in the technical aspects and lose track of the user-centric issues. For an ML project to be successful, technical know-how must be paired with UX guidance to ensure that the right work is done. When an ML project goes wrong, the best case is that users simply won’t use it; the worst case is a profound negative impact on someone’s life.

What Can be Done?

It’s doubtful that anyone sets out to build ML models that don’t meet their users’ needs, are biased, or are inaccurate. Instead, teams likely find themselves there due to a general lack of knowledge. Individuals and companies focus on the technical tools and know-how needed to build ML and miss the equally crucial human side. What can be done to help prevent ML catastrophes?

Learn

Google, Microsoft, and others have resources to help teams learn about responsible AI. Additionally, they provide tools to help teams evaluate their datasets and models.

Adopt or create a set of standards and ethics for responsible ML

Like Google, Microsoft, and others, consider what responsible ML means to your company or team and write it down.

Be robust

The agile mindset has trained us to be lean and get a minimum viable product out to the market as quickly as possible. While adopting agile principles is fine, it cannot come at the expense of quality, and as we have demonstrated, this is critical in the ML space.

Test, test, and then test some more

Beyond usability testing, test what happens when the data shifts and what happens when the input data contains errors. Ensure that the ML model works as intended under all conditions and can be trusted.
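
A couple of simple robustness checks, sketched below with a stand-in linear model and illustrative data, show what this can look like in practice:

```python
# Sketch: two basic robustness checks, shifted data and malformed input.
# The model, data, and shift amount are stand-ins for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = X_train @ np.array([3.0, -1.0, 0.5, 2.0]) + rng.normal(scale=0.1, size=500)
model = LinearRegression().fit(X_train, y_train)

# 1. Shifted data: does accuracy degrade gracefully or collapse?
X_shifted = X_train + 5.0
print(f"R^2 on training-like data: {model.score(X_train, y_train):.2f}, "
      f"on shifted data: {model.score(X_shifted, y_train):.2f}")

# 2. Malformed input: does the pipeline fail loudly instead of silently?
bad_input = np.array([[np.nan, 1.0, 2.0, 3.0]])
try:
    model.predict(bad_input)
except ValueError as err:
    print(f"Malformed input rejected as expected: {err}")
```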

References

[1] Copeland, R. (2019, November 11). Google’s ‘Project Nightingale’ Gathers Personal Health Data on Millions of Americans. WSJ. https://www.wsj.com/articles/google-s-secret-project-nightingale-gathers-personal-health-data-on-millions-of-americans-11573496790

[2] Facial recognition technology falsely identifies famous athletes. (2019, October 23). ACLU Massachusetts. https://www.aclum.org/en/news/facial-recognition-technology-falsely-identifies-famous-athletes/#athletes

[3] Pew Research Center. (2019, May 12). Mobile Fact Sheet. https://www.pewresearch.org/internet/fact-sheet/mobile/

[4] ProPublica. (2020, February 29). How We Analyzed the COMPAS Recidivism Algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

[5] Protected Classes in Employment Discrimination. (n.d.). HG.Org. https://www.hg.org/legal-articles/protected-classes-in-employment-discrimination-30939

[6] Silver, N., & McCann, A. (2014, May 28). How to Tell Someone’s Age When All You Know Is Her Name. FiveThirtyEight. https://fivethirtyeight.com/features/how-to-tell-someones-age-when-all-you-know-is-her-name/

[7] Swetlitz, I., & Brodwin, E. (2018, July 30). IBM’s Watson recommended “unsafe and incorrect” cancer treatments. STAT. https://www.statnews.com/2018/07/25/ibm-watson-recommended-unsafe-incorrect-treatments/