Distinguishing aria-live speech from UI accessible names

We just ran our first user test of an accessible UI prototype with blind and partially-sighted testers. In spite of my efforts to follow the various recommendations and WCAG specifications, there were major issues.

By far the greatest of these was the confusion that arose when the screen reader's reports of changes to aria-live regions ran straight into its reading of UI accessible names in response to keyboard navigation, with nothing to separate the two.

I should point out that our product is a first-aid training simulator. (Spoiler: Act quickly, or the patient dies).

It is a training product for a very wide audience, but far more like a game than a web 'page', although we are using the browser as a runtime.

An aria-live region might report "A paramedic has entered the room" or "The patient has opened his eyes". This happens as an indirect (and delayed) response to the user's actions.

Our testers kept trying to find a relationship between these kinds of reports and their keyboard input. I am sure the real-life situation is confusing and hurried, but at least there you can tell the difference between your own hands and someone else's.

The chief problem (as I see it) is that these two semantically distinct kinds of content were read with exactly the same synthesised voice and with no gaps. The result was cacophony: button labels were read out in a continuous stream mixed with reports about what was going on in the simulated world.

Such confusion does not arise with sighted users of our product because the fictional/diegetic/simulated world simply 'looks different' to the GUI used to interact with it.

I'm confident that we can get the UI to behave in an understandable way, but I am pretty stumped about how we might use aria-live regions for content which updates more than once per second without the whole experience descending into cacophony.

I was using "polite" aria-live regions, a setting which (per spec) should hold announcements back until the next graceful opportunity, and which I took to promise some kind of figure-ground relationship between different kinds of content, rather than a babble of competing word salads.
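
For reference, the wiring is essentially the textbook pattern. The sketch below is a simplification rather than our real code; the element ID and the message are invented.

```typescript
// Simplified sketch of a polite live region (the element ID is made up).
// The region exists in the DOM from load; we only ever swap its text content.
const worldEvents = document.getElementById('world-events') as HTMLElement;
worldEvents.setAttribute('aria-live', 'polite');
worldEvents.setAttribute('aria-atomic', 'true');

// Called by the simulation whenever something happens in the fictional world.
function announceWorldEvent(message: string): void {
  worldEvents.textContent = message;
}

announceWorldEvent('A paramedic has entered the room.');
```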

Most of the discussion about aria-live seems to assume 'page-like' or 'document-like' content. I have followed those recommendations, and the result was disappointing enough that I am now searching for alternatives. There is a bit of a scene growing up around 'accessible gaming', but it appears to consist mostly of gamers rather than developers, and discussions of techniques and implementations are almost as rare as rocking-horse dung.

I know there is a (contentious) effort to get screen readers to support CSS3 Speech, so that different semantics may be 'styled' with different voices.

This would be a very fine (and standard!) solution to our problem, but the screen reader 'community' (developers, engineers and users) appears either to regard it as a low-priority feature or to argue actively against it (for reasons that mostly don't apply in our case). Certainly there are no implementations out there that we can reasonably rely on.

So my question is this: How do we design the UX of a relatively fast-moving 'game-like' app so that aria-live regions from the in-fiction (diegetic) universe 'sound different' to the UI?

I have a few ideas.

  • Handle the in-fiction/diegetic stuff with our own accessible audio (e.g. pre-recorded mp3 files) rather than being at the mercy of how screen readers treat aria-live. (More audio is more? The note after the first sketch below touches on this.)

  • Prefix the in-fiction/diegetic stuff with some kind of distinct 'beep' or other brief sound effect (the first sketch after this list is roughly what I mean).

  • Try to choreograph the changes to aria-live regions so they are far less likely to interfere with UI label readings: "polite" on steroids (the second sketch after this list is roughly what I mean).

  • Offer a special 'training level' so that screenreader users can discover the UI without the simultaneous urgency of saving the life of an imaginary patient.
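
To make the 'beep' idea concrete, this is the sort of thing I have in mind. It is only a sketch: the file path and the element are invented, and I have not tested how individual screen readers interleave it with their own speech.

```typescript
// Sketch of the earcon idea: play a short, distinctive chime before handing a
// diegetic message to the live region, so the listener knows the next
// utterance comes from the simulated world rather than from the UI.
// The file path is a placeholder, not a real asset.
const EARCON_URL = '/audio/world-event-chime.mp3';

function announceDiegetic(liveRegion: HTMLElement, message: string): void {
  const chime = new Audio(EARCON_URL);

  // Write the text only once the chime has finished, so the earcon and the
  // synthesised speech do not talk over one another.
  chime.addEventListener('ended', () => {
    liveRegion.textContent = message;
  });

  // If playback fails (autoplay policy, missing file), fall back to text only.
  chime.play().catch(() => {
    liveRegion.textContent = message;
  });
}
```

The first idea would be a variant of the same pattern: skip the live-region write entirely and play a longer pre-recorded clip of the report itself.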
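
And this is roughly what I mean by '"polite" on steroids': a small announcer that queues diegetic messages and releases them one at a time with a minimum gap, instead of writing to the live region every time the simulation produces an event. Again only a sketch; the 2.5-second gap is a guess, and sensible pacing would have to come from testing.

```typescript
// Sketch of the choreography idea: queue world reports and release them at a
// fixed minimum interval, so they arrive as discrete, paced utterances rather
// than a continuous stream competing with UI label readings.
class DiegeticAnnouncer {
  private queue: string[] = [];
  private timer: number | undefined = undefined;

  constructor(
    private region: HTMLElement,
    private minGapMs = 2500, // guessed pacing, to be tuned with testers
  ) {}

  announce(message: string): void {
    this.queue.push(message);
    if (this.timer === undefined) {
      this.flush();
    }
  }

  private flush(): void {
    const next = this.queue.shift();
    if (next === undefined) {
      this.timer = undefined; // queue drained; next announce() restarts it
      return;
    }
    this.region.textContent = next;
    this.timer = window.setTimeout(() => this.flush(), this.minGapMs);
  }
}

// Usage (element ID is made up):
const announcer = new DiegeticAnnouncer(
  document.getElementById('world-events') as HTMLElement,
);
announcer.announce('The patient has opened his eyes.');
```

A variant could also drop or merge stale messages while they wait in the queue, which matters once the world updates more than once per second.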

Can anyone report on whether any of these are obvious canards, and perhaps suggest some other areas of exploration?