Chapter 3. It’s Alive: Rich Content Accessibility

It’s now time to turn our attention to the features that EPUB 3 introduces to expand on the traditional reading experience; the excitement around EPUB 3 doesn’t come from text and images, after all.

This section takes focus on the dynamic aspects of EPUB 3. Rich media, audio integration, and scripted interactivity are all new features that have been added in this version. Some of these features, like audio and video support and scripting, introduce new accessibility challenges, while others, like overlaying audio on your text content and enhancing text-to-speech rendering, improve access for all. (The members of this latter group are also commonly referred to as accessibility superstructures, because they are added on top of core EPUB content to enhance accessibility.)

But let’s get back to business of being accessible…

The Sound and the Fury: Audio and Video

The new built-in support for audio and video in EPUB 3 has its pros and cons from both mainstream and accessibility perspectives. The elements simplify the process of embedding multimedia, but come at the expense of complicating interoperability, and by extension accessibility—specifically as relates to video.

There is currently no solution for the general accessibility problem of video, namely that not all reading systems may natively play your content. The video element permits any video format to be specified, but not all reading systems will support all formats. Support for one or both of the VP8 codec (used in WebM video) and H.264 codec is encouraged in the EPUB specification to improve interoperability, but you still have to be aware that if you have an EPUB with WebM video and a reading system that only supports H.264 you won’t be able to view the video.

Until consensus on codec and container support can be found, there is no easy solution to this problem. You can try targeting your video format to the distribution channel, but that assumes that the readers buying from the online bookstore will use the reading system you expect, which isn’t a given. Even seemingly-simple solutions, like duplicating the format of all video content, are only feasible on small scales and don’t take into account the potential cost involved.

But that was more of an aside to say that if you don’t think accessibility is worth your time and effort, consider there may be a larger audience than you might expect that could be relying on your fallbacks in the near term.

Playability issues aside, though, HTML5 is still a leap forward in terms of multimedia support. It’s fair to assert that no one will miss plugins for rendering audio and video content, certainly not on the accessibility side of the fence. From roach-motel players that let you navigate in but never let you leave to players lacking keyboard support to utter black holes, the accessibility community typically does not have a lot of good things to say about multimedia as deployed on the Web.

That the new native elements can be controlled by the reading system in EPUB 3 should translate into greater accessibility, however. To enable the default system controls, you need only add a controls attribute to the element:

<video … controls="controls">

That these native controls vary in appearance from reading system to reading system, however, leads to a natural tendency to script custom players. There’s nothing wrong from an accessibility perspective in doing so, so long as your developers are fluent with WAI-ARIA and ensure the custom controls are fully accessible.

But if you do create a custom control set using JavaScript, ensure that you still enable the native browser controls in the audio and video elements by default. If you don’t, only readers with JavaScript-enabled systems will be able to access the audio or video content, and maybe only some of them. Depending on what scripting capabilities are available, even script-enabled systems may not render your controls. You can always disable the native controls in your JavaScript code if the system supports your custom controls.

Timed Tracks

Improved access to the content and the playback controls is only one half of the problem; your content still needs to be accessible to be useful. To this end, both the audio and video elements allow timed text tracks to be embedded using the HTML5 track element.

If you’re wondering what timed text tracks are, though, you’re probably more familiar with their practical names, like captions, subtitles, and descriptions. A timed track provides the instructions on how to synchronize text (or its rendering) with an audio or video resource: to overlay text as a video plays, to include synthesized voice descriptions, to provide signed descriptions, to allow navigation within the resource, etc.

As I touched on when talking about accessibility at the start of the guide, don’t underestimate the usefulness of subtitles and captions. They are not a niche accessibility need. There are many cases where a reader would prefer not to be bothered with the noise while reading, are reading in an environment where it would bother others to enable sound, or are unable to hear clearly or accurately what is going on because of background noise (e.g., on a subway, bus, or airplane). The irritation they will feel at having to return to the video later when they are in a more amenable environment pales next to someone who is not provided any access to that information.

It probably bears repeating at this point, too, that subtitles and captions are not the same thing, and both have important uses that necessitate their inclusion. Subtitles provide the dialogue being spoken, whether in the same language as in the video or translated, and there’s typically an assumption the reader is aware which person is speaking. Captions, however, are descriptive and provide ambient and other context useful for someone who can’t hear what else might be going on in the video in addition to the dialogue (which typically will shift location on the screen to reflect the person speaking).

A typical aside at this point would be to show a simple example of how to create one of these tracks using one of the many available technologies, but plenty of these kinds of examples abound on the Web. Understanding a bit of the technology is not a bad thing, but, similar to writing effective descriptions for images, the bigger issue is having the experience and knowledge about the target audience to create meaningful and useful captions and descriptions. These issues are outside the realm of EPUB 3, so the only advice I’ll give is if you don’t have the expertise, engage those who do. Transcription costs are probably much less than you’d expect, especially considering the small amounts of video and audio ebooks will likely include.

We’ll instead turn our attention to how these tracks can be attached to your audio or video content using the track element. The following example shows a subtitle and caption track being added to a video:

<video width="320" height="180" controls="controls">
    <source src="video/v001.webm" type='video/webm; codecs="vp8, vorbis"'/>
    <track
        kind="subtitles"
        src="video/captions/en/v001.vtt"
        srclang="en"
        label="English"/>
    <track
        kind="captions"
        src="video/captions/en/v001.cc.vtt"
        srclang="en"
        label="English"/>
</video>

The first three attributes on the track element provide information about the relation to the referenced video resource: the kind attribute indicates the nature of the timed track you’re attaching; the src attribute provides the location of the timed track in the EPUB container; and the srclang attribute indicates the language of that track.

The label attribute differs in that it provides the text to render when presenting the options the reader can select from. The value, as you might expect, is that you aren’t limited to a single version of any one type of track so long as each has a unique label. We could expand our previous example to include translated French subtitles as follows:

<video width="320" height="180" controls="controls">
    <source src="video/v001.webm" type='video/webm; codecs="vp8, vorbis"'/>
    <track
        kind="subtitles"
        src="video/captions/en/v001.vtt"
        srclang="en"
        label="English"/>
    <track
        kind="captions"
        src="video/captions/en/v001.cc.vtt"
        srclang="en"
        label="English"/>
    <track
        kind="subtitles"
        src="video/captions/fr/v001.vtt"
        srclang="fr"
        label="Fran&#xE7;ais"/>
</video>

I’ve intentionally only used the language name for the label here to highlight one of the prime deficiencies of the track element for accessibility purposes, however. Different disabilities have different needs, and how you caption a video for someone who is deaf is not necessarily how you might caption it for someone with cognitive disabilities, for example.

The weak semantics of the label attribute are unfortunately all that is available to convey the target audience. The HTML5 specification, for example, currently includes the following track for captions (fixed to be XHTML-compliant):

<track
    kind="captions"
    src="brave.en.hoh.vtt"
    srclang="en"
    label="English for the Hard of Hearing"/>

You can match the kind of track and language to a reader’s preferences, but you can’t make finer distinctions about who is the intended audience without reading the label. Machines not only haven’t mastered the art of reading, but native speakers find many ways to say the same thing, scuttling heuristic tests.

The result is that reading systems are going to be limited in terms of being able to automatically enable the appropriate captioning for any given user. In reality, getting one caption track would be a huge step forward compared to the Web, but it takes away a tool from those who do target these reader groups and introduces a frustration for the readers in that they have to turn on the proper captioning for each video.

I mentioned the difference between subtitles and captions at the outset, but the kind attribute can additionally take the following two values of note:

descriptions — specifying this value indicates that the track contains a text description of the video. A descriptions track is designed to provide missing information to readers who can hear the audio but not see the video (which includes blind and low-vision readers, but also anyone for whom the video display is obscured or not available). The track is intended to be voiced by a text-to-speech engine.
chapters — a chapters track includes navigational aid within the resource. If your audio or video is structured in a meaningful way (e.g., scenes), adding a chapters track will enable readers of all abilities to more easily navigate through it.

But now I’m going to trip you up a bit. The downside of the track element that I’ve been trying to hold off on is that it remains unsupported in browser cores as of writing (at least natively), which means EPUB readers also may not support tracks right away. There are some JavaScript libraries that claim to be able to provide support now (polyfills, as they’re colloquially called), but that assumes the reader has a JavaScript-enabled reading system.

Embedding the tracks directly in your video resources is, of course, another option if native support does not materialize right away.