We’ll instead turn our attention to how these tracks can be attached to your audio or video content using the track element. The following example shows a subtitle track and a caption track being added to a video:

<video width="320" height="180" controls="controls">
    <source src="video/v001.webm" type='video/webm; codecs="vp8, vorbis"'/>
    <track
        kind="subtitles"
        src="video/captions/en/v001.vtt"
        srclang="en"
        label="English"/>
    <track
        kind="captions"
        src="video/captions/en/v001.cc.vtt"
        srclang="en"
        label="English"/>
</video>

The first three attributes on the track element provide information about the relation to the referenced video resource: the kind attribute indicates the nature of the timed track you’re attaching; the src attribute provides the location of the timed track in the EPUB container; and the srclang attribute indicates the language of that track.

The label attribute differs in that it provides the text to render when presenting the options the reader can select from. The benefit, as you might expect, is that you aren’t limited to a single version of any one type of track so long as each has a unique label. We could expand our previous example to include translated French subtitles as follows:

<video width="320" height="180" controls="controls">
    <source src="video/v001.webm" type='video/webm; codecs="vp8, vorbis"'/>
    <track
        kind="subtitles"
        src="video/captions/en/v001.vtt"
        srclang="en"
        label="English"/>
    <track
        kind="captions"
        src="video/captions/en/v001.cc.vtt"
        srclang="en"
        label="English"/>
    <track
        kind="subtitles"
        src="video/captions/fr/v001.vtt"
        srclang="fr"
        label="Fran&#xE7;ais"/>
</video>

I’ve intentionally used only the language name for the label here, however, to highlight one of the prime deficiencies of the track element for accessibility purposes. Different disabilities have different needs, and how you caption a video for someone who is deaf is not necessarily how you might caption it for someone with cognitive disabilities, for example.

The weak semantics of the label attribute are unfortunately all that is available to convey the target audience. The HTML5 specification, for example, currently includes the following track for captions (fixed to be XHTML-compliant):

<track
    kind="captions"
    src="brave.en.hoh.vtt"
    srclang="en"
    label="English for the Hard of Hearing"/>

You can match the kind of track and language to a reader’s preferences, but you can’t make finer distinctions about who the intended audience is without reading the label. Not only have machines not mastered the art of reading, but native speakers find many ways to say the same thing, scuttling heuristic tests.
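
A rough sketch of that matching step, written in TypeScript, might look like the following. The TrackInfo and ReaderPreference shapes, the matchTracks function, and the second caption label are all assumptions made up for illustration; none of them come from the HTML5 specification or from any actual reading system.

```typescript
// Hypothetical shapes for illustration only; not part of any reading system API.
interface TrackInfo {
  kind: "subtitles" | "captions" | "descriptions" | "chapters" | "metadata";
  srclang: string;
  label: string;
}

interface ReaderPreference {
  kind: TrackInfo["kind"];
  language: string;
}

// Return every track whose kind and language match the reader's preference.
// Language tags are compared case-insensitively; the label is never inspected.
function matchTracks(tracks: TrackInfo[], pref: ReaderPreference): TrackInfo[] {
  const wanted = pref.language.toLowerCase();
  return tracks.filter(
    (t) => t.kind === pref.kind && t.srclang.toLowerCase() === wanted
  );
}

const tracks: TrackInfo[] = [
  { kind: "captions", srclang: "en", label: "English for the Hard of Hearing" },
  // Assumed second caption track aimed at a different audience.
  { kind: "captions", srclang: "en", label: "English (simplified)" },
  { kind: "subtitles", srclang: "fr", label: "Français" },
];

const candidates = matchTracks(tracks, { kind: "captions", language: "en" });

// Both English caption tracks survive the filter; only the human-readable
// labels distinguish their intended audiences.
console.log(candidates.map((t) => t.label));
```

Both English caption tracks pass the filter, so the sketch has no principled way to choose between them. Any further narrowing would have to guess at the meaning of the label text.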

The result is that reading systems are going to be limited in their ability to automatically enable the appropriate captioning for any given user. In reality, getting even one caption track would be a huge step forward compared to the Web, but this limitation takes away a tool from those who do target these reader groups and frustrates readers, who have to turn on the proper captioning for each video.