IBM Watson Speech Services for Web Browsers
Allows you to easily add voice recognition and synthesis to any web app with minimal code.
Warning: This library still has a few rough edges and may yet see breaking changes.
For Web Browsers Only: This library is primarily intended for use in browsers. Check out watson-developer-cloud to use Watson services (speech and others) from Node.js.
However, a server-side component is required to generate auth tokens. The examples/ folder includes example Node.js and Python servers. SDKs are available for Node.js, Java, and Python, and there is also a REST API.
See several examples at https://github.com/watson-developer-cloud/speech-javascript-sdk/tree/master/examples
This library is built with browserify and is easy to use in browserify-based projects (npm install --save watson-speech), but you can also grab the compiled bundle from the dist/ folder and use it as a standalone library.
Basic API
Complete API docs should be published at http://watson-developer-cloud.github.io/speech-javascript-sdk/
All API methods require an auth token that must be generated server-side. (See examples/token-server.js for a basic example.)
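A minimal sketch of obtaining a token in the browser before calling any API method. The endpoint path and the plain-text response body are assumptions for illustration only; adapt them to whatever your own token server actually exposes. The getToken name and its fetchImpl parameter are hypothetical and not part of this library.

```javascript
// Fetch an auth token from your own server.
// ASSUMPTIONS (not part of this library): the endpoint path supplied by the
// caller, and a plain-text response body that contains only the token.
function getToken(tokenUrl, fetchImpl) {
  // Fall back to the global fetch when no implementation is supplied.
  var doFetch = fetchImpl || fetch;
  return doFetch(tokenUrl).then(function (response) {
    if (!response.ok) {
      throw new Error('Token request failed with status ' + response.status);
    }
    return response.text();
  });
}

// Browser usage, with a hypothetical '/token' endpoint:
// getToken('/token').then(function (token) {
//   WatsonSpeech.TextToSpeech.synthesize({ text: 'Hello', token: token });
// });
```

Tokens are requested from your own server so that service credentials never reach the browser.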
WatsonSpeech.TextToSpeech
.synthesize({text, token}) -> <audio>
Speaks the supplied text through an automatically-created <audio> element.
Currently limited to text that can fit within a GET URL. This is particularly an issue on Internet Explorer before Windows 10, where the max length is around 1000 characters after the token is accounted for.
Options:
- text - the text to speak // todo: list supported languages
- voice - the desired playback voice's name - see .getVoices(). Note that the voices are language-specific.
- autoPlay - set to false to prevent the audio from automatically playing
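The options above could be combined roughly as follows. This is a sketch rather than a definitive recipe: prepareSpeech is a hypothetical helper, and lib stands for the WatsonSpeech object however it was loaded (standalone bundle global or browserify require). Any voice name passed in should come from .getVoices().

```javascript
// Build synthesize() options and return the <audio> element it creates.
// ASSUMPTION: `lib` is the WatsonSpeech object; it is passed in so the helper
// does not depend on how the library was loaded.
function prepareSpeech(lib, token, text, voice) {
  var options = {
    text: text,
    token: token,
    autoPlay: false // do not start playback until the caller decides to
  };
  if (voice) {
    options.voice = voice; // a voice name, e.g. one returned by getVoices()
  }
  return lib.TextToSpeech.synthesize(options);
}

// Browser usage (the '#speak' element id is hypothetical):
// var audio = prepareSpeech(WatsonSpeech, token, 'Hello world');
// document.querySelector('#speak').onclick = function () { audio.play(); };
```

Setting autoPlay to false and calling play() from a click handler keeps playback under the user's control.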
.getVoices() -> Promise
Returns a promise that resolves to an array of objects containing the name, language, gender, and other details for each voice.
Requires window.fetch; a polyfill is needed for IE/Edge and older Chrome/Firefox.
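Because voices are language-specific, a common first step is to narrow the list to one language. The sketch below assumes each voice object exposes the name and language fields described above. It also assumes that the token is passed to getVoices() as an option: the signature above shows no arguments, but all API methods are stated to require a token, so check the API docs. The voiceNamesForLanguage helper is hypothetical.

```javascript
// Resolve to the names of all voices available for one language.
// ASSUMPTIONS: `lib` is the WatsonSpeech object, getVoices() accepts
// {token: ...}, and each voice has `name` and `language` properties.
function voiceNamesForLanguage(lib, token, language) {
  return lib.TextToSpeech.getVoices({ token: token }).then(function (voices) {
    return voices
      .filter(function (voice) { return voice.language === language; })
      .map(function (voice) { return voice.name; });
  });
}

// Browser usage:
// voiceNamesForLanguage(WatsonSpeech, token, 'en-US').then(function (names) {
//   console.log(names);
// });
```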
WatsonSpeech.SpeechToText
.recognizeMicrophone({token}) -> RecognizeStream
Options:
- keepMic: if true, preserves the MicrophoneStream for subsequent calls, preventing additional permissions requests in Firefox
- Other options passed to MediaElementAudioStream and RecognizeStream
- Other options passed to WritableElementStream if options.outputElement is set
Requires the getUserMedia API, so limited browser compatibility (see http://caniuse.com/#search=getusermedia)
Also note that Chrome requires https (with a few exceptions for localhost and such) - see https://www.chromium.org/Home/chromium-security/prefer-secure-origins-for-powerful-new-features
Pipes results through a {FormatStream} by default; set options.format=false to disable.
Known issue: Firefox continues to display a microphone icon in the address bar after recording has ceased. This is a browser bug.
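A sketch of how the microphone options above might fit together. startDictation is a hypothetical helper, lib stands for the WatsonSpeech object, and the value given for outputElement is whatever your page uses to display text. The error listener relies only on the standard Node.js stream error event.

```javascript
// Start microphone recognition and write the text into a page element.
// ASSUMPTIONS: `lib` is the WatsonSpeech object; `outputElement` identifies
// the element that should receive the text.
function startDictation(lib, token, outputElement, rawText) {
  var options = {
    token: token,
    keepMic: true, // reuse the microphone on later calls (fewer Firefox prompts)
    outputElement: outputElement // text is written here via WritableElementStream
  };
  if (rawText) {
    options.format = false; // skip the default FormatStream
  }
  var stream = lib.SpeechToText.recognizeMicrophone(options);
  stream.on('error', function (err) {
    console.error('Recognition error:', err);
  });
  return stream;
}

// Browser usage (element ids are hypothetical):
// var stream = startDictation(WatsonSpeech, token, document.querySelector('#output'));
// document.querySelector('#stop').onclick = function () { stream.stop(); };
```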
.recognizeFile({data, token}) -> RecognizeStream
Can recognize and optionally attempt to play a File or Blob
(such as from an <input type="file"/> or from an ajax request.)
Options:
- data: a Blob or File instance.
- play: (optional, default=false) Attempt to also play the file locally while uploading it for transcription
- Other options passed to RecognizeStream
- Other options passed to WritableElementStream if options.outputElement is set
play requires that the browser support the format; most browsers support wav and ogg/opus, but not flac.
Will emit a playback-error on the RecognizeStream if playback fails.
Playback will automatically stop when .stop() is called on the RecognizeStream.
Pipes results through a {TimingStream} by default if options.play=true; set options.realtime=false to disable.
Pipes results through a {FormatStream} by default; set options.format=false to disable.
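The file options above could be used roughly like this. transcribeAndPlay is a hypothetical helper, and the sketch assumes the returned stream surfaces the playback-error event as described above. The file would typically come from an <input type="file"/> element.

```javascript
// Upload a File or Blob for transcription and try to play it locally.
// ASSUMPTION: `lib` is the WatsonSpeech object, passed in by the caller.
function transcribeAndPlay(lib, token, file, onPlaybackError) {
  var stream = lib.SpeechToText.recognizeFile({
    data: file,
    token: token,
    play: true // results are slowed to match the audio unless realtime is false
  });
  // Playback can fail (for example, an unsupported format such as flac).
  stream.on('playback-error', onPlaybackError);
  return stream;
}

// Browser usage (the '#audiofile' element id is hypothetical):
// var input = document.querySelector('#audiofile');
// var stream = transcribeAndPlay(WatsonSpeech, token, input.files[0], function (err) {
//   console.warn('Could not play the file locally:', err);
// });
```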
Class RecognizeStream()
A Node.js-style stream of the final text, with some helpers and extra events built in.
RecognizeStream is generally not instantiated directly but rather returned as the result of calling one of the recognize* methods.
The RecognizeStream waits until after receiving data to open a connection.
If no content-type option is set, it will attempt to parse the first chunk of data to determine type.
See speech-to-text/recognize-stream.js for other options.
Methods
- .promise(): returns a promise that will resolve to the final text. Note that you must either set continuous: false or call .stop() on the stream to make the promise resolve in a timely manner.
- .stop(): stops the stream. No more data will be sent, but the stream may still receive additional results with the transcription of already-sent audio. The standard close event will fire once the underlying websocket is closed, and end once all of the data is consumed.
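Since the promise only resolves in a timely manner once the stream is stopped, one option is to stop it after a fixed duration. The sketch below uses only the .promise() and .stop() methods described above; transcribeFor and its time limit are hypothetical.

```javascript
// Resolve to the final text, stopping the stream after `milliseconds`.
function transcribeFor(stream, milliseconds) {
  var timer = setTimeout(function () {
    stream.stop(); // no more audio is sent; pending results still arrive
  }, milliseconds);
  function cancelTimer() {
    clearTimeout(timer);
  }
  var result = stream.promise();
  // Clear the timer whether the promise resolves or rejects.
  result.then(cancelTimer, cancelTimer);
  return result;
}

// Browser usage:
// var stream = WatsonSpeech.SpeechToText.recognizeMicrophone({ token: token });
// transcribeFor(stream, 5000).then(function (text) {
//   console.log(text);
// });
```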
Events
Follows standard Node.js stream events, in particular:
- data: emits either final Strings or final/interim result objects, depending on whether the stream is in objectMode
- end: emitted once all data has been consumed.
(Note: there are several custom events, but they are deprecated or intended for internal usage)
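For callers who prefer events over .promise(), the data and end events can be combined as sketched below. This assumes the stream is not in objectMode, so each chunk is text. collectText is a hypothetical helper.

```javascript
// Gather every chunk of final text and report the whole transcript at the end.
// ASSUMPTION: the stream is not in objectMode.
function collectText(stream, onDone) {
  var parts = [];
  stream.on('data', function (chunk) {
    parts.push(String(chunk)); // String() covers both string and Buffer chunks
  });
  stream.on('end', function () {
    onDone(parts.join(''));
  });
}

// Browser usage:
// var stream = WatsonSpeech.SpeechToText.recognizeMicrophone({ token: token });
// collectText(stream, function (transcript) {
//   console.log(transcript);
// });
```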
Class FormatStream()
Pipe a RecognizeStream to a format stream, and the resulting text and results events will have basic formatting applied:
- Capitalize the first word of each sentence
- Add a period to the end
- Fix any "cruft" in the transcription
- A few other tweaks for Asian languages and such.
Inherits .promise() from the RecognizeStream.
Class TimingStream()
For use with .recognizeFile({play: true}) - slows the results down to match the audio. Pipe in the RecognizeStream (or FormatStream) and listen for results as usual.
Inherits .promise() from the RecognizeStream.
Class WritableElementStream()
Accepts input from RecognizeStream() and friends, writes text to supplied outputElement.
Changelog
v0.15
- Removed SpeechToText.recognizeElement() due to quality issues
- Added options.element to TextToSpeech.synthesize() to support playing through existing elements
- Fixed a couple of bugs in the TimingStream
v0.14
- Moved getUserMedia shim to a standalone library
- added a Python token server example
v0.13
- Fixed bug where continuous: false didn't close the microphone at end of recognition
- Added keepMic option to recognizeMicrophone() to prevent multiple permission popups in Firefox
v0.12
- Added autoPlay option to synthesize()
- Added proper parameter filtering to synthesize()
v0.11
- renamed recognizeBlob to recognizeFile to make the primary usage more apparent
- Added support for <input> and <textarea> elements when using the targetElement option (or a WritableElementStream)
- For objectMode, changed defaults for word_confidence to false, alternatives to 1, and timing to off unless required for realtime option.
- Fixed bug with calling .promise() on objectMode streams
- Fixed bug with calling .promise() on recognizeFile({play: true})
v0.10
- Added ability to write text directly to targetElement, updated examples to use this
- converted examples from jQuery to vanilla JS (w/ fetch polyfill when necessary)
- significantly improved JSDoc
v0.9
- Added basic text to speech support
v0.8
- deprecated result events in favor of objectMode.
- renamed the autoplay option to autoPlay on recognizeElement() (capital P)
v0.7
- Changed playFile option of recognizeBlob() to just play, corrected default
- Added options.format=true to recognize*() to pipe text through a FormatStream
- Added options.realtime=options.play to recognizeBlob() to automatically pipe results through a TimingStream when playing locally
- Added close and end events to TimingStream
- Added delay option to TimingStream
- Moved compiled binary to GitHub Releases (in addition to uncompiled source on npm).
- Misc. doc and internal improvements
todo
- Solidify API
- break components into standalone npm modules where it makes sense
- run integration tests on travis (fall back to offline server for pull requests)
- more tests in general
- better cross-browser testing (IE, Safari, mobile browsers - maybe saucelabs?)
- update node-sdk to use current version of this lib's RecognizeStream (and also provide the FormatStream + anything else that might be handy)
- move result and results events to node wrapper (along with the deprecation notice)
- improve docs
- consider a wrapper to match https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
- support a "hard" stop that prevents any further data events, even for already uploaded audio, ensure timing stream also implements this.
- look for bug where single-word final results may omit word confidence (possibly due to FormatStream?)
- fix bug where TimingStream shows words slightly before they're spoken
- support jquery objects for element and targetElement