# Using Vosk grammars

- [Sentence-based speech-to-text, specifying a grammar](#sentence-based-speech-to-text-specifying-a-grammar)
- [💡 Stateful & low latency ASR. Proposed architecture](#-stateful--low-latency-asr-proposed-architecture)


## Sentence-based speech-to-text, specifying a grammar

[`grammar.js`](grammar.js) is a basic demo that runs the Vosk recognizer with a specified grammar.
The output structure now includes a list of alternatives.

```bash
node grammar
```
```
$ node grammar.js
model directory      : ../models/vosk-model-small-en-us-0.15
speech file name     : ../audio/2830-3980-0043.wav
grammar              : experience proves this,why should one hold on the way,your power is sufficient i said,oh one two three four five six seven eight nine zero,[unk]
load model latency   : 328ms
{
  alternatives: [
    {
      confidence: 197.583099,
      result: [
        { end: 1.02, start: 0.36, word: 'experience' },
        { end: 1.35, start: 1.02, word: 'proves' },
        { end: 1.98, start: 1.35, word: 'this' }
      ],
      text: ' experience proves this'
    }
  ]
}
transcript latency : 118ms
```
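Since the result now carries an `alternatives` array, a caller usually has to pick one entry. A minimal sketch of that step follows; `bestAlternative` is a hypothetical helper (not part of Vosk), and it assumes that a larger `confidence` value means a better hypothesis.

```javascript
// Hypothetical helper: pick the most confident alternative from a
// result shaped like the output above.
// Assumption: a larger `confidence` value means a better hypothesis.
function bestAlternative(result) {
  const alternatives = (result && result.alternatives) || [];
  if (alternatives.length === 0) return null;

  const best = alternatives.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  return {
    text: best.text.trim(), // Vosk prefixes the text with a space
    confidence: best.confidence,
    words: best.result || [], // per-word timings, if present
  };
}

// Sample copied from the demo output above.
const sample = {
  alternatives: [
    {
      confidence: 197.583099,
      result: [
        { end: 1.02, start: 0.36, word: 'experience' },
        { end: 1.35, start: 1.02, word: 'proves' },
        { end: 1.98, start: 1.35, word: 'this' },
      ],
      text: ' experience proves this',
    },
  ],
};

console.log(bestAlternative(sample).text); // prints "experience proves this"
```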

IMPORTANT:
**Latency is very low when grammar sentences are provided!**

See details here: 
- https://github.com/alphacep/vosk-api/blob/master/nodejs/index.js#L198
- https://github.com/alphacep/vosk-api/blob/91a128b3edf7e84d55649d8fa9a60664b5386292/src/vosk_api.h#L114
- https://github.com/alphacep/vosk-api/issues/500
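
As a rough illustration of how a grammar could be prepared and passed to the recognizer, here is a minimal sketch. `toGrammar` and `createGrammarRecognizer` are hypothetical helpers; the normalization rules (lowercase, English letters only) are my assumption about what the small English model expects. The `grammar` array option of `new vosk.Recognizer(...)` is used as I understand the Node.js binding linked above, so check it against your installed version.

```javascript
// Hypothetical helper: turn application phrases into a grammar list.
// Assumption: the model vocabulary is lowercase English without punctuation.
function toGrammar(phrases, { allowUnknown = true } = {}) {
  const unique = new Set();
  for (const phrase of phrases) {
    const normalized = phrase
      .toLowerCase()
      .replace(/[^a-z' ]+/g, ' ') // keep letters, apostrophes and spaces
      .replace(/\s+/g, ' ')
      .trim();
    if (normalized) unique.add(normalized);
  }
  const grammar = [...unique];
  // '[unk]' lets the recognizer report out-of-grammar speech
  if (allowUnknown) grammar.push('[unk]');
  return grammar;
}

// Hypothetical wiring: `vosk` is the required 'vosk' module and `model`
// an already loaded vosk.Model. Not executed here.
function createGrammarRecognizer(vosk, model, phrases, sampleRate = 16000) {
  return new vosk.Recognizer({ model, sampleRate, grammar: toGrammar(phrases) });
}

console.log(toGrammar(['Experience proves this!', 'Your power is sufficient, I said']));
```

Deduplicating and normalizing up front keeps the grammar passed to each recognizer small and predictable.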

What follows is not an issue report, just a question/discussion for anyone interested in the proposed architecture.

**Preamble about latencies**

Vosk decoding is very fast. On my PC, for short utterances (a few words), I measured:

1. Using models that allow grammars (e.g. the pretrained model model-small-en-us-0.15):
   - If I DO NOT specify any grammar, latency is ~500-600 ms.
   - If I DO specify a grammar (even a fairly long one), latency drops to a few tens of ms (`<<` 100 ms).
2. Using a large / static graph model (e.g. vosk-model-en-us-aspire-0.2), latency is ~400-500 ms (with better accuracy for open-domain utterances).
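
Figures like these can be collected with a small timing wrapper. The sketch below is one possible way to do it, not necessarily how `grammar.js` measures latency; `timed` is a hypothetical helper, and the Vosk calls appear only in comments as an example of use.

```javascript
const { performance } = require('node:perf_hooks');

// Hypothetical helper: run a synchronous function, print and return the
// elapsed wall-clock time in milliseconds together with its result.
function timed(label, fn) {
  const start = performance.now();
  const value = fn();
  const ms = performance.now() - start;
  console.log(`${label.padEnd(20)} : ${Math.round(ms)}ms`);
  return { value, ms };
}

// Possible use with Vosk (assumes `vosk`, `modelDirectory`, `grammar` and a
// PCM `buffer` are defined elsewhere):
//
//   const { value: model } = timed('load model latency', () => new vosk.Model(modelDirectory));
//   const rec = new vosk.Recognizer({ model, sampleRate: 16000, grammar });
//   timed('transcript latency', () => {
//     rec.acceptWaveform(buffer);
//     return rec.finalResult();
//   });

// Stand-in workload so the sketch runs without a model.
const { value } = timed('sum latency', () => {
  let total = 0;
  for (let i = 0; i < 1e6; i++) total += i;
  return total;
});
```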


## 💡 Stateful & low latency ASR. Proposed architecture

Considering a stateful (task-oriented, closed-domain) voice-assistant platform, I want to experiment with how much latency can be reduced by using a stateful ASR. My idea is to connect the Vosk ASR with a state-based dialog manager (such as my own open source [NaifJs](https://github.com/solyarisoftware/naifjs)).

Workflow:

1. Initialization phase:
   - Load a model that allows grammars (e.g. model-small-en-us-0.15).
   - Create N different Vosk Recognizers, one for each `grammar(N)` (one grammar for each `state(N)`).

2. Run-time (decoding time):
   - A "Decode Manager" decides which Recognizer is to be used, depending on the state injected by the dialog manager.
   - The Decode Manager could use a fallback Recognizer, based on the original model without a grammar specified, for a final decision.
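
The two phases above could be sketched roughly as follows. This is only an illustration of the idea, not an existing implementation: `DecodeManager`, the state names, and the fallback policy (retry without a grammar when the grammar recognizer returns nothing or `[unk]`) are all assumptions. Recognizers are created through an injected `createRecognizer` function so the sketch does not depend on a loaded model; it only expects objects exposing `acceptWaveform`, `finalResult` and `free`, as the Vosk Node.js recognizer does to my knowledge.

```javascript
// Extract the transcript from either the plain or the "alternatives" result shape.
function transcriptOf(result) {
  if (!result) return '';
  const top = Array.isArray(result.alternatives) ? result.alternatives[0] : result;
  return ((top && top.text) || '').trim();
}

function isUnknown(result) {
  const text = transcriptOf(result);
  return text === '' || text === '[unk]';
}

class DecodeManager {
  // grammars: { stateName: [sentence, ...] }
  // createRecognizer: (grammarOrNull) => recognizer (hypothetical factory)
  constructor({ grammars, createRecognizer }) {
    // Initialization phase: one recognizer per state, created once.
    this.recognizers = new Map();
    for (const [state, grammar] of Object.entries(grammars)) {
      this.recognizers.set(state, createRecognizer(grammar));
    }
    // "Recognizer 0": same model, no grammar.
    this.fallback = createRecognizer(null);
  }

  recognizerFor(state) {
    return this.recognizers.get(state) || this.fallback;
  }

  // Run-time: decode one complete utterance for the given dialog state.
  decode(state, pcmBuffer) {
    const primary = this.recognizerFor(state);
    primary.acceptWaveform(pcmBuffer);
    const result = primary.finalResult();
    if (primary === this.fallback || !isUnknown(result)) return result;

    // Assumed policy: nothing in the grammar matched, retry without grammar.
    this.fallback.acceptWaveform(pcmBuffer);
    return this.fallback.finalResult();
  }

  free() {
    for (const rec of this.recognizers.values()) rec.free();
    this.fallback.free();
  }
}

// Possible wiring with Vosk (assumes the 'vosk' package and a model on disk):
//
//   const vosk = require('vosk');
//   const model = new vosk.Model('../models/vosk-model-small-en-us-0.15');
//   const manager = new DecodeManager({
//     grammars: { askNumber: ['one two three', '[unk]'] },
//     createRecognizer: (grammar) => grammar
//       ? new vosk.Recognizer({ model, sampleRate: 16000, grammar })
//       : new vosk.Recognizer({ model, sampleRate: 16000 }),
//   });
```

Injecting the factory also makes the state-to-recognizer routing testable with fake recognizers, before any audio is involved.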

See the diagram:

```
                       state(S-1) -> grammar(S-1)
                      ┌────────────────────────────────────────────────────────────┐
                      │                                                            │
                      │                                                            │
                      │                                                            │
                      │       (1)                                                  │
           ┌──────────▼─────────┐                                                  │
           │                    │                                                  │
           │                    │                                (2)               │
           │                    │   ┌──────────────┐   ┌───────────┐               │
           │                    │   │              │   │           │               │
           │                    │   │ Grammar 1    │   │           │               │
           │                    ◄───┤ Recognizer 1 ◄───┤           │               │
           │                    │   │              │   │           │               │   (3)
           │                    │   │              │   │           │         ┌─────┴─────┐
           │                    │   └──────────────┘   │           │         │           │
           │                    │                      │           │         │           │
           │                    │   ┌──────────────┐   │           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   │ Grammar 2    │   │           │         │           │
           │                    ◄───┤ Recognizer 2 ◄───┤           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   └──────────────┘   │           │         │           │
pcm audio  │       DECODER      │                      │  MODEL    │         │  DIALOG   │
───────────►       MANAGER      │   ┌──────────────┐   │  ALLOWING │         │  MANAGER  ├───────►
           │                    │   │              │   │  GRAMMARS │         │           │
           │                    │   │ Grammar N    │   │           │         │           │
           │                    ◄───┤ Recognizer N ◄───┤           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   └──────────────┘   │           │         │           │
           │                    │                      │           │         │           │
           │                    │                      │           │         │           │
           │                    │                      │           │         │           │
           │                    │   ┌──────────────┐   │           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   │ No-Grammar   │   │           │         └─────▲─────┘
           │                    ◄───┤ Recognizer 0 ◄───┤           │               │
           │                    │   │              │   │           │               │
           │                    │   │              │   │           │               │
           │ ┌────────────────┐ │   └──────────────┘   └───────────┘               │
           │ │ acceptWaveform │ │                                                  │
           │ │                │ │                                                  │
           │ └───────┬────────┘ │                                                  │
           │         │          │                                                  │
           │         │          │                                                  │
           └─────────┼──────────┘                                                  │
                     │                                                             │
                     │                                                             │
                     │                                                             │
                     │                                                             │
                     └─────────────────────────────────────────────────────────────┘
                     decode result S
```

That approach would minimize the time spent in `new Recognizer`, although I noticed that this part of the latency is already very low (a few ms) when a grammar is specified,
whereas it increases to many tens of ms if a grammar is NOT specified.

See also: https://github.com/alphacep/vosk-api/issues/553


---

[top](#) | [back](README.md) | [home](../README.md)