# react-native-sherpa-onnx-offline-stt

A React Native library for **offline speech-to-text** using [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx). Runs entirely on-device with no internet connection required.

## Features

- **Offline STT** - Speech recognition runs locally on the device
- **Two modes**: Streaming (real-time) and Offline (VAD-triggered batch processing)
- **TEN-VAD** - Voice Activity Detection for accurate speech segmentation
- **Speaker Diarization** - Identify different speakers in a conversation
- **Speech Denoising** - GTCRN-based noise reduction
- **Background Recording** - Continue recording when the app is minimized
- **Performance Metrics** - RTFx, processing time, confidence scores
- **Streaming State** - Two-tier volatile/confirmed transcript updates

## Installation

```bash
npm install react-native-sherpa-onnx-offline-stt
# or
yarn add react-native-sherpa-onnx-offline-stt
```

### Android

Add to your `android/app/build.gradle`:

```gradle
android {
    packagingOptions {
        pickFirst '**/*.so'
    }
}

dependencies {
    implementation 'com.k2fsa.sherpa:sherpa-onnx:1.10.+'
}
```

### iOS

```bash
cd ios && pod install
```

## Models

Apart from the TEN-VAD model (which ships with the library), the models must be downloaded separately and placed on the device:

### STT Models

- **Streaming**: [Zipformer French](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-fr-2023-04-14.tar.bz2) (~128MB)
- **Offline**: [Parakeet TDT v3](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2) (~670MB)

### VAD Model

- [TEN-VAD](https://github.com/ten-framework/TEN-VAD) - Included in the library

### Speaker Diarization (Optional)

- [3D-Speaker](https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx) (~26MB)

### Denoiser (Optional)

- [GTCRN](https://github.com/k2-fsa/sherpa-onnx/releases/download/speech-enhancement-models/gtcrn_simple.onnx) (~524KB)
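
The single-file `.onnx` models above can be fetched at runtime. Below is a minimal sketch using `react-native-fs` (not a dependency of this library; the helper name and target directory are illustrative). Note that the STT models are `.tar.bz2` archives and must also be extracted before use.

```typescript
import RNFS from 'react-native-fs';

// Illustrative helper: download a single .onnx model into the app's
// document directory and return the local path to pass to initialize().
async function downloadModel(url: string, fileName: string): Promise<string> {
  const localPath = `${RNFS.DocumentDirectoryPath}/${fileName}`;
  if (!(await RNFS.exists(localPath))) {
    await RNFS.downloadFile({ fromUrl: url, toFile: localPath }).promise;
  }
  return localPath;
}

const denoiserModelPath = await downloadModel(
  'https://github.com/k2-fsa/sherpa-onnx/releases/download/speech-enhancement-models/gtcrn_simple.onnx',
  'gtcrn_simple.onnx'
);
```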

## Usage

```typescript
import STTManager from 'react-native-sherpa-onnx-offline-stt';
import type { STTResult, VADEvent, SpeakerEvent } from 'react-native-sherpa-onnx-offline-stt';

// Create manager instance
const sttManager = new STTManager();

// Initialize with configuration
await sttManager.initialize({
  modelPath: '/path/to/stt-model',
  tokensPath: '/path/to/tokens.txt',
  modelType: 'offline', // or 'streaming'
  vadModelPath: '/path/to/vad-model',
  sampleRate: 16000,

  // Structured VAD configuration
  vad: {
    threshold: 0.5,
    minSpeechDurationMs: 300,
    minSilenceDurationMs: 500,
    maxSpeechDurationMs: 30000,  // Force segment break after 30s
    speechPaddingMs: 100,
    mode: 'normal',  // 'aggressive' | 'normal' | 'sensitive'
  },

  // Optional features
  diarizationModelPath: '/path/to/speaker-model.onnx',
  diarizationThreshold: 0.55,
  denoiserModelPath: '/path/to/gtcrn_simple.onnx',
});

// Subscribe to events using chainable API
sttManager
  .on('transcript', (result: STTResult) => {
    console.log(`[Speaker ${result.speakerId}]: ${result.text}`);
    console.log(`RTFx: ${result.rtfx}, Processing: ${result.processingTime}s`);
  })
  .on('streaming', (update) => {
    // Two-tier streaming state
    console.log('Confirmed:', update.confirmed);  // Stable text
    console.log('Volatile:', update.volatile);    // May change
  })
  .on('vad', (event: VADEvent) => {
    console.log(`VAD: ${event.state}`);
  })
  .on('speaker', (event: SpeakerEvent) => {
    console.log(`Speaker ${event.speakerId} (${event.status})`);
  })
  .on('error', (error) => {
    console.error(`Error: ${error.code} - ${error.message}`);
  });

// Start recording
await sttManager.startRecording();

// Stop recording
const results = await sttManager.stopRecording();

// Clean up
await sttManager.deinitialize();
```

## API Reference

### STTManager Class

```typescript
const manager = new STTManager();
```

#### Properties

| Property | Type | Description |
|----------|------|-------------|
| `initialized` | `boolean` | Whether the engine is initialized |
| `recording` | `boolean` | Whether currently recording |

#### Methods

| Method | Returns | Description |
|--------|---------|-------------|
| `initialize(config)` | `Promise<void>` | Initialize STT engine |
| `startRecording()` | `Promise<void>` | Start microphone recording |
| `stopRecording()` | `Promise<STTResult[]>` | Stop and get final results |
| `recognizeFile(path)` | `Promise<STTResult[]>` | Transcribe an audio file (see example below) |
| `isRecordingAsync()` | `Promise<boolean>` | Check recording status |
| `getModelType()` | `Promise<ModelType>` | Get current mode |
| `getSpeakerCount()` | `Promise<number>` | Get detected speakers |
| `resetSpeakers()` | `Promise<void>` | Clear speaker profiles |
| `setDenoiserEnabled(bool)` | `Promise<boolean>` | Toggle denoiser |
| `isDenoiserEnabled()` | `Promise<boolean>` | Check denoiser status |
| `startBackgroundService()` | `Promise<boolean>` | Enable background recording |
| `stopBackgroundService()` | `Promise<boolean>` | Disable background recording |
| `deinitialize()` | `Promise<void>` | Clean up resources |
| `on(event, callback)` | `this` | Subscribe to events (chainable) |
| `off(event)` | `this` | Unsubscribe from events (chainable) |
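
For example, `recognizeFile` transcribes a pre-recorded file without starting the microphone (a sketch; the WAV path is illustrative and the engine must already be initialized):

```typescript
const results = await sttManager.recognizeFile('/path/to/recording.wav');
for (const result of results) {
  console.log(`[${result.startTime.toFixed(1)}s - ${result.endTime.toFixed(1)}s] ${result.text}`);
}
```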

#### Static Members

| Member | Type | Description |
|--------|------|-------------|
| `STTManager.getAvailableProviders()` | `Promise<DeviceProvidersInfo>` | Get available ONNX providers |
| `STTManager.platform` | `string` | Current platform (`'ios'` or `'android'`) |

### Configuration

```typescript
interface STTConfig {
  // Required
  modelPath: string;
  tokensPath: string;
  vadModelPath: string;

  // STT mode
  modelType?: 'streaming' | 'offline';  // Default: 'streaming'

  // VAD configuration
  vad: {
    threshold: number;              // 0.5 - Speech detection sensitivity
    minSpeechDurationMs: number;    // 300 - Min speech to trigger
    minSilenceDurationMs: number;   // 500 - Silence to end segment
    maxSpeechDurationMs: number;    // 30000 - Force break long speech
    speechPaddingMs: number;        // 100 - Padding around segments
    mode: 'aggressive' | 'normal' | 'sensitive';
  };

  // Audio
  sampleRate?: number;  // Default: 16000

  // Speaker diarization (optional)
  diarizationModelPath?: string;
  diarizationThreshold?: number;     // Default: 0.45
  diarizationMinSpeechMs?: number;   // Default: 800

  // Denoiser (optional)
  denoiserModelPath?: string;

  // ONNX provider
  provider?: 'cpu' | 'nnapi' | 'gpu' | 'coreml';  // Default: 'cpu'
}
```

### Events

#### transcript

```typescript
interface STTResult {
  text: string;
  isFinal: boolean;
  startTime: number;
  endTime: number;

  // Performance metrics
  confidence: number;      // 0-1 recognition confidence
  processingTime: number;  // Seconds to process
  audioDuration: number;   // Audio length in seconds
  rtfx: number;            // Real-time factor (>1 = faster than real-time)

  // Speaker info
  speakerId?: number;
  speakerStatus?: 'pending' | 'confirmed';
}
```
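
`rtfx` ties the other metrics together: it is roughly `audioDuration / processingTime`, so, for example, 10 seconds of audio processed in 0.5 seconds gives an RTFx of about 20, and any value above 1 means the segment was processed faster than real time.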

#### streaming

Two-tier transcript state for smoother UX:

```typescript
interface StreamingTranscriptUpdate {
  volatile: string;      // Current hypothesis (may change)
  confirmed: string;     // Stable text (won't change)
  fullText: string;      // confirmed + volatile
  isFinal: boolean;
  confidence: number;
  processingTime: number;
  rtfx: number;
}
```
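
A minimal sketch of consuming the two tiers in a React hook (the hook and state names are illustrative): render `confirmed` as settled text and the volatile part in a dimmed style, since only the latter can still be revised.

```typescript
import { useEffect, useState } from 'react';
import type STTManager from 'react-native-sherpa-onnx-offline-stt';

function useLiveTranscript(manager: STTManager) {
  const [confirmed, setConfirmed] = useState('');
  const [volatileText, setVolatileText] = useState('');

  useEffect(() => {
    manager.on('streaming', (update) => {
      setConfirmed(update.confirmed);   // stable, safe to persist
      setVolatileText(update.volatile); // may still change
    });
    return () => {
      manager.off('streaming');
    };
  }, [manager]);

  return { confirmed, volatileText };
}
```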

#### vad

```typescript
interface VADEvent {
  state: 'silence' | 'speech_start' | 'speech' | 'speech_end';
  speechProbability: number;
  speechDurationMs: number;
  silenceDurationMs: number;
}
```

#### speaker

```typescript
interface SpeakerEvent {
  speakerId: number;
  status: 'pending' | 'confirmed';
  justConfirmed: boolean;
  totalSpeakers: number;
}
```

#### error

```typescript
interface STTError {
  code: string;
  message: string;
}
```

### VAD Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| `aggressive` | Less sensitive, fewer false positives | Noisy environments |
| `normal` | Balanced sensitivity | General use |
| `sensitive` | More sensitive, catches quieter speech | Quiet environments |

## Streaming vs Offline Mode

| Feature | Streaming | Offline |
|---------|-----------|---------|
| Latency | Real-time partial results | Results after speech ends |
| Accuracy | Good | Better |
| Use case | Live captions | Meeting transcription |
| Models | Zipformer | Parakeet, Whisper |

## Background Recording

To keep recording while the app is in the background:

```typescript
// Before starting recording
await sttManager.startBackgroundService();
await sttManager.startRecording();

// When done
await sttManager.stopRecording();
await sttManager.stopBackgroundService();
```

A notification is shown while recording continues in the background.

## Provider Detection

Check which ONNX execution providers are available on the device:

```typescript
const info = await STTManager.getAvailableProviders();
console.log(`Device: ${info.manufacturer} ${info.device}`);
console.log(`Recommended: ${info.recommended}`);
console.log('Available:', info.providers.filter(p => p.available).map(p => p.name));
```
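
The recommendation can then be fed back into `initialize` (a sketch; `baseConfig` stands for the `STTConfig` shown in the Usage section, and it assumes `info.recommended` matches one of the accepted `provider` values):

```typescript
const info = await STTManager.getAvailableProviders();

await sttManager.initialize({
  ...baseConfig,              // modelPath, tokensPath, vadModelPath, vad, ...
  provider: info.recommended, // e.g. 'nnapi' on Android, 'coreml' on iOS
});
```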

## Platform Support

| Platform | Status |
|----------|--------|
| Android | Full support |
| iOS | Full support |

## License

MIT

## Credits

- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) - Speech recognition engine
- [TEN-VAD](https://github.com/ten-framework/TEN-VAD) - Voice activity detection
- [GTCRN](https://github.com/Xiaobin-Rong/gtcrn) - Speech enhancement
