# @steelbrain/media-speech-detection-web

Speech Detection using Silero VAD ONNX model for web browsers.

## Installation

```bash
npm install @steelbrain/media-speech-detection-web
```

**Modern Bundler Support**: This package is fully compatible with modern bundlers (Webpack 5, Next.js, Vite, etc.). The ONNX model file is automatically detected and bundled - no manual setup or public folder configuration required.

## Quick Start

```typescript
import { speechFilter, preloadModel } from '@steelbrain/media-speech-detection-web';
import { ingestAudioStream, RECOMMENDED_AUDIO_CONSTRAINTS } from '@steelbrain/media-ingest-audio';

// Optional: Preload model during app initialization for faster first use
await preloadModel();

// Get microphone access
const mediaStream = await navigator.mediaDevices.getUserMedia({
  audio: RECOMMENDED_AUDIO_CONSTRAINTS
});

// Create 16kHz audio stream
const audioStream = await ingestAudioStream(mediaStream);

// Filter audio to only speech chunks
const vadTransform = speechFilter({
  onSpeechStart: () => console.log('🎤 Speech started'),
  onSpeechEnd: () => console.log('🔇 Speech ended'),
  threshold: 0.5
});

await audioStream
  .pipeThrough(vadTransform)
  .pipeTo(speechProcessor);

// Events-only (no audio output) using .tee() pattern
const [processStream, eventsStream] = audioStream.tee();

// Process audio on one branch
processStream.pipeTo(speechProcessor);

// Handle events on another branch without outputting audio
eventsStream.pipeThrough(speechFilter({
  noEmit: true,  // Don't emit audio chunks
  onSpeechStart: () => console.log('🎤 Speech started'),
  onSpeechEnd: () => console.log('🔇 Speech ended'),
  onMisfire: () => console.log('⚠️ Short speech segment filtered')
}));
```

## API Reference

### `preloadModel(): Promise<void>`

Preloads the Silero VAD ONNX model by fetching it into browser cache, eliminating network delay when speech detection is first used.

**Usage**: `await preloadModel()` - Call during app initialization for optimal performance.

### `speechFilter(options): TransformStream<Float32Array, Float32Array>`

Creates a TransformStream that filters audio, outputting only speech chunks. Use the `noEmit` option for events-only processing.

**Usage**: `audioStream.pipeThrough(speechFilter(options)).pipeTo(processor)`

### Configuration Options

```typescript
interface VADOptions {
  // Event Handlers
  onSpeechStart?: () => void;
  onSpeechEnd?: (speechAudio: Float32Array) => void;
  onMisfire?: () => void;
  onError?: (error: Error) => void;
  onDebugLog?: (message: string) => void;

  // Detection Configuration
  threshold?: number;              // Speech detection threshold (0-1). Default: 0.5
  minSpeechDurationMs?: number;    // Minimum speech duration in ms. Default: 160ms
  redemptionDurationMs?: number;   // Grace period before confirming speech end. Default: 400ms
  lookBackDurationMs?: number;     // Lookback buffer for smooth speech start. Default: 384ms
  
  // Stream Control
  noEmit?: boolean;               // Don't emit chunks, only trigger callbacks. Default: false
}
```

### Optimal Defaults

The package provides carefully tuned defaults that work well for most use cases:

| Parameter | Default | Purpose |
|-----------|---------|---------|
| `threshold` | `0.5` | Balanced speech detection |
| `minSpeechDurationMs` | `160ms` | Filters out very short sounds |
| `redemptionDurationMs` | `400ms` | Handles natural speech pauses |
| `lookBackDurationMs` | `384ms` | Captures natural audio context before speech |

## Advanced Usage

### Error Handling & Debugging

```typescript
const vadTransform = speechFilter({
  onSpeechStart: () => console.log('🎤 Speech started'),
  onSpeechEnd: () => console.log('🔇 Speech ended'),
  onError: (error) => console.error('VAD Error:', error),
  onDebugLog: (message) => console.log('VAD Debug:', message),
  threshold: 0.6
});
```

### Real-time Speech Transcription Pipeline

```typescript
// Preload model during app startup
await preloadModel();

// Complete pipeline: microphone → VAD → transcription
await audioStream
  .pipeThrough(speechFilter({
    onSpeechStart: () => showRecordingIndicator(),
    onSpeechEnd: () => hideRecordingIndicator(),
    threshold: 0.5
  }))
  .pipeThrough(transcriptionTransform)
  .pipeTo(displayResults);
```

### Performance Optimization

```typescript
// Preload model early in your application lifecycle
window.addEventListener('load', async () => {
  try {
    await preloadModel();
    console.log('VAD model preloaded and cached');
  } catch (error) {
    console.warn('Failed to preload VAD model:', error);
  }
});
```

## How It Works

1. **Silero VAD Model**: Uses the pre-trained Silero VAD ONNX model for production-ready accuracy
2. **Audio Processing**: Processes 16kHz mono audio in 512-sample windows (32ms frames)
3. **State Machine**: Implements a sophisticated state machine with speech/intermediate/silent states
4. **Lookback Buffer**: Maintains a buffer to capture speech starts smoothly
5. **Temporal Smoothing**: Uses configurable timing thresholds to prevent false triggers
6. **Web Streams**: Built on modern Web Streams API for optimal performance and composability

## Model Details

- **Model**: [Silero VAD v4.0](https://github.com/snakers4/silero-vad) (MIT License)
- **Input**: 16kHz mono audio, 512 samples per inference (32ms windows)
- **Output**: Speech probability (0-1) per window + internal LSTM state
- **Model Size**: ~2.3MB ONNX format
- **Performance**: <1ms inference time per chunk on modern browsers
- **Accuracy**: Enterprise-grade performance across diverse acoustic conditions

## Credits

This package uses the [Silero VAD](https://github.com/snakers4/silero-vad) model developed by Silero Team, licensed under MIT License. The model provides state-of-the-art speech detection with excellent performance across various languages and acoustic conditions.

## License

MIT License - See LICENSE file for details.

**Silero VAD Model**: MIT License (© Silero Team)