# llama-cpp Capacitor Plugin

[![Actions Status](https://github.com/arusatech/llama-cpp/workflows/CI/badge.svg)](https://github.com/arusatech/llama-cpp/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![npm](https://img.shields.io/npm/v/llama-cpp-capacitor.svg)](https://www.npmjs.com/package/llama-cpp-capacitor/)

A native Capacitor plugin that embeds [llama.cpp](https://github.com/ggerganov/llama.cpp) directly into mobile apps, enabling offline AI inference with comprehensive support for text generation, multimodal processing, TTS, LoRA adapters, and more.

[llama.cpp](https://github.com/ggerganov/llama.cpp): Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

## 🚀 Features

- **Offline AI Inference**: Run large language models completely offline on mobile devices
- **Text Generation**: Complete text completion with streaming support
- **Chat Conversations**: Multi-turn conversations with context management
- **Multimodal Support**: Process images and audio alongside text
- **Text-to-Speech (TTS)**: Generate speech from text using vocoder models
- **LoRA Adapters**: Fine-tune models with LoRA adapters
- **Embeddings**: Generate vector embeddings for semantic search
- **Reranking**: Rank documents by relevance to queries
- **Session Management**: Save and load conversation states
- **Benchmarking**: Performance testing and optimization tools
- **Structured Output**: Generate JSON with schema validation
- **Cross-Platform**: iOS and Android support with native optimizations

## ✅ **Complete Implementation Status**

This plugin is now **FULLY IMPLEMENTED** with complete native integration of llama.cpp for both iOS and Android platforms. The implementation includes:

### **Completed Features**
- **Complete C++ Integration**: Full llama.cpp library integration with all core components
- **Native Build System**: CMake-based build system for both iOS and Android
- **Platform Support**: iOS (arm64, x86_64) and Android (arm64-v8a, armeabi-v7a, x86, x86_64)
- **TypeScript API**: Complete TypeScript interface matching llama.rn functionality
- **Native Methods**: All 30+ native methods implemented with proper error handling
- **Event System**: Capacitor event system for progress and token streaming
- **Documentation**: Comprehensive README and API documentation

### **Technical Implementation**
- **C++ Core**: Complete llama.cpp library with GGML, GGUF, and all supporting components
- **iOS Framework**: Native iOS framework with Metal acceleration support
- **Android JNI**: Complete JNI implementation with multi-architecture support
- **Build Scripts**: Automated build system for both platforms
- **Error Handling**: Robust error handling and result types

### **Project Structure**
```
llama-cpp/
├── cpp/                    # Complete llama.cpp C++ library
│   ├── ggml.c             # GGML core
│   ├── gguf.cpp           # GGUF format support
│   ├── llama.cpp          # Main llama.cpp implementation
│   ├── rn-llama.cpp       # React Native wrapper (adapted)
│   ├── rn-completion.cpp  # Completion handling
│   ├── rn-tts.cpp         # Text-to-speech
│   └── tools/mtmd/        # Multimodal support
├── ios/
│   ├── CMakeLists.txt     # iOS build configuration
│   └── Sources/           # Swift implementation
├── android/
│   ├── src/main/
│   │   ├── CMakeLists.txt # Android build configuration
│   │   ├── jni.cpp        # JNI implementation
│   │   └── jni-utils.h    # JNI utilities
│   └── build.gradle       # Android build config
├── src/
│   ├── definitions.ts     # Complete TypeScript interfaces
│   ├── index.ts           # Main plugin implementation
│   └── web.ts             # Web fallback
└── build-native.sh        # Automated build script
```

## 📦 Installation

```sh
npm install llama-cpp-capacitor
```

## 🔨 **Building the Native Library**

The plugin includes a complete native implementation of llama.cpp. To build the native libraries:

### **Prerequisites**

- **CMake** (3.16+ for iOS, 3.10+ for Android)
- **Xcode** (for iOS builds, macOS only)
- **Android Studio** with NDK (for Android builds)
- **Make** or **Ninja** build system

### **Automated Build**

```bash
# Build for all platforms
npm run build:native

# Build for specific platforms
npm run build:ios      # iOS only
npm run build:android  # Android only

# Clean native builds
npm run clean:native
```

### **Manual Build**

#### **iOS Build**
```bash
cd ios
cmake -B build -S .
cmake --build build --config Release
```

#### **Android Build**
```bash
cd android
./gradlew assembleRelease
```

### **Build Output**

- **iOS**: `ios/build/LlamaCpp.framework/`
- **Android**: `android/src/main/jniLibs/{arch}/libllama-cpp-{arch}.so`

### **Updating the native source (e.g. for vision model support)**

The native `cpp/` layer is based on [llama.cpp](https://github.com/ggerganov/llama.cpp). To pull in a newer upstream version (e.g. for vision model support) **without overwriting** the Capacitor adapter code, use the **bootstrap script** (included in this repo):

```bash
./scripts/bootstrap.sh [branch-or-tag-or-commit]
# Example: ./scripts/bootstrap.sh master
```

This syncs upstream into `cpp/` and keeps project-specific files (`cap-*.cpp/h`, `tools/mtmd/`, etc.) intact. After running it, reconcile any API changes in the adapter code, then rebuild with `npm run build:native` or `./build-native.sh`. See [cpp/README.md](cpp/README.md) and [docs/IOS_IMPLEMENTATION_GUIDE.md](docs/IOS_IMPLEMENTATION_GUIDE.md).

### **iOS implementation (step-by-step)**

For a **step-by-step guide** on how methods are implemented on the iOS side (Swift bridge → native framework, adding/updating C symbols, and updating the native layer for vision), see **[docs/IOS_IMPLEMENTATION_GUIDE.md](docs/IOS_IMPLEMENTATION_GUIDE.md)**.

### iOS Setup

1. Install the plugin:
```sh
npm install llama-cpp
```

2. Add to your iOS project:
```sh
npx cap add ios
npx cap sync ios
```

3. Open the project in Xcode:
```sh
npx cap open ios
```

### Android Setup

1. Install the plugin:
```sh
npm install llama-cpp
```

2. Add to your Android project:
```sh
npx cap add android
npx cap sync android
```

3. Open the project in Android Studio:
```sh
npx cap open android
```

## 🎯 Quick Start

### Basic Text Completion

```typescript
import { initLlama } from 'llama-cpp';

// Initialize a model
const context = await initLlama({
  model: '/path/to/your/model.gguf',
  n_ctx: 2048,
  n_threads: 4,
  n_gpu_layers: 0,
});

// Generate text
const result = await context.completion({
  prompt: "Hello, how are you today?",
  n_predict: 50,
  temperature: 0.8,
});

console.log('Generated text:', result.text);
```

### Chat-Style Conversations

```typescript
const result = await context.completion({
  messages: [
    { role: "system", content: "You are a helpful AI assistant." },
    { role: "user", content: "What is the capital of France?" },
    { role: "assistant", content: "The capital of France is Paris." },
    { role: "user", content: "Tell me more about it." }
  ],
  n_predict: 100,
  temperature: 0.7,
});

console.log('Chat response:', result.content);
```

### Streaming Completion

```typescript
let fullText = '';
const result = await context.completion({
  prompt: "Write a short story about a robot learning to paint:",
  n_predict: 150,
  temperature: 0.8,
}, (tokenData) => {
  // Called for each token as it's generated
  fullText += tokenData.token;
  console.log('Token:', tokenData.token);
});

console.log('Final result:', result.text);
```

## 🚀 **Mobile-Optimized Speculative Decoding**

**Achieve 2-8x faster inference with significantly reduced battery consumption!**

Speculative decoding uses a smaller "draft" model to predict multiple tokens ahead, which are then verified by the main model. This results in dramatic speedups with identical output quality.

### Basic Usage

```typescript
import { initLlama } from 'llama-cpp-capacitor';

// Initialize with speculative decoding
const context = await initLlama({
  model: '/path/to/your/main-model.gguf',         // Main model (e.g., 7B)
  draft_model: '/path/to/your/draft-model.gguf', // Draft model (e.g., 1.5B)
  
  // Speculative decoding parameters
  speculative_samples: 3,      // Number of tokens to predict speculatively
  mobile_speculative: true,    // Enable mobile optimizations
  
  // Standard parameters
  n_ctx: 2048,
  n_threads: 4,
});

// Use normally - speculative decoding is automatic
const result = await context.completion({
  prompt: "Write a story about AI:",
  n_predict: 200,
  temperature: 0.7,
});

console.log('🚀 Generated with speculative decoding:', result.text);
```

### Mobile-Optimized Configuration

```typescript
// Recommended mobile setup for best performance/battery balance
const mobileContext = await initLlama({
  // Quantized models for mobile efficiency
  model: '/models/llama-2-7b-chat.q4_0.gguf',
  draft_model: '/models/tinyllama-1.1b-chat.q4_0.gguf',
  
  // Conservative mobile settings
  n_ctx: 1024,                 // Smaller context for mobile
  n_threads: 3,                // Conservative threading
  n_batch: 64,                 // Smaller batch size
  n_gpu_layers: 24,            // Utilize mobile GPU
  
  // Optimized speculative decoding
  speculative_samples: 3,      // 2-3 tokens ideal for mobile
  mobile_speculative: true,    // Enables mobile-specific optimizations
  
  // Memory optimizations
  use_mmap: true,              // Memory mapping for efficiency
  use_mlock: false,            // Don't lock memory on mobile
});
```

### Performance Benefits

- **2-8x faster inference** - Dramatically reduced time to generate text
- **50-80% battery savings** - Less time computing = longer battery life
- **Identical output quality** - Same text quality as regular decoding
- **Automatic fallback** - Falls back to regular decoding if draft model fails
- **Mobile optimized** - Specifically tuned for mobile device constraints

### Model Recommendations

| Model Type | Recommended Size | Quantization | Example |
|------------|------------------|--------------|---------|
| **Main Model** | 3-7B parameters | Q4_0 or Q4_1 | `llama-2-7b-chat.q4_0.gguf` |
| **Draft Model** | 1-1.5B parameters | Q4_0 | `tinyllama-1.1b-chat.q4_0.gguf` |

### Error Handling & Fallback

```typescript
// Robust setup with automatic fallback
try {
  const context = await initLlama({
    model: '/models/main-model.gguf',
    draft_model: '/models/draft-model.gguf',
    speculative_samples: 3,
    mobile_speculative: true,
  });
  console.log('✅ Speculative decoding enabled');
} catch (error) {
  console.warn('⚠️ Falling back to regular decoding');
  const context = await initLlama({
    model: '/models/main-model.gguf',
    // No draft_model = regular decoding
  });
}
```

## 📚 API Reference

### Core Functions

#### `initLlama(params: ContextParams, onProgress?: (progress: number) => void): Promise<LlamaContext>`

Initialize a new llama.cpp context with a model.

**Parameters:**
- `params`: Context initialization parameters
- `onProgress`: Optional progress callback (0-100)

**Returns:** Promise resolving to a `LlamaContext` instance

#### `releaseAllLlama(): Promise<void>`

Release all contexts and free memory.

#### `toggleNativeLog(enabled: boolean): Promise<void>`

Enable or disable native logging.

#### `addNativeLogListener(listener: (level: string, text: string) => void): { remove: () => void }`

Add a listener for native log messages.

### LlamaContext Class

#### `completion(params: CompletionParams, callback?: (data: TokenData) => void): Promise<NativeCompletionResult>`

Generate text completion.

**Parameters:**
- `params`: Completion parameters including prompt or messages
- `callback`: Optional callback for token-by-token streaming

#### `tokenize(text: string, options?: { media_paths?: string[] }): Promise<NativeTokenizeResult>`

Tokenize text or text with images.

#### `detokenize(tokens: number[]): Promise<string>`

Convert tokens back to text.

#### `embedding(text: string, params?: EmbeddingParams): Promise<NativeEmbeddingResult>`

Generate embeddings for text.

#### `rerank(query: string, documents: string[], params?: RerankParams): Promise<RerankResult[]>`

Rank documents by relevance to a query.

#### `bench(pp: number, tg: number, pl: number, nr: number): Promise<BenchResult>`

Benchmark model performance.

### Multimodal Support

#### `initMultimodal(params: { path: string; use_gpu?: boolean }): Promise<boolean>`

Initialize multimodal support with a projector file.

#### `isMultimodalEnabled(): Promise<boolean>`

Check if multimodal support is enabled.

#### `getMultimodalSupport(): Promise<{ vision: boolean; audio: boolean }>`

Get multimodal capabilities.

#### `releaseMultimodal(): Promise<void>`

Release multimodal resources.

### TTS (Text-to-Speech)

#### `initVocoder(params: { path: string; n_batch?: number }): Promise<boolean>`

Initialize TTS with a vocoder model.

#### `isVocoderEnabled(): Promise<boolean>`

Check if TTS is enabled.

#### `getFormattedAudioCompletion(speaker: object | null, textToSpeak: string): Promise<{ prompt: string; grammar?: string }>`

Get formatted audio completion prompt.

#### `getAudioCompletionGuideTokens(textToSpeak: string): Promise<Array<number>>`

Get guide tokens for audio completion.

#### `decodeAudioTokens(tokens: number[]): Promise<Array<number>>`

Decode audio tokens to audio data.

#### `releaseVocoder(): Promise<void>`

Release TTS resources.

### LoRA Adapters

#### `applyLoraAdapters(loraList: Array<{ path: string; scaled?: number }>): Promise<void>`

Apply LoRA adapters to the model.

#### `removeLoraAdapters(): Promise<void>`

Remove all LoRA adapters.

#### `getLoadedLoraAdapters(): Promise<Array<{ path: string; scaled?: number }>>`

Get list of loaded LoRA adapters.

### Session Management

#### `saveSession(filepath: string, options?: { tokenSize: number }): Promise<number>`

Save current session to a file.

#### `loadSession(filepath: string): Promise<NativeSessionLoadResult>`

Load session from a file.

## 🔧 Configuration

### Context Parameters

```typescript
interface ContextParams {
  model: string;                    // Path to GGUF model file
  n_ctx?: number;                   // Context size (default: 512)
  n_threads?: number;               // Number of threads (default: 4)
  n_gpu_layers?: number;            // GPU layers (iOS only)
  use_mlock?: boolean;              // Lock memory (default: false)
  use_mmap?: boolean;               // Use memory mapping (default: true)
  embedding?: boolean;              // Embedding mode (default: false)
  cache_type_k?: string;            // KV cache type for K
  cache_type_v?: string;            // KV cache type for V
  pooling_type?: string;            // Pooling type
  // ... more parameters
}
```

### Completion Parameters

```typescript
interface CompletionParams {
  prompt?: string;                  // Text prompt
  messages?: Message[];             // Chat messages
  n_predict?: number;               // Max tokens to generate
  temperature?: number;             // Sampling temperature
  top_p?: number;                   // Top-p sampling
  top_k?: number;                   // Top-k sampling
  stop?: string[];                  // Stop sequences
  // ... more parameters
}
```

## 📱 Platform Support

| Feature | iOS | Android | Web |
|---------|-----|---------|-----|
| Text Generation | ✅ | ✅ | ❌ |
| Chat Conversations | ✅ | ✅ | ❌ |
| Streaming | ✅ | ✅ | ❌ |
| Multimodal | ✅ | ✅ | ❌ |
| TTS | ✅ | ✅ | ❌ |
| LoRA Adapters | ✅ | ✅ | ❌ |
| Embeddings | ✅ | ✅ | ❌ |
| Reranking | ✅ | ✅ | ❌ |
| Session Management | ✅ | ✅ | ❌ |
| Benchmarking | ✅ | ✅ | ❌ |

## 🎨 Advanced Examples

### Multimodal Processing

```typescript
// Initialize multimodal support
await context.initMultimodal({
  path: '/path/to/mmproj.gguf',
  use_gpu: true,
});

// Process image with text
const result = await context.completion({
  messages: [
    { 
      role: "user", 
      content: [
        { type: "text", text: "What do you see in this image?" },
        { type: "image_url", image_url: { url: "file:///path/to/image.jpg" } }
      ]
    }
  ],
  n_predict: 100,
});

console.log('Image analysis:', result.content);
```

### Text-to-Speech

```typescript
// Initialize TTS
await context.initVocoder({
  path: '/path/to/vocoder.gguf',
  n_batch: 512,
});

// Generate audio
const audioCompletion = await context.getFormattedAudioCompletion(
  null, // Speaker configuration
  "Hello, this is a test of text-to-speech functionality."
);

const guideTokens = await context.getAudioCompletionGuideTokens(
  "Hello, this is a test of text-to-speech functionality."
);

const audioResult = await context.completion({
  prompt: audioCompletion.prompt,
  grammar: audioCompletion.grammar,
  guide_tokens: guideTokens,
  n_predict: 1000,
});

const audioData = await context.decodeAudioTokens(audioResult.audio_tokens);
```

### LoRA Adapters

```typescript
// Apply LoRA adapters
await context.applyLoraAdapters([
  { path: '/path/to/adapter1.gguf', scaled: 1.0 },
  { path: '/path/to/adapter2.gguf', scaled: 0.5 }
]);

// Check loaded adapters
const adapters = await context.getLoadedLoraAdapters();
console.log('Loaded adapters:', adapters);

// Generate with adapters
const result = await context.completion({
  prompt: "Test prompt with LoRA adapters:",
  n_predict: 50,
});

// Remove adapters
await context.removeLoraAdapters();
```

### Structured Output

#### JSON Schema (Auto-converted to GBNF)
```typescript
const result = await context.completion({
  prompt: "Generate a JSON object with a person's name, age, and favorite color:",
  n_predict: 100,
  response_format: {
    type: 'json_schema',
    json_schema: {
      strict: true,
      schema: {
        type: 'object',
        properties: {
          name: { type: 'string' },
          age: { type: 'number' },
          favorite_color: { type: 'string' }
        },
        required: ['name', 'age', 'favorite_color']
      }
    }
  }
});

console.log('Structured output:', result.content);
```

#### Direct GBNF Grammar
```typescript
// Define GBNF grammar directly for maximum control
const grammar = `
root ::= "{" ws name_field "," ws age_field "," ws color_field "}"
name_field ::= "\\"name\\"" ws ":" ws string_value
age_field ::= "\\"age\\"" ws ":" ws number_value  
color_field ::= "\\"favorite_color\\"" ws ":" ws string_value
string_value ::= "\\"" [a-zA-Z ]+ "\\""
number_value ::= [0-9]+
ws ::= [ \\t\\n]*
`;

const result = await context.completion({
  prompt: "Generate a person's profile:",
  grammar: grammar,
  n_predict: 100
});

console.log('Grammar-constrained output:', result.text);
```

#### Manual JSON Schema to GBNF Conversion
```typescript
import { convertJsonSchemaToGrammar } from 'llama-cpp-capacitor';

const schema = {
  type: 'object',
  properties: {
    name: { type: 'string' },
    age: { type: 'number' }
  },
  required: ['name', 'age']
};

// Convert schema to GBNF grammar
const grammar = await convertJsonSchemaToGrammar(schema);
console.log('Generated grammar:', grammar);

const result = await context.completion({
  prompt: "Generate a person:",
  grammar: grammar,
  n_predict: 100
});
```

## 🔍 Model Compatibility

This plugin supports GGUF format models, which are compatible with llama.cpp. You can find GGUF models on Hugging Face by searching for the "GGUF" tag.

### Recommended Models

- **Llama 2**: Meta's latest language model
- **Mistral**: High-performance open model
- **Code Llama**: Specialized for code generation
- **Phi-2**: Microsoft's efficient model
- **Gemma**: Google's open model

### Model Quantization

For mobile devices, consider using quantized models (Q4_K_M, Q5_K_M, etc.) to reduce memory usage and improve performance.

## ⚡ Performance Considerations

### Memory Management

- Use quantized models for better memory efficiency
- Adjust `n_ctx` based on your use case
- Monitor memory usage with `use_mlock: false`

### GPU Acceleration

- iOS: Set `n_gpu_layers` to use Metal GPU acceleration
- Android: GPU acceleration is automatically enabled when available

### Threading

- Adjust `n_threads` based on device capabilities
- More threads may improve performance but increase memory usage

## 🐛 Troubleshooting

### Common Issues

1. **Model not found**: Ensure the model path is correct and the file exists
2. **Out of memory**: Try using a quantized model or reducing `n_ctx`
3. **Slow performance**: Enable GPU acceleration or increase `n_threads`
4. **Multimodal not working**: Ensure the mmproj file is compatible with your model

### Debugging

Enable native logging to see detailed information:

```typescript
import { toggleNativeLog, addNativeLogListener } from 'llama-cpp';

await toggleNativeLog(true);

const logListener = addNativeLogListener((level, text) => {
  console.log(`[${level}] ${text}`);
});
```

## 📦 Publishing

To publish the package to npm:

1. **Build** (runs automatically on `npm publish` via `prepublishOnly`): `npm run build` — produces `dist/` (plugin bundles, ESM, docs).
2. **Optional — include native libs in the tarball**: `npm run build:all` (requires macOS/NDK) — builds iOS framework and Android `.so` into `ios/Frameworks` and `android/src/main/jniLibs`.
3. **Verify pack**: `npm run pack` (JS only) or `npm run pack:full` (JS + native) — lists files that would be published.
4. **Publish**: `npm publish`.

See [NPM_PUBLISH_GUIDE.md](NPM_PUBLISH_GUIDE.md) for 2FA/token setup and troubleshooting.

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- [llama.cpp](https://github.com/ggerganov/llama.cpp) - The core inference engine
- [Capacitor](https://capacitorjs.com/) - The cross-platform runtime
- [llama.rn](https://github.com/mybigday/llama.rn) - Inspiration for the React Native implementation

## 📞 Support

- 📧 Email: support@arusatech.com
- 🐛 Issues: [GitHub Issues](https://github.com/arusatech/llama-cpp/issues)
- 📖 Documentation: [GitHub Wiki](https://github.com/arusatech/llama-cpp/wiki)