# @tanstack/ai-code-mode

Code Mode for TanStack AI — let LLMs write and execute TypeScript in secure sandboxes with typed tool access.

## Overview

Code Mode gives your AI agent an `execute_typescript` tool. Instead of one tool call per action, the LLM writes a small TypeScript program that orchestrates multiple tool calls with loops, conditionals, `Promise.all`, and data transformations — all running in an isolated sandbox.

## Installation

```bash
pnpm add @tanstack/ai-code-mode
```

You also need an isolate driver:

```bash
# Node.js (fastest, uses V8 isolates via isolated-vm)
pnpm add @tanstack/ai-isolate-node

# QuickJS WASM (browser-compatible, no native deps)
pnpm add @tanstack/ai-isolate-quickjs

# Cloudflare Workers (edge execution)
pnpm add @tanstack/ai-isolate-cloudflare
```

## Quick Start

```typescript
import { chat, toolDefinition } from '@tanstack/ai'
import { createCodeMode } from '@tanstack/ai-code-mode'
import { createNodeIsolateDriver } from '@tanstack/ai-isolate-node'
import { z } from 'zod'

// Define tools that the LLM can call from inside the sandbox
const weatherTool = toolDefinition({
  name: 'fetchWeather',
  description: 'Get weather for a city',
  inputSchema: z.object({ location: z.string() }),
  outputSchema: z.object({ temperature: z.number(), condition: z.string() }),
}).server(async ({ location }) => {
  // Your implementation
  return { temperature: 72, condition: 'sunny' }
})

// Create the execute_typescript tool and system prompt
const { tool, systemPrompt } = createCodeMode({
  driver: createNodeIsolateDriver(),
  tools: [weatherTool],
})

const result = await chat({
  adapter: yourAdapter,
  model: 'gpt-4o',
  systemPrompts: ['You are a helpful assistant.', systemPrompt],
  tools: [tool],
  messages: [
    { role: 'user', content: 'Compare weather in Tokyo, Paris, and NYC' },
  ],
})
```

The LLM might generate code like:

```typescript
const cities = ['Tokyo', 'Paris', 'NYC']
// Carry the city name alongside each result, since fetchWeather's output
// schema only contains { temperature, condition }
const results = await Promise.all(
  cities.map(async (city) => ({
    city,
    ...(await external_fetchWeather({ location: city })),
  })),
)
const warmest = results.reduce((prev, curr) =>
  curr.temperature > prev.temperature ? curr : prev,
)
return { warmestCity: warmest.city, temperature: warmest.temperature }
```

## API Reference

### `createCodeMode(config)`

Creates both the `execute_typescript` tool and its matching system prompt. This is the recommended entry point.

**Config:**

- `driver` — An `IsolateDriver` (Node, QuickJS, or Cloudflare)
- `tools` — Array of `ServerTool` or `ToolDefinition` instances. Exposed as `external_*` functions in the sandbox
- `timeout` — Execution timeout in ms (default: 30000)
- `memoryLimit` — Memory limit in MB (default: 128, supported by Node and QuickJS drivers)
- `getSkillBindings` — Optional async function returning dynamic bindings

### `createCodeModeTool(config)` / `createCodeModeSystemPrompt(config)`

Lower-level functions if you need only the tool or only the prompt. `createCodeMode` calls both internally.

### Advanced

These utilities are used internally and exported for custom pipelines:

- **`stripTypeScript(code)`** — Strips TypeScript syntax using esbuild.
- **`toolsToBindings(tools, prefix?)`** — Converts tools to `ToolBinding` records for sandbox injection.
- **`generateTypeStubs(bindings, options?)`** — Generates TypeScript type declarations from tool bindings.
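
To make the role of these utilities concrete, here is an illustrative sketch of the kind of declaration text `generateTypeStubs` produces for a binding. The `BindingSketch` shape and the exact stub format are assumptions for illustration, not the real `ToolBinding` type or implementation.

```typescript
// Illustrative only: the real ToolBinding shape and the exact stub text
// emitted by generateTypeStubs may differ.
type BindingSketch = {
  name: string
  inputType: string
  outputType: string
}

// Render one `declare function` line per binding, similar in spirit to the
// typed declarations the sandbox exposes for `external_*` functions.
function renderStub(binding: BindingSketch): string {
  return `declare function ${binding.name}(input: ${binding.inputType}): Promise<${binding.outputType}>`
}

const stub = renderStub({
  name: 'external_fetchWeather',
  inputType: '{ location: string }',
  outputType: '{ temperature: number; condition: string }',
})
console.log(stub)
```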

## Driver Selection Guide

| Driver                            | Best For                                     | Native Deps         | Browser | Memory Limit |
| --------------------------------- | -------------------------------------------- | ------------------- | ------- | ------------ |
| `@tanstack/ai-isolate-node`       | Server-side Node.js apps                     | Yes (`isolated-vm`) | No      | Yes          |
| `@tanstack/ai-isolate-quickjs`    | Browser, edge, or no-native-dep environments | No (WASM)           | Yes     | Yes          |
| `@tanstack/ai-isolate-cloudflare` | Cloudflare Workers deployments               | No                  | N/A     | N/A          |

## Custom Events

Code Mode emits custom events during execution that you can observe via the TanStack AI event system:

| Event                         | Description                                         |
| ----------------------------- | --------------------------------------------------- |
| `code_mode:execution_started` | Emitted when code execution begins                  |
| `code_mode:console`           | Emitted for each `console.log/error/warn/info` call |
| `code_mode:external_call`     | Emitted before each `external_*` function call      |
| `code_mode:external_result`   | Emitted after a successful `external_*` call        |
| `code_mode:external_error`    | Emitted when an `external_*` call fails             |
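
A minimal sketch of consuming these events follows. The `AiEvent` shape and the `handle` callback are placeholders standing in for however your app receives events from the TanStack AI event system, not the actual subscription API.

```typescript
// Placeholder event shape; the real TanStack AI event payloads differ.
type AiEvent = { type: string; payload?: unknown }

// All Code Mode events share the `code_mode:` prefix, so they can be
// filtered out of a mixed event stream.
function isCodeModeEvent(event: AiEvent): boolean {
  return event.type.startsWith('code_mode:')
}

const seen: string[] = []
const handle = (event: AiEvent) => {
  if (!isCodeModeEvent(event)) return
  if (event.type === 'code_mode:console') {
    seen.push(`console: ${JSON.stringify(event.payload)}`)
  } else {
    seen.push(event.type)
  }
}

handle({ type: 'code_mode:execution_started' })
handle({ type: 'code_mode:console', payload: ['hello'] })
handle({ type: 'unrelated:event' })
// seen → ['code_mode:execution_started', 'console: ["hello"]']
```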

## Models eval (development)

The benchmark lives in a **separate workspace package** so `@tanstack/ai-code-mode` does not depend on `@tanstack/ai-isolate-node` (avoids an Nx build cycle). See `models-eval/package.json` (`@tanstack/ai-code-mode-models-eval`).

1. `packages/typescript/ai-code-mode/models-eval/pull-models.sh` — pull recommended Ollama models
2. `pnpm --filter @tanstack/ai-code-mode-models-eval eval:capture` — run models and capture raw outputs/telemetry only (no judge LLM call)
3. `pnpm --filter @tanstack/ai-code-mode-models-eval eval:judge` — judge latest captured session from logs (no model rerun)
4. `pnpm --filter @tanstack/ai-code-mode-models-eval eval` — single-pass run+judge (legacy convenience mode)
5. `pnpm --filter @tanstack/ai-code-mode-models-eval eval -- --ollama-only` — only Ollama models from `eval-config.ts`
6. `pnpm --filter @tanstack/ai-code-mode-models-eval eval -- --ollama-only --models qwen3-coder` — one or more model ids (comma-separated)

Judge-phase flags:

- `--judge-latest` judge latest captured session
- `--rejudge` re-run judging even if logs already contain judge fields

The default list omits some small Ollama models that rarely complete code-mode successfully (see comments in `eval-config.ts`). You can still benchmark them with `--models granite4:3b` etc. if pulled locally.

### Model comparison metrics

The models eval tracks seven decision-oriented metrics plus an overall star rating:

- `accuracy` (1-10): numerical/factual correctness vs gold report
- `comprehensiveness` (1-10): whether the response covers everything requested by the user query
- `typescriptQuality` (1-10): quality/readability/type-safety of generated TypeScript
- `codeModeEfficiency` (1-10): how efficiently the model uses code-mode/tooling to reach the answer
- `speedTier` (1-5): relative wall-clock speed against peers in the same category (`local` or `cloud`)
- `tokenEfficiencyTier` (1-5): relative token efficiency against peers in the same category
- `stabilityTier` (1-5): success consistency over the latest 5 logged runs for that model
- `stars` (1-3): weighted rollup score across all metrics

Raw run telemetry also includes compile/runtime failures, redundant schema checks, total tool calls, TTFT, token totals, stability sample size/rate, and per-model logs.

### Methodology

Canonical output is written to `packages/typescript/ai-code-mode/models-eval/results.json` after each capture or judge run.

- Benchmark: single code-mode benchmark prompt over the in-memory `customers` / `products` / `purchases` dataset
- Primary quality scores (judge): `accuracy`, `comprehensiveness`, `typescriptQuality`, `codeModeEfficiency`
- Computed comparative scores: `speedTier`, `tokenEfficiencyTier`, `stabilityTier`
- Stability definition: a run is "stable" if it has no top-level run error, produces a non-empty candidate report, and has at least one successful `execute_typescript` call
- Star rollup weights:
  - accuracy: 25%
  - comprehensiveness: 15%
  - typescriptQuality: 15%
  - codeModeEfficiency (with compile/runtime failure penalty): 10%
  - speedTier: 10%
  - tokenEfficiencyTier: 10%
  - stabilityTier: 15%
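
The rollup above can be sketched as a weighted sum. The normalization (dividing 1-10 judge scores by 10 and 1-5 tiers by 5) and the star thresholds are assumptions for illustration; the real rollup also applies the compile/runtime failure penalty to `codeModeEfficiency`, which is not modeled here.

```typescript
// Assumed normalization: 1-10 judge scores scaled by /10, 1-5 tiers by /5.
type Scores = {
  accuracy: number // 1-10
  comprehensiveness: number // 1-10
  typescriptQuality: number // 1-10
  codeModeEfficiency: number // 1-10
  speedTier: number // 1-5
  tokenEfficiencyTier: number // 1-5
  stabilityTier: number // 1-5
}

// Weighted sum using the percentages listed above, yielding a [0, 1] score.
function weightedScore(s: Scores): number {
  return (
    0.25 * (s.accuracy / 10) +
    0.15 * (s.comprehensiveness / 10) +
    0.15 * (s.typescriptQuality / 10) +
    0.1 * (s.codeModeEfficiency / 10) +
    0.1 * (s.speedTier / 5) +
    0.1 * (s.tokenEfficiencyTier / 5) +
    0.15 * (s.stabilityTier / 5)
  )
}

// Hypothetical thresholds mapping the [0, 1] score onto 1-3 stars.
function toStars(score: number): number {
  return score >= 0.75 ? 3 : score >= 0.5 ? 2 : 1
}

// Example: the gpt-4o-mini row from the comparison table.
const score = weightedScore({
  accuracy: 10,
  comprehensiveness: 8,
  typescriptQuality: 7,
  codeModeEfficiency: 9,
  speedTier: 3,
  tokenEfficiencyTier: 1,
  stabilityTier: 5,
})
console.log(score.toFixed(3), toStars(score)) // → "0.795 3"
```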

### Model comparison table

The table below is transcribed from canonical `models-eval/results.json` (session `2026-03-26T15:38:44.006Z`).

| Provider  | Model                         | Category | Stars | Accuracy | Comprehensiveness | TypeScript | Code-Mode | Speed Tier | Token Tier | Stability Tier |
| --------- | ----------------------------- | -------- | ----- | -------- | ----------------- | ---------- | --------- | ---------- | ---------- | -------------- |
| Ollama    | `gpt-oss:20b`                 | local    | ★★★   | 10       | 8                 | 5          | 5         | 5          | 5          | 5              |
| Ollama    | `nemotron-cascade-2`          | local    | ★★☆   | 3        | 5                 | 6          | 5         | 1          | 5          | 5              |
| Anthropic | `claude-haiku-4-5`            | cloud    | ★★★   | 10       | 10                | 6          | 7         | 3          | 2          | 5              |
| OpenAI    | `gpt-4o-mini`                 | cloud    | ★★★   | 10       | 8                 | 7          | 9         | 3          | 1          | 5              |
| Gemini    | `gemini-2.5-flash`            | cloud    | ★★★   | 10       | 8                 | 7          | 10        | 4          | 2          | 5              |
| xAI       | `grok-4-1-fast-non-reasoning` | cloud    | ★★★   | 10       | 8                 | 6          | 10        | 4          | 5          | 5              |
| Groq      | `llama-3.3-70b-versatile`     | cloud    | ★★★   | 10       | 7                 | 6          | 9         | 5          | 3          | 4              |
| Groq      | `qwen/qwen3-32b`              | cloud    | ★★☆   | 10       | 8                 | 5          | 4         | 1          | 2          | 5              |

Suggested interpretation:

- **Local-first**: favor `stars >= 2` with high speed tier.
- **Cloud-first quality**: favor high `accuracy` + `typescriptQuality`, then compare stars.
- **Cost-sensitive**: prioritize `tokenEfficiencyTier` and `speedTier` together.

## License

MIT
