# Kusamoji 草文字

Segments Japanese text into morphemes and attaches part-of-speech, reading, and pronunciation metadata to each one.

## Features

- **Viterbi tokenization** with IPADIC/NEologd dictionary support
- **Custom dictionary** — bring your own IPADIC/NEologd `.dat` files
- **OS-level native dict loading** via memory-mapped I/O: near-instant boot (~1 s vs ~3-4 s with the `fs.readFile` fallback), with the dictionary held in the OS-managed page cache
- **Automatic memory management** — lets the OS handle page caching; no manual tuning needed
- **Viterbi length bonus** — prevents short dictionary fragments from stealing prefixes of longer correct matches
- **Zero-copy TypedArray access** to binary dictionary data
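
To illustrate the length bonus idea from the list above, here is a toy Viterbi segmenter. It is only a sketch: it uses word costs with no connection costs, and the cost values, the `segment` function, and the bonus formula are all assumptions made for this example, not kusamoji's actual cost function.

```js
// Toy Viterbi segmenter: word costs only, no connection costs.
// Costs and the bonus formula are illustrative assumptions,
// not kusamoji's real implementation.
function segment(text, lexicon, lengthBonus = 0) {
    const n = text.length
    const best = new Array(n + 1).fill(Infinity) // best[i]: lowest cost to reach offset i
    const back = new Array(n + 1).fill(-1)       // back[i]: start offset of the word ending at i
    best[0] = 0

    for (let end = 1; end <= n; end++) {
        for (let start = 0; start < end; start++) {
            const word = text.slice(start, end)
            if (!lexicon.has(word) || best[start] === Infinity) continue
            // Longer words get a discount proportional to their extra characters.
            const cost = best[start] + lexicon.get(word) - lengthBonus * (word.length - 1)
            if (cost < best[end]) {
                best[end] = cost
                back[end] = start
            }
        }
    }

    if (best[n] === Infinity) return null // no path covers the whole input

    // Walk the back-pointers from the end to recover the best path.
    const words = []
    for (let end = n; end > 0; end = back[end]) {
        words.unshift(text.slice(back[end], end))
    }
    return words
}

const lexicon = new Map([
    ['名', 30], ['人', 30], ['戦', 30],
    ['名人', 70], ['名人戦', 100],
])

console.log(segment('名人戦', lexicon, 0))  // [ '名', '人', '戦' ]
console.log(segment('名人戦', lexicon, 10)) // [ '名人戦' ]
```

Without the bonus, three cheap single-character entries (total 90) beat the full compound (100). With a bonus of 10 per extra character, the compound costs 80 and wins.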

## Install

```bash
pnpm add kusamoji
# or
npm install kusamoji
```

### How the native mmap addon works

kusamoji ships precompiled mmap binaries for common platforms. The addon is **optional**: kusamoji works without it, but boots more slowly and uses more RAM.

**You don't need to do anything.** On first use, kusamoji automatically:

1. Finds the matching prebuilt binary inside the package (`src/native/prebuilds/`)
2. Copies it to `~/.kusamoji/` for persistence across reinstalls
3. Loads it — mmap dict loading is now active

If no prebuilt matches your platform, kusamoji silently falls back to `fs.readFile`. Everything works — the mmap addon is a performance optimization, not a requirement.
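
The fallback behavior can be pictured with a small sketch. The names here are assumptions for illustration only: `readDictFile` is a hypothetical helper, `addon` stands in for whatever the native loader returns, and `addon.mapFile` is not a documented kusamoji method. The sketch also uses `fs.readFileSync` instead of `fs.readFile` to keep the example short.

```js
const fs = require('fs')

// `addon` stands in for whatever the native loader returns (null when unavailable).
// `addon.mapFile` is a hypothetical method name used only for illustration.
function readDictFile(filePath, addon) {
    if (addon && typeof addon.mapFile === 'function') {
        try {
            return addon.mapFile(filePath) // fast path: memory-mapped bytes
        } catch (err) {
            // Mapping failed: fall through to the portable path below.
        }
    }
    const buf = fs.readFileSync(filePath)
    // View the bytes that were just read as a Uint8Array without copying them again.
    return new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)
}
```

Either branch returns a typed array, so the code that consumes the dictionary does not need to know which path was taken.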

### Shipped prebuilts

| Platform | Architecture          | Status                     |
| -------- | --------------------- | -------------------------- |
| macOS    | Apple Silicon (arm64) | ✅ Shipped                 |
| macOS    | Intel (x64)           | Compile from source        |
| Linux    | x64 (Intel/AMD)       | ✅ Shipped                 |
| Linux    | arm64 (Graviton, RPi) | ✅ Shipped                 |
| Windows  | any                   | Not supported (POSIX only) |
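
To check from a script whether the current machine falls in a shipped row of the table above, something like the following could be used. The `platform-arch` key format and the `hasShippedPrebuilt` helper are assumptions for this sketch; they mirror the table, not kusamoji's internal lookup.

```js
// Platform/architecture pairs with a shipped prebuilt, per the table above.
// The `platform-arch` key format is an illustrative assumption.
const SHIPPED_PREBUILTS = new Set(['darwin-arm64', 'linux-x64', 'linux-arm64'])

function hasShippedPrebuilt(platform = process.platform, arch = process.arch) {
    return SHIPPED_PREBUILTS.has(`${platform}-${arch}`)
}

console.log(
    hasShippedPrebuilt()
        ? 'prebuilt available'
        : 'no prebuilt: compile from source or rely on the fs.readFile fallback'
)
```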

### Troubleshooting the native addon

**"I installed kusamoji but I'm not sure if mmap is active"**

```bash
node -e "
  const path = require('path');
  const loader = require(path.join(require.resolve('kusamoji'), '..', 'native', 'loader.js'));
  const addon = loader.loadMmapAddon();
  console.log(addon ? 'mmap is ACTIVE' : 'mmap is NOT active (using fs.readFile fallback)');
"
```

**"pnpm install didn't set up the addon"**

This is normal. pnpm may skip `postinstall` scripts for security. The addon is loaded lazily on first use — no manual setup needed. If you want to pre-warm the cache:

```bash
pnpm rebuild kusamoji
```

**"I want to compile the addon from source"**

For platforms without a shipped prebuilt, or if you want to rebuild:

```bash
npx kusamoji rebuild-native
```

Requires: C compiler (gcc/clang), Python 3. The compiled binary is cached at `~/.kusamoji/` and persists across `pnpm install` cycles.

**"I'm on an unsupported platform"**

kusamoji falls back to `fs.readFile` automatically. Dictionary loading still works — boot is ~3-4s instead of ~1s, and RAM is higher (~2.5 GB vs ~1.4 GB for NEologd). No action needed.

### Binary cache directory (`~/.kusamoji/`)

The native addon binary is cached at `~/.kusamoji/` along with a `config.json` metadata file. This cache:

- Survives `pnpm install` / `npm install` cycles
- Is validated against your Node.js N-API version on each load
- Is automatically refreshed when you upgrade Node.js to a new major version
- Can be safely deleted — it will be recreated on next use
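
To see what is currently in the cache, a short script along these lines may help. `describeCache` is a hypothetical helper written for this README, and it only lists file names because the layout of `config.json` is not documented here.

```js
const fs = require('fs')
const os = require('os')
const path = require('path')

// Lists the cache directory contents; returns an empty list when it does not exist yet.
function describeCache(dir = path.join(os.homedir(), '.kusamoji')) {
    if (!fs.existsSync(dir)) return { dir, exists: false, files: [] }
    return { dir, exists: true, files: fs.readdirSync(dir).sort() }
}

// The N-API version of the running Node.js, which the cache is validated against.
console.log('N-API version:', process.versions.napi)
console.log(describeCache())
```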

## Quick Start

```js
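const kusamoji = require('kusamoji')

// Top-level await is not available in CommonJS, so the calls run inside an async function.
async function main() {
    const tokenizer = await kusamoji.builder({ dicPath: '/path/to/dict' }).buildAsync()

    const tokens = tokenizer.tokenize('大谷翔平がロサンゼルス・ドジャースで3本塁打を放った')

    for (const token of tokens) {
        console.log(token.surface_form, token.reading, token.pos)
    }
}

main().catch(console.error)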
// 大谷翔平      オオタニショウヘイ  名詞
// が            ガ                助詞
// ロサンゼルス   ロサンゼルス       名詞
// ・            ・                記号
// ドジャース     ドジャース         名詞
// で            デ                助詞
// 3             サン              名詞
// 本塁打        ホンルイダ         名詞
// を            ヲ                助詞
// 放っ          ハナッ             動詞
// た            タ                助動詞
```

### More examples

Dates, counters, and proper nouns are resolved natively from the dictionary — no preprocessing needed:

```js
tokenizer.tokenize('2026年4月9日、川崎市の製鉄所で作業員が転落する事故が発生した')
// 2026年      ニセンニジュウロクネン  名詞    ← full year reading
// 4月9日      シガツココノカ        名詞    ← month + day as one token
// 、          、                記号
// 川崎市      カワサキシ           名詞    ← place name
// の          ノ                助詞
// 製鉄所      セイテツジョ          名詞    ← rendaku: 所(ショ→ジョ)
// で          デ                助詞
// 作業員      サギョウイン          名詞
// が          ガ                助詞
// 転落        テンラク            名詞
// する        スル               動詞
// 事故        ジコ               名詞
// が          ガ                助詞
// 発生        ハッセイ            名詞
// し          シ                動詞
// た          タ                助動詞

tokenizer.tokenize('藤井聡太名人は第84期将棋名人戦で圧倒的な強さを見せた')
// 藤井聡太    フジイソウタ          名詞    ← NEologd proper noun
// 名人        メイジン            名詞
// は          ハ                助詞
// 第          ダイ               接頭詞
// 84期       ハチジュウヨンキ       名詞    ← digit+counter compound
// 将棋        ショウギ            名詞
// 名人戦      メイジンセン          名詞
// で          デ                助詞
// 圧倒的      アットウテキ          名詞
// な          ナ                助動詞
// 強          ツヨ               形容詞
// さ          サ                名詞
// を          ヲ                助詞
// 見せ        ミセ               動詞
// た          タ                助動詞
```

## Benchmarks

All numbers measured on Apple M1 Pro, Node.js 22, NEologd dictionary (6.1M entries, 1.4 GB uncompressed). Methodology: 700 real-world Japanese news snippets × 9 conversion variants = 6,300 HTTP calls end-to-end through an Express service.

### Cold start

| Mode         | Boot time | Ready for first query                               |
| ------------ | --------: | --------------------------------------------------- |
| **kusamoji** | **1.0 s** | Dictionary memory-mapped, OS demand-pages on access |
| kuromoji.js  |    8–12 s | gunzip + parse all 12 .dat.gz files                 |

### Runtime memory (RSS)

| Mode         |   Idle RSS | Under load (700 concurrent) |     Peak |
| ------------ | ---------: | --------------------------: | -------: |
| **kusamoji** | **1.4 GB** |                      2.2 GB |   3.1 GB |
| kuromoji.js  |     6–8 GB |                       8+ GB | OOM risk |

With mmap, the ~1.4 GB dictionary sits in the OS page cache, **not V8 heap**. Under memory pressure the OS evicts cold pages automatically. V8's garbage collector never sees the dictionary data.
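
One way to observe the RSS versus V8 heap split in your own process is to compare the fields of `process.memoryUsage()` before and after the tokenizer is built. This is a minimal sketch; `memorySnapshotMB` is a hypothetical helper, and the numbers you see will depend on your dictionary and platform.

```js
// Reports resident set size, V8 heap usage, and external memory in MB.
function memorySnapshotMB() {
    const { rss, heapUsed, external } = process.memoryUsage()
    const toMB = (bytes) => Math.round(bytes / 1024 / 1024)
    return { rssMB: toMB(rss), heapUsedMB: toMB(heapUsed), externalMB: toMB(external) }
}

console.log(memorySnapshotMB())
```

If the dictionary is memory-mapped, `rssMB` should grow as pages are touched while `heapUsedMB` stays comparatively small.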

### Tokenization throughput

| Input                      | Tokens/call | Latency (p50) |    Throughput |
| -------------------------- | ----------: | ------------: | ------------: |
| Short sentence (10 chars)  |          ~5 |    **0.3 ms** | 3,300 calls/s |
| News headline (50 chars)   |         ~20 |    **0.8 ms** | 1,250 calls/s |
| News article (500 chars)   |        ~150 |      **5 ms** |   200 calls/s |
| Long article (2,000 chars) |        ~600 |     **18 ms** |    55 calls/s |
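
To get a rough median latency on your own hardware, a loop like the one below can be used. It is a sketch under stated assumptions: `measureP50` is a hypothetical helper, and it times in-process calls only, so it will not reproduce the end-to-end HTTP methodology described above.

```js
const { performance } = require('perf_hooks')

// Runs `fn` repeatedly and returns the median wall-clock time per call in milliseconds.
function measureP50(fn, iterations = 1000) {
    const samples = []
    for (let i = 0; i < iterations; i++) {
        const start = performance.now()
        fn()
        samples.push(performance.now() - start)
    }
    samples.sort((a, b) => a - b)
    return samples[Math.floor(samples.length / 2)]
}
```

With a built tokenizer, a call such as `measureP50(() => tokenizer.tokenize(headline))` gives a figure to compare against the p50 column.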

### Accuracy (6,300-call harness)

700 real-world news snippets from Yahoo News Japan, NHK, and Mainichi — mixed content with ASCII brand names, URLs, numbers, brackets, and quoted English.

The news snippets used as input are available in the [Kusamoji Test News Snippets](https://github.com/KimuraRisei/kusamoji-test-news-snippets) repository.

| Metric                              | Score                                    |
| ----------------------------------- | ---------------------------------------- |
| Romaji conversion (5 systems × 700) | **99.0%** kanji-free output              |
| Kana conversion (4 modes × 700)     | **99.0%** kanji-free output              |
| Jukujikun (熟字訓) accuracy         | **48 / 49** tested compounds             |
| Proper noun accuracy (NEologd)      | **10 / 10** (大谷翔平, 宮崎駿, etc.)     |
| Place name accuracy                 | **10 / 10** (東京, 鹿児島, 秋葉原, etc.) |
| File descriptor leaks               | **0** after 6,300 calls                  |

### vs. alternatives

| Feature              | kusamoji                | kuromoji.js    | MeCab (C++)     | Sudachi (Java/Rust) |
| -------------------- | ----------------------- | -------------- | --------------- | ------------------- |
| Runtime              | Node.js                 | Node.js        | Native binary   | JVM / Native        |
| Dict loading         | **mmap (zero-copy)**    | gunzip to heap | mmap            | mmap (Rust)         |
| Boot time (NEologd)  | **~1 s**                | ~10 s          | ~0 s            | ~0.2 s              |
| RSS (NEologd)        | **~1.4 GB**             | ~6-8 GB        | ~0.5 GB         | ~0.2 GB             |
| Viterbi optimization | **Length bonus**        | None           | Cost estimation | CowArray            |
| POS source strategy  | **Pluggable (3 modes)** | In-heap only   | mmap            | mmap                |
| NEologd support      | ✅                      | ✅             | ✅              | ✅ (built-in)       |
| Node.js native       | ✅                      | ✅             | FFI required    | FFI required        |
| npm install          | ✅ `npm i kusamoji`     | ✅             | ❌              | ❌                  |
| Zero native deps     | ✅ (optional mmap)      | ✅             | N/A             | N/A                 |

> **Note:** MeCab and Sudachi achieve lower RSS because they are implemented in compiled languages with direct memory management. With the mmap addon, kusamoji's RSS on NEologd (~1.4 GB) is roughly 3× MeCab's (~0.5 GB), the closest any pure-npm Japanese tokenizer has gotten.

## API

### `kusamoji.builder(options)`

Returns a `TokenizerBuilder`.

| Option    | Type     | Required | Description                                          |
| --------- | -------- | -------- | ---------------------------------------------------- |
| `dicPath` | `string` | Yes      | Path to the directory containing the 12 `.dat` files |

### `builder.buildAsync()` → `Promise<Tokenizer>`

Loads the dictionary and returns a `Tokenizer` instance.
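
Because loading the dictionary is the expensive step, a long-running service will usually want to build the tokenizer once and share it. The sketch below caches the promise; `createTokenizerCache` and `createBuilder` are hypothetical names introduced for this example, and `createBuilder` is assumed to return an object with a `buildAsync()` method, such as the result of `kusamoji.builder(...)`.

```js
// `createBuilder` is any function returning an object with buildAsync(),
// e.g. () => kusamoji.builder({ dicPath: '/path/to/dict' }).
function createTokenizerCache(createBuilder) {
    let pending = null
    return function getTokenizer() {
        if (!pending) {
            pending = createBuilder().buildAsync().catch((err) => {
                pending = null // allow a retry after a failed load
                throw err
            })
        }
        return pending // every caller shares the same in-flight or resolved promise
    }
}
```

Callers then use `const tokenizer = await getTokenizer()` inside their request handlers, and concurrent requests during startup wait on the same load.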

### `builder.build(callback)`

Callback-style variant: `callback(err, tokenizer)`.

### `tokenizer.tokenize(text)` → `Token[]`

Tokenizes input text. Returns an array of tokens:

```js
{
    surface_form: "東京",       // as it appears in the text
    pos: "名詞",                // part of speech
    pos_detail_1: "固有名詞",   // POS subcategory 1
    pos_detail_2: "地域",       // POS subcategory 2
    pos_detail_3: "一般",       // POS subcategory 3
    conjugated_type: "*",       // conjugation type
    conjugated_form: "*",       // conjugated form
    basic_form: "東京",         // dictionary form
    reading: "トウキョウ",      // reading in katakana
    pronunciation: "トーキョー", // pronunciation in katakana
    word_type: "KNOWN",         // "KNOWN" or "UNKNOWN"
}
```

Returns `[]` for `null`, `undefined`, or empty string input.
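
A common use of the token fields is to join the readings of a sentence. The helper below is a sketch, not part of kusamoji's API: `toReading` is a hypothetical name, and it assumes that tokens without a usable reading expose `reading` as `undefined` or `'*'`, in which case the surface form is kept.

```js
// Assumption: tokens without a usable reading have `reading` set to undefined or '*'.
function toReading(tokens) {
    return tokens
        .map((t) => (t.reading && t.reading !== '*' ? t.reading : t.surface_form))
        .join('')
}

// Hand-written tokens shaped like the structure above, for illustration only.
const sample = [
    { surface_form: '東京', reading: 'トウキョウ', word_type: 'KNOWN' },
    { surface_form: 'Tower', reading: undefined, word_type: 'UNKNOWN' },
]

console.log(toReading(sample)) // トウキョウTower
```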

## Dictionary Files

kusamoji does NOT bundle a dictionary. You need 12 **uncompressed** `.dat` files compiled from IPADIC (or IPADIC-format compatible) CSV sources:

```
base.dat, check.dat, cc.dat, tid.dat, tid_map.dat, tid_pos.dat,
unk.dat, unk_char.dat, unk_compat.dat, unk_invoke.dat, unk_map.dat, unk_pos.dat
```
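
Before building a tokenizer, it can be useful to confirm that a directory really contains all 12 files. The following is a minimal sketch; `findMissingDictFiles` is a hypothetical helper, and it checks only that the files exist, not that their contents are valid.

```js
const fs = require('fs')
const path = require('path')

// The 12 uncompressed files listed above.
const REQUIRED_DAT_FILES = [
    'base.dat', 'check.dat', 'cc.dat', 'tid.dat', 'tid_map.dat', 'tid_pos.dat',
    'unk.dat', 'unk_char.dat', 'unk_compat.dat', 'unk_invoke.dat', 'unk_map.dat', 'unk_pos.dat',
]

// Returns the names of required files that are absent from `dicPath`.
function findMissingDictFiles(dicPath) {
    return REQUIRED_DAT_FILES.filter((name) => !fs.existsSync(path.join(dicPath, name)))
}
```

An empty array means the directory is complete; otherwise the result names the files to rebuild or copy.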

### Building a dictionary

Use the included build script with IPADIC CSV sources:

```bash
node node_modules/kusamoji/dict-source/build.mjs \
    --source /path/to/csv-sources \
    --output /path/to/output
```

The source directory must contain:

- `ipadic/` — base IPADIC CSV files + `matrix.def`, `char.def`, `unk.def`
- `custom/` — (optional) your own override entries
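
IPADIC CSV rows have 13 comma-separated columns: surface form, left context ID, right context ID, cost, four part-of-speech levels, conjugation type, conjugation form, base form, reading, and pronunciation. The sketch below formats one such row. It assumes that `custom/` entries use the same column layout, which this README does not state; `formatIpadicRow` is a hypothetical helper, and the context IDs and cost shown are placeholders that must match your own `matrix.def`.

```js
// Formats one entry as a 13-column IPADIC-style CSV row.
// Assumption: `custom/` entries follow the standard IPADIC column order.
function formatIpadicRow(entry) {
    const fields = [
        entry.surface, entry.leftId, entry.rightId, entry.cost,
        ...entry.pos,                  // exactly four POS levels, '*' when unused
        entry.conjugatedType || '*',
        entry.conjugatedForm || '*',
        entry.basicForm, entry.reading, entry.pronunciation,
    ].map(String)

    if (entry.pos.length !== 4) throw new Error('pos must have exactly 4 levels')
    if (fields.some((f) => f.includes(','))) throw new Error('fields must not contain commas')
    return fields.join(',')
}

console.log(formatIpadicRow({
    surface: '草文字',
    leftId: 1288, rightId: 1288, cost: 5000, // placeholder values
    pos: ['名詞', '固有名詞', '一般', '*'],
    basicForm: '草文字', reading: 'クサモジ', pronunciation: 'クサモジ',
}))
// 草文字,1288,1288,5000,名詞,固有名詞,一般,*,*,*,草文字,クサモジ,クサモジ
```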

## License

[BSL 1.1](LICENSE) — free for personal and non-commercial use. Commercial use requires a license. Change date: 4 years from release.
