# Performance

This page describes the performance characteristics of Universal Emoji Parser and the levers you have when something gets slow. The package is intentionally simple: most "performance" work here is about _not_ introducing slow paths in new code.

## Hot paths

### `parse(text, options)` end-to-end

```
parse()
  ├── getDefaultOptions()              ← O(1) — small object merge
  ├── parseToShortcode(text)?          ← O(N · M) where N = text length, M = catalog size
  ├── parseToUnicode(text)?            ← O(K · M_avg) where K = #shortcodes in text, M_avg = average lookup cost per shortcode
  └── __parseEmojiToHtml(text)?        ← O(N · E) where E = #emoji entities
```

For typical inputs (a chat message with 1–10 emojis), the whole pipeline runs in under a millisecond. The catalog lookups are O(1) on direct slug hits and O(M) on keyword-scan fallbacks.

### `parseToShortcode` is the slowest

It builds a single regex from `Object.keys(emojiLibJsonData).join('|')` — a 1906-alternation pattern — and runs `text.matchAll` over it. Two costs:

1. **RegExp construction** is O(M) and happens **on every call**. With 1906 alternates and the keycap escape, this is the dominant cost for short inputs.
2. **Matching** is O(N · M) in the worst case: at each position the engine may try every alternate in turn before moving on, whereas a character class is tested in a single step.

If a consumer calls `parseToShortcode` in a hot loop, **the regex construction dominates**. Caching it would help but introduces stateful behavior — currently the package recreates it every call.
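A caller who owns such a hot loop can pay the construction cost once on their own side. The sketch below only illustrates the build-once pattern and is not the package's implementation: `alternates` is a hypothetical stand-in for the catalog keys, and `escapeRegExp`, `getAlternationRegex`, and `wrapMatches` are names invented for this example.

```ts
// Hypothetical stand-in for the catalog keys; the real list has 1906 entries.
const alternates: string[] = ['🚀', '🔥', '*️⃣']

// Escape regex metacharacters so keys such as the keycap '*️⃣' match literally.
function escapeRegExp(source: string): string {
  return source.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

let cachedRegex: RegExp | null = null

// Build the alternation on first use and reuse it on every later call.
function getAlternationRegex(): RegExp {
  if (cachedRegex === null) {
    // Longest alternates first, so a longer sequence wins over its own prefix.
    const pattern = alternates
      .slice()
      .sort((a, b) => b.length - a.length)
      .map(escapeRegExp)
      .join('|')
    cachedRegex = new RegExp(pattern, 'g')
  }
  return cachedRegex
}

// String.prototype.replace resets lastIndex for global regexes,
// so sharing one instance across calls is safe here.
function wrapMatches(text: string): string {
  return text.replace(getAlternationRegex(), (match) => `[${match}]`)
}
```

The module-level `cachedRegex` is exactly the stateful behavior mentioned above, which is why it is shown here as a caller-side choice.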

### Catalog lookup priority

`getEmojiObjectByShortcode(shortcode)`:

1. **Direct hit** — `emojiLibJsonData[shortcode]` — O(1)
2. **Keyword scan** — `Object.keys(...).find(...)` — O(M · K_avg) where K_avg ≈ 5 (average keywords per emoji)

For canonical slugs the fast path always wins. For dialect aliases (`:thumbsup:` → 👍), the scan is unavoidable. There's no inverted index — it would double the catalog memory for a marginal speedup.
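The two tiers can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's source: the `CatalogEntry` shape, the three-entry `catalog`, and the name `lookupByShortcode` are invented for the example, and the shortcode is assumed to arrive with its surrounding colons already stripped.

```ts
// Assumed, simplified entry shape; the real EmojiType has more fields.
interface CatalogEntry {
  char: string
  keywords: string[]
}

// Hypothetical three-entry stand-in, keyed by canonical slug.
const catalog: Record<string, CatalogEntry> = {
  thumbs_up: { char: '👍', keywords: ['thumbsup', '+1', 'like'] },
  rocket: { char: '🚀', keywords: ['launch', 'ship'] },
  fire: { char: '🔥', keywords: ['flame', 'hot'] },
}

function lookupByShortcode(shortcode: string): CatalogEntry | undefined {
  // Tier 1: direct slug hit, O(1).
  if (Object.prototype.hasOwnProperty.call(catalog, shortcode)) {
    return catalog[shortcode]
  }
  // Tier 2: keyword scan, O(M · K_avg); stops at the first matching entry.
  const slug = Object.keys(catalog).find((key) => catalog[key].keywords.includes(shortcode))
  return slug === undefined ? undefined : catalog[slug]
}
```

With this stand-in data, `lookupByShortcode('rocket')` takes the first tier, while `lookupByShortcode('thumbsup')` falls through to the scan.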

## Optimizations already in place

| Optimization                             | Where                                             | Why it matters                                                       |
| ---------------------------------------- | ------------------------------------------------- | -------------------------------------------------------------------- |
| Catalog as static JSON import            | `import emojiLibJson from './lib/emoji-lib.json'` | Bundlers inline it as a JS object literal — no JSON.parse at runtime |
| Single Twemoji parse per HTML conversion | `__parseEmojiToHtml`                              | Twemoji is the slowest single op; calling it twice is wasted work    |
| `entitiesFound` dedup                    | `__parseEmojiToHtml`                              | Same emoji appearing 5× → 1 regex replace, not 5                     |
| Two-tier shortcode lookup                | `getEmojiObjectByShortcode`                       | Slug path is O(1); keyword scan only when needed                     |
| Frozen-by-convention catalog             | `emojiLibJsonData`                                | Consumers can safely cache references; nothing mutates               |

## Optimizations _not_ applied (and why)

| Not done                                       | Reason                                                                                           |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| RegExp caching for `parseToShortcode`          | Adds stateful module-level cache; tests pass without it; speeds matter only in tight loops       |
| Inverted keyword index (`{ keyword: emoji }`)  | Doubles memory of the catalog (~5 MB resident) for marginal speedup; current scan is fast enough |
| Streaming/chunked parsing for very long inputs | Unrealistic input size for this package's domain (chat messages, blog posts)                     |
| Web Worker offload                             | Not the package's job — consumers can `worker.postMessage(text)` themselves                      |
| Async API                                      | Adds complexity without benefit; everything is in-memory                                         |

If a consumer benchmarks a real workload and finds these to be the bottleneck, file an issue with numbers and we'll reconsider.
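In the meantime, a consumer who resolves many aliases can build an inverted index on their own side and accept the memory cost there. The following is a rough sketch of that idea rather than anything the package ships: `keywordsBySlug` is a hypothetical two-entry stand-in for the catalog, and `buildKeywordIndex` and `resolveSlug` are names invented for the example.

```ts
// Assumed minimal shape: slug -> keywords. Real catalog entries carry more fields.
const keywordsBySlug: Record<string, string[]> = {
  thumbs_up: ['thumbsup', '+1', 'like'],
  rocket: ['launch', 'ship'],
}

// Build keyword -> slug once. The first slug to claim a keyword keeps it,
// which mirrors the first-match behaviour of a linear scan.
function buildKeywordIndex(source: Record<string, string[]>): Map<string, string> {
  const index = new Map<string, string>()
  for (const slug of Object.keys(source)) {
    for (const keyword of source[slug]) {
      if (!index.has(keyword)) index.set(keyword, slug)
    }
  }
  return index
}

const keywordIndex = buildKeywordIndex(keywordsBySlug)

// O(1) alias resolution, at the price of holding every keyword in memory a second time.
function resolveSlug(shortcode: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(keywordsBySlug, shortcode)
    ? shortcode
    : keywordIndex.get(shortcode)
}
```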

## Bundle size

The package adds ~600 KB minified (~543 KB JSON catalog + ~50 KB code + Twemoji) to consumer bundles. This dominates everything:

```bash
ls -lh dist/index.js                          # ~600 KB minified
ls -lh src/lib/emoji-lib.json                  # ~543 KB raw
```

### Why so big?

The catalog has 1906 entries at an average of ~285 bytes each (543 KB / 1906), covering name, slug, group, version, char, and the keywords array. Most of the size is keyword arrays and the `name` field.

### What we won't drop

- `name` — useful for accessibility (`alt=` could use it; we currently use the unicode char instead)
- `slug` — the canonical shortcode for `parseToShortcode`
- `keywords` — the alias support is the package's main value-add
- `char` — used by `parseToUnicode` and `__parseEmojiToHtml`

### What might drop in a future major

- `group`, `emoji_version`, `unicode_version`, `skin_tone_support`: currently exported as part of `EmojiType` but **never read by the runtime**. They're metadata for consumers using `emojiLibJsonData` directly. Removing them would save ~30% of the catalog size but would break any consumer that reads them. Slated for a 3.x migration discussion.

### Lazy-loading for consumers

A consumer worried about initial-load performance can lazy-import:

```ts
let parserPromise: Promise<typeof import('universal-emoji-parser').default> | null = null

async function parseLazy(text: string): Promise<string> {
  parserPromise ??= import('universal-emoji-parser').then((m) => m.default)
  return (await parserPromise).parse(text)
}
```

Webpack/Vite/Rollup turn this into a code-split chunk — the catalog only ships when first needed. Trade-off: first call awaits a network fetch.

### Tree-shaking

The catalog is **not tree-shakeable** — `getEmojiObjectByShortcode` enumerates all keys via `Object.keys(emojiLibJsonData)`, so every emoji is reachable. Even consumers who only call `parseToHtml` ship the whole catalog.

A custom subset (e.g., "only emojis used in our app") would require a fork or a wrapper package that pre-filters at build time. Out of scope for this package.
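For anyone who does go the wrapper route, the pre-filtering step itself is small. A minimal sketch, assuming the wrapper has the full catalog available as a plain object at build time; `pickEntries`, `fullCatalog`, and the entry shape are hypothetical and exist only for illustration.

```ts
// Hypothetical build-step helper: keep only the catalog entries an app actually uses.
function pickEntries<T>(source: Record<string, T>, keep: ReadonlyArray<string>): Record<string, T> {
  const subset: Record<string, T> = {}
  for (const key of keep) {
    // Unknown keys are skipped silently instead of producing undefined entries.
    if (Object.prototype.hasOwnProperty.call(source, key)) {
      subset[key] = source[key]
    }
  }
  return subset
}

// Stand-in data; a real build script would read the full catalog JSON instead.
const fullCatalog: Record<string, { char: string }> = {
  rocket: { char: '🚀' },
  fire: { char: '🔥' },
  star: { char: '⭐' },
}

// The serialized subset is what the wrapper package would ship.
const subsetJson = JSON.stringify(pickEntries(fullCatalog, ['rocket', 'star', 'unknown']))
```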

## Memory footprint

Per-process overhead:

- **Catalog** — ~5 MB resident (parsed JS object representation of the 543 KB JSON)
- **Code** — ~50 KB
- **`@twemoji/parser`** — ~50 KB

Loaded once per process; doesn't grow with usage.

In Node, the catalog is likely the largest single contributor to `process.memoryUsage().heapUsed` for any app that uses this package and not much else. This is not a problem for typical Node servers, but it is relevant in memory-constrained environments (256 MB Lambdas, Cloudflare Workers).

## Throughput benchmarks

We don't have a wired-up benchmark suite. Rough numbers from manual measurement (Node 22 on M1):

| Operation                                      | Input                               | Latency                                |
| ---------------------------------------------- | ----------------------------------- | -------------------------------------- |
| `parse('hello')`                               | No emojis                           | < 0.1 ms                               |
| `parse('hello :smile: 🚀')`                    | 2 emojis, 1 shortcode               | ~0.3 ms                                |
| `parse(<200 char chat message with 5 emojis>)` | Realistic chat                      | ~0.5 ms                                |
| `parseToShortcode('🚀 ⭐️ ❤️ 😎 🔥')`           | 5 Unicode emojis                    | ~0.8 ms (regex construction dominates) |
| `parseToShortcode(<1 KB text>)`                | Same alternation regex, longer scan | ~1.5 ms                                |
| `parseToHtml(<1 KB text with 20 emojis>)`      | Full pipeline                       | ~2 ms                                  |

Twemoji's `parse()` is the dominant cost in `parseToHtml`. Catalog lookups are sub-microsecond.

If a consumer reports >10 ms latencies on realistic input, that's a bug — open an issue.
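Numbers attached to such an issue are easier to compare when they come from a small, repeatable harness. Below is one possible sketch, not a project-provided tool: `benchMs` is a hypothetical helper, and the regex workload is only a stand-in to be swapped for the real parser call being measured. It relies on the global `performance.now()`, which is available in current Node versions and in browsers.

```ts
// Average wall-clock milliseconds per call of `fn`, measured after a short warm-up.
function benchMs(fn: () => unknown, iterations = 1000, warmup = 100): number {
  // Warm-up runs give the JIT a chance to optimize before timing starts.
  for (let i = 0; i < warmup; i++) fn()
  const start = performance.now()
  for (let i = 0; i < iterations; i++) fn()
  return (performance.now() - start) / iterations
}

// Stand-in workload; replace the callback with the parser call under test.
const sample = 'hello :smile: world'
const avgMs = benchMs(() => sample.replace(/:([a-z_]+):/g, '<$1>'))
console.log(`avg ${avgMs.toFixed(4)} ms per call`)
```

Averaging over many iterations matters here because a single sub-millisecond call is below the noise floor of most timers.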

## Adding new code paths — performance checklist

When adding a new method or feature:

- [ ] **No async.** Don't introduce `Promise` returns or `await` calls — the catalog is in-memory; sync is faster and simpler
- [ ] **No catalog mutation.** `emojiLibJsonData` must remain a reference-stable, deep-frozen-by-convention object
- [ ] **Cache RegExp where possible.** If a regex doesn't depend on the input, build it once at module init, not per call
- [ ] **No iteration of the catalog** if a direct lookup will do. `emojiLibJsonData[slug]` is always faster than `Object.keys(...).find(...)`
- [ ] **No new fields on `EmojiType`** without measuring the bundle-size delta. Each field × 1906 entries × every consumer's bundle adds up
- [ ] **Test with realistic input.** `:smile:` is fine for unit tests; don't optimize for the trivial case at the cost of long inputs

## Profiling

Quick profile of a hot call:

```bash
node --prof -e "const u = require('./dist/index.js'); for (let i = 0; i < 10000; i++) u.parse('hello :smile: 🚀 ⭐️ ❤️ 😎')"
node --prof-process isolate-*.log
```

The output will show you which functions dominate. Expected: `RegExp.prototype[@@matchAll]` and the alternation regex construction in `parseToShortcode` — confirming the analysis above.

For browser perf, Chrome DevTools' Performance panel works on bundles built by `npm run build:dev` (production minification obscures function names).

## Common performance mistakes

1. **Constructing a new `RegExp` per loop iteration** — pull it out:
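   ```ts
   const texts = ['a :smile: b', 'no emoji here']
   const source = ':[a-z_]+:' // any pattern that does not depend on the loop variable

   // ❌
   for (const t of texts) {
     const r = new RegExp(source) // built every iteration
     t.match(r)
   }
   // ✅
   const r = new RegExp(source) // built once, reused for every text
   for (const t of texts) t.match(r)
   ```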
2. **Cloning the catalog** — `JSON.parse(JSON.stringify(emojiLibJsonData))` is the regenerator's pattern; never do it at runtime
3. **`for-in` over the catalog**: `for-in` also visits inherited enumerable properties and can be slower than `Object.keys(...).forEach`. The package uses `Object.keys` consistently
4. **`indexOf` chains**: `array.indexOf(x) !== -1` is harder to read than `array.includes(x)` and is no faster
5. **Reading `emojiLibJsonData.length`** — there's no `length`; it's an object, not an array. Use `Object.keys(emojiLibJsonData).length`
