<p align="center">
  <a href="#readme"><img alt="TagSoup" src="./assets/logo.png" width="250" /></a>
</p>

TagSoup is [the fastest](#performance) pure JS SAX/DOM XML/HTML parser and serializer.

- Extremely low memory consumption.
- Tolerant of malformed tag nesting, missing end tags, etc.
- Recognizes CDATA sections, processing instructions, and DOCTYPE declarations.
- Supports both strict XML and forgiving HTML parsing modes.
- [20 kB gzipped](https://bundlephobia.com/result?p=tag-soup), including dependencies.
- Check out TagSoup dependencies: [Speedy Entities](https://github.com/smikhalevski/speedy-entities#readme)
  and [Flyweight DOM](https://github.com/smikhalevski/flyweight-dom#readme).

```sh
npm install --save-prod tag-soup
```

- [API docs](https://smikhalevski.github.io/tag-soup/)
- [DOM parsing](#dom-parsing)
- [SAX parsing](#sax-parsing)
- [Tokenization](#tokenization)
- [Serialization](#serialization)
- [Parser options](#parser-options)
- [Performance](#performance)
- [Limitations](#limitations)

# DOM parsing

TagSoup exports a preconfigured [`HTMLDOMParser`](https://smikhalevski.github.io/tag-soup/variables/HTMLDOMParser.html)
that parses HTML markup into a DOM node. This parser never throws errors during parsing and forgives malformed markup:

```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';

const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```

`HTMLDOMParser` decodes both HTML entities and numeric character references with
[`decodeHTML`](https://smikhalevski.github.io/speedy-entities/variables/decodeHTML.html).

[`XMLDOMParser`](https://smikhalevski.github.io/tag-soup/variables/XMLDOMParser.html)
parses XML markup as a DOM node. It throws
[`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't satisfy the XML spec:

```ts
import { XMLDOMParser, toXML } from 'tag-soup';

XMLDOMParser.parseFragment('<p>hello</br>');
// ❌ ParserError: Unexpected end tag.

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');
// ⮕ DocumentFragment

toXML(fragment);
// ⮕ '<p>hello<br/></p>'
```

`XMLDOMParser` decodes both XML entities and numeric character references with
[`decodeXML`](https://smikhalevski.github.io/speedy-entities/variables/decodeXML.html).
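
Because `XMLDOMParser` throws on malformed input, callers that handle untrusted markup may want to capture the error instead of letting it propagate. A minimal sketch of one way to do that; `ParseResult` and `tryParse` are hypothetical names and not part of TagSoup:

```ts
// Result of a parse attempt: either the parsed value or the thrown error.
type ParseResult<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Hypothetical helper: runs `parse` and captures anything it throws.
function tryParse<T>(parse: () => T): ParseResult<T> {
  try {
    return { ok: true, value: parse() };
  } catch (error) {
    return { ok: false, error };
  }
}
```

For example, `tryParse(() => XMLDOMParser.parseFragment('<p>hello</br>'))` would be expected to return `{ ok: false, error }`, where `error` is the `ParserError` shown above.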

TagSoup uses [Flyweight DOM](https://github.com/smikhalevski/flyweight-dom#readme) nodes, which provide many standard
DOM manipulation features:

```ts
import { HTMLDOMParser } from 'tag-soup';

const document = HTMLDOMParser.parseDocument('<!DOCTYPE html><html>hello</html>');

document.doctype.name;
// ⮕ 'html'

document.textContent;
// ⮕ 'hello'
```

For example, you can use `TreeWalker` to traverse DOM nodes:

```ts
import { XMLDOMParser } from 'tag-soup';
import { TreeWalker, NodeFilter } from 'flyweight-dom';

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');

const treeWalker = new TreeWalker(fragment, NodeFilter.SHOW_TEXT);

treeWalker.nextNode();
// ⮕ Text { 'hello' }
```
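
To visit every matching node rather than only the first, the walker can be drained in a loop. A minimal sketch, assuming `nextNode()` returns `null` once the traversal is exhausted, as the standard DOM `TreeWalker` does; `WalkerLike` and `drainWalker` are hypothetical names, not part of TagSoup or Flyweight DOM:

```ts
// Minimal shape this helper relies on; a TreeWalker is assumed to satisfy it.
interface WalkerLike<T> {
  nextNode(): T | null;
}

// Hypothetical helper: collects every node the walker yields, in document order.
function drainWalker<T>(walker: WalkerLike<T>): T[] {
  const nodes: T[] = [];

  for (let node = walker.nextNode(); node !== null; node = walker.nextNode()) {
    nodes.push(node);
  }
  return nodes;
}
```

Called on a freshly created walker such as the `treeWalker` above, `drainWalker(treeWalker)` would be expected to return an array holding the single `Text` node.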

Create a custom DOM parser using
[`createDOMParser`](https://smikhalevski.github.io/tag-soup/functions/createDOMParser.html):

```ts
import { createDOMParser } from 'tag-soup';

const myParser = createDOMParser({
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>');
// ⮕ DocumentFragment
```

# SAX parsing

TagSoup exports a preconfigured [`HTMLSAXParser`](https://smikhalevski.github.io/tag-soup/variables/HTMLSAXParser.html)
that parses HTML markup and calls handler methods when a token is read. This parser never throws errors during parsing
and forgives malformed markup:

```ts
import { HTMLSAXParser } from 'tag-soup';

HTMLSAXParser.parseFragment('<p>hello<p>cool</br>', {
  onStartTagOpening(tagName) {
    // Called with 'p', 'p', and 'br'
  },
  onText(text) {
    // Called with 'hello' and 'cool'
  },
});
```

[`XMLSAXParser`](https://smikhalevski.github.io/tag-soup/variables/XMLSAXParser.html) parses XML markup and calls
handler methods when a token is read. It throws
[`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't satisfy the XML spec:

```ts
import { XMLSAXParser } from 'tag-soup';

XMLSAXParser.parseFragment('<p>hello</br>', {});
// ❌ ParserError: Unexpected end tag.

XMLSAXParser.parseFragment('<p>hello<br/></p>', {
  onEndTag(tagName) {
    // Called with 'br' and 'p'
  },
});
```

Create a custom SAX parser using
[`createSAXParser`](https://smikhalevski.github.io/tag-soup/functions/createSAXParser.html):

```ts
import { createSAXParser } from 'tag-soup';

const myParser = createSAXParser({
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>', {
  onStartTagOpening(tagName) {
    // Called with 'p' and 'br'
  },
});
```

## SAX handler callbacks

The [`SAXHandler`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html) defines the following optional
callbacks. Implement only the ones you need.

| Callback                                                                                                                | Description                                    |
| :---------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------- |
| [`onStartTagOpening`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#onstarttagopening)             | A start tag name is read.                      |
| [`onAttribute`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#onattribute)                         | An attribute and its decoded value were read.  |
| [`onStartTagClosing`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#onstarttagclosing)             | A start tag is closed `>`.                     |
| [`onStartTagSelfClosing`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#onstarttagselfclosing)     | A start tag is self-closed `/>`.               |
| [`onStartTag`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#onstarttag)                           | A start tag and its attributes were read.      |
| [`onEndTag`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#onendtag)                               | An end tag matching an open start tag is read. |
| [`onText`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#ontext)                                   | Decoded text content is read.                  |
| [`onComment`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#oncomment)                             | A comment is read.                             |
| [`onDoctype`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#ondoctype)                             | A DOCTYPE declaration is read.                 |
| [`onCDATASection`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#oncdatasection)                   | A CDATA section is read.                       |
| [`onProcessingInstruction`](https://smikhalevski.github.io/tag-soup/interfaces/SAXHandler.html#onprocessinginstruction) | A processing instruction is read.              |

Example using several callbacks at once:

```ts
import { HTMLSAXParser } from 'tag-soup';

HTMLSAXParser.parseFragment('<!-- greeting --><p class="x">hello</p>', {
  onComment(data) {
    // Called with ' greeting '
  },
  onStartTagOpening(tagName) {
    // Called with 'p'
  },
  onAttribute(name, value) {
    // Called with 'class', 'x'
  },
  onStartTagClosing() {
    // Called after all attributes of 'p' are read
  },
  onStartTag(tagName, attributes, isSelfClosing) {
    // Called after onStartTagClosing
  },
  onText(text) {
    // Called with 'hello'
  },
  onEndTag(tagName) {
    // Called with 'p'
  },
});
```
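
Handlers often need to accumulate state across callbacks. The sketch below wraps that state in a factory; `TextAndTagHandler` and `createSummaryHandler` are hypothetical names, and the callback signatures are assumed from the examples above rather than taken from the `SAXHandler` typings:

```ts
// Local subset of the handler shape, limited to the two callbacks used here.
interface TextAndTagHandler {
  onStartTagOpening(tagName: string): void;
  onText(text: string): void;
}

// Hypothetical helper: counts start tags per tag name and collects text chunks.
function createSummaryHandler() {
  const tagCounts = new Map<string, number>();
  const textChunks: string[] = [];

  const handler: TextAndTagHandler = {
    onStartTagOpening(tagName) {
      tagCounts.set(tagName, (tagCounts.get(tagName) ?? 0) + 1);
    },
    onText(text) {
      textChunks.push(text);
    },
  };

  return { handler, tagCounts, textChunks };
}
```

Passing `handler` to `HTMLSAXParser.parseFragment('<p>hello<p>cool</br>', handler)` should, going by the earlier example, leave `tagCounts` with two `p` entries and one `br`, and `textChunks` with `'hello'` and `'cool'`.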

# Tokenization

TagSoup exports a preconfigured
[`HTMLTokenizer`](https://smikhalevski.github.io/tag-soup/variables/HTMLTokenizer.html) that parses HTML markup and
invokes a callback when a token is read. This tokenizer never throws errors during tokenization and forgives malformed
markup:

```ts
import { HTMLTokenizer } from 'tag-soup';

HTMLTokenizer.tokenizeFragment('<p>hello<p>cool</br>', (token, startIndex, endIndex) => {
  // Handle token
});
```

[`XMLTokenizer`](https://smikhalevski.github.io/tag-soup/variables/XMLTokenizer.html) parses XML markup and invokes
a callback when a token is read. It throws
[`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't satisfy the XML spec:

```ts
import { XMLTokenizer } from 'tag-soup';

XMLTokenizer.tokenizeFragment('<p>hello</br>', (token, startIndex, endIndex) => {});
// ❌ ParserError: Unexpected end tag.

XMLTokenizer.tokenizeFragment('<p>hello<br/></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```

Create a custom tokenizer using
[`createTokenizer`](https://smikhalevski.github.io/tag-soup/functions/createTokenizer.html):

```ts
import { createTokenizer } from 'tag-soup';

const myTokenizer = createTokenizer({
  voidTags: ['br'],
});

myTokenizer.tokenizeFragment('<p><br></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```

The [`Token`](https://smikhalevski.github.io/tag-soup/types/Token.html) passed to the callback is one of the
following string literals. `startIndex` and `endIndex` are the character positions of the token's value in the input.

| Token                             | Description                                                          |
| :-------------------------------- | :------------------------------------------------------------------- |
| `"TEXT"`                          | Text content between tags.                                           |
| `"START_TAG_NAME"`                | The name portion of an opening tag, e.g. `p` in `<p>`.               |
| `"START_TAG_CLOSING"`             | The `>` that closes an opening tag.                                  |
| `"START_TAG_SELF_CLOSING"`        | The `/>` that self-closes a tag.                                     |
| `"END_TAG_NAME"`                  | The name portion of a closing tag, e.g. `p` in `</p>`.               |
| `"ATTRIBUTE_NAME"`                | An attribute name.                                                   |
| `"ATTRIBUTE_VALUE"`               | A decoded attribute value.                                           |
| `"COMMENT"`                       | Comment content, excluding `<!--` and `-->`.                         |
| `"PROCESSING_INSTRUCTION_TARGET"` | The target of a processing instruction, e.g. `xml` in `<?xml ...?>`. |
| `"PROCESSING_INSTRUCTION_DATA"`   | The data portion of a processing instruction.                        |
| `"CDATA_SECTION"`                 | Content of a CDATA section, excluding `<![CDATA[` and `]]>`.         |
| `"DOCTYPE_NAME"`                  | The name in a DOCTYPE declaration, e.g. `html` in `<!DOCTYPE html>`. |

# Serialization

TagSoup exports two preconfigured serializers:
[`toHTML`](https://smikhalevski.github.io/tag-soup/variables/toHTML.html) and
[`toXML`](https://smikhalevski.github.io/tag-soup/variables/toXML.html).

```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';

const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```

Create a custom serializer using
[`createSerializer`](https://smikhalevski.github.io/tag-soup/functions/createSerializer.html):

```ts
import { HTMLDOMParser, createSerializer } from 'tag-soup';

const mySerializer = createSerializer({
  voidTags: ['br'],
});

const fragment = HTMLDOMParser.parseFragment('<p>hello</br>');
// ⮕ DocumentFragment

mySerializer(fragment);
// ⮕ '<p>hello<br></p>'
```

[`SerializerOptions`](https://smikhalevski.github.io/tag-soup/interfaces/SerializerOptions.html) accepts the
following properties:

| Option                                                                                                                                  | Description                                                       |
| :-------------------------------------------------------------------------------------------------------------------------------------- | :---------------------------------------------------------------- |
| [`voidTags`](https://smikhalevski.github.io/tag-soup/interfaces/SerializerOptions.html#voidtags)                                        | Tags that have no content and no closing tag (e.g. `br`, `img`).  |
| [`encodeText`](https://smikhalevski.github.io/tag-soup/interfaces/SerializerOptions.html#encodetext)                                    | Callback to encode text content and attribute values.             |
| [`areSelfClosingTags​Supported`](https://smikhalevski.github.io/tag-soup/interfaces/SerializerOptions.html#areselfclosingtagssupported) | If `true`, void tags are serialized as `<br/>` instead of `<br>`. |
| [`areTagNamesCaseInsensitive`](https://smikhalevski.github.io/tag-soup/interfaces/SerializerOptions.html#aretagnamescaseinsensitive)    | If `true`, tag name comparisons are case-insensitive.             |

Serialize XML with entity encoding:

```ts
import { XMLDOMParser, createSerializer } from 'tag-soup';
import { encodeXML } from 'speedy-entities';

const toXMLEncoded = createSerializer({
  areSelfClosingTagsSupported: true,
  encodeText: encodeXML,
});

const fragment = XMLDOMParser.parseFragment('<note><text>AT&amp;T</text></note>');

toXMLEncoded(fragment);
// ⮕ '<note><text>AT&amp;T</text></note>'
```
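
Where the full entity tables from `speedy-entities` aren't needed, a hand-written escaper may be enough for `encodeText`. A minimal sketch, assuming the callback receives a string and returns the encoded string; `ESCAPES` and `escapeMarkup` are hypothetical names:

```ts
// Characters that are unsafe in text content or quoted attribute values.
const ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&apos;',
};

// Hypothetical helper: replaces each unsafe character with its entity.
function escapeMarkup(text: string): string {
  return text.replace(/[&<>"']/g, char => ESCAPES[char] ?? char);
}
```

It could then be supplied as `createSerializer({ encodeText: escapeMarkup })`.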

# Parser options

`createDOMParser`, `createSAXParser`, and `createTokenizer` accept a
[`ParserOptions`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html) object.

| Option                                                                                                                                                     | Description                                                                                                                                                                    |
| :--------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`voidTags`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#voidtags)                                                               | Tags that have no content and no end tag (e.g. `br`, `img`). See [HTML5 Void Elements](https://www.w3.org/TR/2010/WD-html5-20101019/syntax.html#void-elements).                |
| [`rawTextTags`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#rawtexttags)                                                         | Tags whose content is treated as raw text (e.g. `script`, `style`). See [HTML5 Raw Text Elements](https://www.w3.org/TR/2010/WD-html5-20101019/syntax.html#raw-text-elements). |
| [`foreignTags`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#foreigntags)                                                         | Map from a tag to tokenizer options that are applied inside the tag.                                                                                                           |
| [`decodeText`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#decodetext)                                                           | Callback to decode text content and attribute values (e.g. `decodeHTML` from `speedy-entities`).                                                                               |
| [`implicitlyClosedTags`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#implicitlyclosedtags)                                       | Map from a tag to the list of open tags it implicitly closes. For example `{ h1: ['p'] }` means an opening `<h1>` closes any currently open `<p>`.                             |
| [`implicitlyOpenedTags`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#implicitlyopenedtags)                                       | Tags for which a synthetic start tag is inserted when an unbalanced end tag is encountered (e.g. `['p', 'br']` so `</p>` becomes `<p></p>`).                                   |
| [`areTagNames​CaseInsensitive`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#aretagnamescaseinsensitive)                          | If `true`, tag name comparisons ignore ASCII case.                                                                                                                             |
| [`areCDATASections​Recognized`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#arecdatasectionsrecognized)                          | If `true`, CDATA sections (`<![CDATA[...]]>`) are recognized.                                                                                                                  |
| [`areProcessing​Instructions​Recognized`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#areprocessinginstructionsrecognized)       | If `true`, processing instructions (`<?target data?>`) are recognized.                                                                                                         |
| [`areSelfClosingTags​Recognized`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#areselfclosingtagsrecognized)                      | If `true`, self-closing tags (`<br/>`) are recognized; otherwise treated as start tags.                                                                                        |
| [`isStrict`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#isstrict)                                                               | If `true`, tag names and attributes are validated against XML constraints.                                                                                                     |
| [`areUnbalanced​EndTags​Ignored`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#areunbalancedendtagsignored)                       | If `true`, end tags without a matching start tag are silently dropped instead of throwing.                                                                                     |
| [`areUnbalanced​StartTags​ImplicitlyClosed`](https://smikhalevski.github.io/tag-soup/interfaces/ParserOptions.html#areunbalancedstarttagsimplicitlyclosed) | If `true`, unclosed start tags are forcefully closed at the end of their parent.                                                                                               |

A parser that mimics browser HTML behavior:

```ts
import { createDOMParser } from 'tag-soup';
import { decodeHTML } from 'speedy-entities';

const myParser = createDOMParser({
  voidTags: [
    'area',
    'base',
    'br',
    'col',
    'embed',
    'hr',
    'img',
    'input',
    'link',
    'meta',
    'param',
    'source',
    'track',
    'wbr',
  ],
  rawTextTags: ['script', 'style'],
  decodeText: decodeHTML,
  areTagNamesCaseInsensitive: true,
  areUnbalancedEndTagsIgnored: true,
  areUnbalancedStartTagsImplicitlyClosed: true,
  implicitlyClosedTags: {
    h1: ['p'],
    h2: ['p'],
    li: ['li'],
    dt: ['dd', 'dt'],
    dd: ['dd', 'dt'],
  },
  foreignTags: {
    svg: {
      areCDATASectionsRecognized: true,
      areProcessingInstructionsRecognized: true,
      areSelfClosingTagsRecognized: true,
    },
    math: {
      areCDATASectionsRecognized: true,
      areProcessingInstructionsRecognized: true,
      areSelfClosingTagsRecognized: true,
    },
  },
});
```

# Performance

Execution performance is measured in operations per second (± 5%); higher is better.
Memory consumption (RAM) is measured in bytes; lower is better.

<table>
<tr>
<th align="right" valign="top" rowspan="2">Library</th>
<th align="right" valign="top" rowspan="2">Library size</th>
<th align="center" colspan="2">DOM parsing</th>
<th align="center" colspan="2">SAX parsing</th>
</tr>

<tr>
<td align="right">Ops/sec</td>
<td align="right">RAM</td>
<td align="right">Ops/sec</td>
<td align="right">RAM</td>
</tr>

<tr>
<td align="right">tag-soup&#x200B;@3.2.1</td>
<td align="right">
<a href="https://bundlephobia.com/package/tag-soup@3.0.0">21 kB</a>
</td>
<td align="right"><strong>35 Hz</strong></td>
<td align="right"><strong>22 MB</strong></td>
<td align="right"><strong>54 Hz</strong></td>
<td align="right"><strong>22 kB</strong></td>
</tr>

<tr>
<td align="right">
<a href="https://github.com/fb55/htmlparser2">htmlparser2</a>&#x200B;@12.0.0
</td>
<td align="right">
<a href="https://bundlephobia.com/package/htmlparser2@12.0.0">34 kB</a>
</td>
<td align="right">15 Hz</td>
<td align="right">35 MB</td>
<td align="right">24 Hz</td>
<td align="right">6 MB</td>
</tr>

<tr>
<td align="right">
<a href="https://github.com/inikulin/parse5">parse5</a>&#x200B;@8.0.0
</td>
<td align="right">
<a href="https://bundlephobia.com/package/parse5@8.0.0">45 kB</a>
</td>
<td align="right">7 Hz</td>
<td align="right">105 MB</td>
<td align="right">11 Hz</td>
<td align="right">10 MB</td>
</tr>

</table>

Performance was measured when parsing [the 3.64 MB HTML file](./src/test/test.html).

Tests were conducted using [TooFast](https://github.com/smikhalevski/toofast#readme) on Apple M1 with Node.js v25.6.0.

To reproduce [the performance test suite](./src/test/perf/overall.perf.js) results, clone this repo and run:

```shell
npm ci
npm run build
npm run perf
```

# Limitations

TagSoup doesn't resolve some quirky element structures that malformed HTML may cause.

Assume the following markup:

<!-- prettier-ignore -->
```html
<p><strong>okay
<p>nope
```

With [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) this markup would be transformed to:

```html
<p><strong>okay</strong></p>
<p><strong>nope</strong></p>
```

TagSoup doesn't insert the second `strong` tag:

```html
<p><strong>okay</strong></p>
<p>nope</p>
```
