# Simple HTML Tokenizer [![Build Status](https://github.com/tildeio/simple-html-tokenizer/workflows/CI/badge.svg)](https://github.com/tildeio/simple-html-tokenizer/actions?query=workflow%3ACI)

Simple HTML Tokenizer is a lightweight JavaScript library for tokenizing
the kind of HTML normally found in templates. A typical use is
preprocessing templates so that a template element behaves differently
depending on whether it appears in an attribute or in text.

It is not a full HTML5 tokenizer. It focuses on the kind of HTML that is
used in templates: content designed to be inserted into the `<body>`
that contains no `<script>` tags.

In particular, Simple HTML Tokenizer does not handle several states from
the [HTML5 Tokenizer Specification][1]:

* Any states involving `CDATA` or `RCDATA`
* Any states involving `<script>`
* Any states involving `<!DOCTYPE>`
* The bogus comment state

It also passes character references through unchanged, instead of
tokenizing and processing them, because the preprocessed templates will
ultimately be parsed in a real browser context.

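
The unsupported constructs listed above can be checked for before a template is tokenized. What follows is a minimal sketch and not part of Simple HTML Tokenizer: `findUnsupportedConstructs` and `UNSUPPORTED_PATTERNS` are hypothetical names, the regular expressions are rough approximations, and treating `<textarea>` and `<title>` as the RCDATA elements is an assumption taken from the HTML specification rather than from this library. It makes no attempt to detect bogus comments.

```js
// Hypothetical pre-check, not part of Simple HTML Tokenizer.
// Each entry pairs a human-readable name with an approximate pattern.
var UNSUPPORTED_PATTERNS = [
  { name: "script element", pattern: /<script[\s>\/]/i },
  { name: "DOCTYPE declaration", pattern: /<!doctype/i },
  { name: "CDATA section", pattern: /<!\[CDATA\[/ },
  // Assumption: RCDATA states are entered for <textarea> and <title>.
  { name: "RCDATA element", pattern: /<(textarea|title)[\s>\/]/i }
];

// Return the names of the unsupported constructs found in `html`,
// in the order they are listed in UNSUPPORTED_PATTERNS.
function findUnsupportedConstructs(html) {
  return UNSUPPORTED_PATTERNS
    .filter(function (entry) { return entry.pattern.test(html); })
    .map(function (entry) { return entry.name; });
}

findUnsupportedConstructs("<div id='foo' href=bar class=\"bat\">"); //=> []
findUnsupportedConstructs("<textarea>hi</textarea>"); //=> ["RCDATA element"]
```

A preprocessing step could call a check like this first and reject any template for which the returned list is not empty.
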
At the moment, Simple HTML Tokenizer does not handle some of the error
states defined by the tokenizer spec. Ultimately, I plan to support all
error states, as well as provide information about tokenizer errors in
debug mode.

[1]: http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html

## Usage

Pass an HTML string to `HTML5Tokenizer.tokenize` to get an array of tokens:

```js
var tokens = HTML5Tokenizer.tokenize("<div id='foo' href=bar class=\"bat\">");

var token = tokens[0];
token.tagName //=> "div"
token.attributes //=> [["id", "foo"], ["href", "bar"], ["class", "bat"]]
token.selfClosing //=> false
```

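
Because `attributes` is an array of `[name, value]` pairs rather than an object, finding one attribute means scanning the array. The sketch below shows one way to do that. `getAttribute` is a hypothetical helper, not part of the library, and the token is written out by hand in the shape shown above so that the snippet runs without the library loaded.

```js
// Hand-written token in the shape shown in the usage example above.
var token = {
  tagName: "div",
  attributes: [["id", "foo"], ["href", "bar"], ["class", "bat"]],
  selfClosing: false
};

// Hypothetical helper: return the value of the first attribute called
// `name`, or undefined when the token has no such attribute.
function getAttribute(token, name) {
  for (var i = 0; i < token.attributes.length; i++) {
    if (token.attributes[i][0] === name) {
      return token.attributes[i][1];
    }
  }
  return undefined;
}

getAttribute(token, "href"); //=> "bar"
getAttribute(token, "title"); //=> undefined
```
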
## Building and running the tests

```bash
npm install
npm test
```