Overview [![Build Status](https://travis-ci.org/lydell/js-tokens.png?branch=master)](https://travis-ci.org/lydell/js-tokens)
========

A regex that tokenizes JavaScript.

```js
var jsTokens = require("js-tokens")

var jsString = "var foo=opts.foo;\n..."

jsString.match(jsTokens)
// ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
```


Installation
============

- `npm install js-tokens`

```js
var jsTokens = require("js-tokens")
```


Usage
=====

### `jsTokens` ###

A regex with the `g` flag that matches JavaScript tokens.

The regex _always_ matches, even invalid JavaScript and the empty string.

The next match is always directly after the previous.
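
Because every match starts where the previous one ended, joining all matches should give back the original input. The sketch below illustrates that property with `toyTokens`, a deliberately tiny stand-in regex invented for this example (it is not the `jsTokens` regex), so that it runs without the package installed.

```js
// Toy stand-in, NOT the real js-tokens regex: word characters, whitespace,
// the empty string (`^$`), or any single leftover character.
var toyTokens = /\w+|\s+|^$|[\s\S]/g

var source = "var foo=opts.foo;\n"
var tokens = source.match(toyTokens)

// No character is skipped, so the pieces add up to the input again.
var roundTrip = tokens.join("")

// The empty string still produces a match (a single empty token).
var emptyTokens = "".match(toyTokens)
```

Here `roundTrip === source` holds, and `emptyTokens` is `[""]`. The same round-trip check is a cheap sanity test to run against the real `jsTokens` regex.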

### `var token = jsTokens.matchToToken(match)` ###

Takes a `match` returned by `jsTokens.exec(string)`, and returns a `{type:
String, value: String}` object. The following types are available:

- string
- comment
- regex
- number
- name
- punctuator
- whitespace
- invalid

Multi-line comments and strings also have a `closed` property indicating whether
the token was closed or not (see below).
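
A typical way to use this is to call `exec` in a loop and convert each match. Below is a minimal sketch of such a loop. `collectTokens`, `toyRegex` and `toyToToken` are hypothetical names made up for this example; the toy pair only stands in for `jsTokens` and `jsTokens.matchToToken` so the snippet runs on its own.

```js
// Hypothetical helper: walks `source` with any global regex and turns each
// match into a token object via `toToken`.
function collectTokens(regex, toToken, source) {
  var tokens = []
  var match
  regex.lastIndex = 0
  while ((match = regex.exec(source)) !== null) {
    tokens.push(toToken(match))
    // An empty match does not advance `lastIndex`, so stop instead of
    // looping forever (this happens for empty input).
    if (match[0] === "") break
  }
  regex.lastIndex = 0
  return tokens
}

// Toy stand-ins (assumptions, not the library's regex or converter).
var toyRegex = /(\s+)|([A-Za-z_$][\w$]*)|(^$|[\s\S])/g
function toyToToken(match) {
  var type = match[1] !== undefined ? "whitespace"
    : match[2] !== undefined ? "name"
    : "invalid"
  return {type: type, value: match[0]}
}

var exampleTokens = collectTokens(toyRegex, toyToToken, "foo bar")
```

With the package installed, the same helper should accept the real pair: `collectTokens(jsTokens, jsTokens.matchToToken, source)`.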

Comments and strings both come in several flavors. To distinguish them, check if
the token starts with `//`, `/*`, `'`, `"` or `` ` ``.
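
The prefix check can be sketched as a small helper. `tokenFlavor` and the flavor names it returns are invented for this example; only the `type` and `value` properties come from the token objects described above.

```js
// Hypothetical helper: refines "comment" and "string" tokens by looking at
// how their value starts. Other token types are returned unchanged.
function tokenFlavor(token) {
  var value = token.value
  if (token.type === "comment") {
    return value.slice(0, 2) === "//" ? "single-line comment" : "multi-line comment"
  }
  if (token.type === "string") {
    if (value.charAt(0) === "'") return "single-quoted string"
    if (value.charAt(0) === '"') return "double-quoted string"
    return "template string"
  }
  return token.type
}

var flavors = [
  tokenFlavor({type: "comment", value: "// note"}),
  tokenFlavor({type: "comment", value: "/* note */", closed: true}),
  tokenFlavor({type: "string", value: "`a${b}`", closed: true}),
  tokenFlavor({type: "name", value: "foo"})
]
```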

Names are ECMAScript IdentifierNames, that is, including both identifiers and
keywords. You may use [is-keyword-js] to tell them apart.

Whitespace includes both line terminators and other whitespace.

For example usage, please see this [gist].

[is-keyword-js]: https://github.com/crissdev/is-keyword-js
[gist]: https://gist.github.com/lydell/be49dbf80c382c473004


Invalid code handling
=====================

Unterminated strings are still matched as strings. JavaScript strings cannot
contain (unescaped) newlines, so unterminated strings simply end at the end of
the line. Unterminated template strings can contain unescaped newlines, though,
so they go on to the end of input.
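
One way to get this behavior from a regex is to make the closing quote optional and capture it, so that its presence tells you whether the string was closed. The following is a simplified sketch of that idea; `toyString` and `matchString` are made up for this example and are not the pattern or API that `jsTokens` uses.

```js
// Toy pattern (an assumption for illustration): an opening quote, then any
// run of escapes or non-newline characters other than that quote, then an
// OPTIONAL closing quote captured in group 2.
var toyString = /(['"])(?:(?!\1|\\).|\\[\s\S])*(\1)?/

function matchString(source) {
  var match = toyString.exec(source)
  if (match === null) return null
  // Group 2 is undefined when the closing quote was never found.
  return {type: "string", value: match[0], closed: match[2] !== undefined}
}

var terminated = matchString('"abc" + rest')
var unterminated = matchString('"abc\nrest')
```

Since `.` does not match line terminators, the unterminated case stops at the end of the line: `unterminated.value` is `"abc` and `unterminated.closed` is `false`.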

Unterminated multi-line comments are also still matched as comments. They
simply go on to the end of the input.

Unterminated regex literals are likely matched as division and whatever is
inside the regex.

Invalid ASCII characters have their own capturing group.

Invalid non-ASCII characters are treated as names, to simplify the matching of
names (except Unicode spaces, which are treated as whitespace).

Regex literals may contain invalid regex syntax. They are still matched as
regex literals. They may also contain repeated regex flags, to keep the regex
simple.

Strings may contain invalid escape sequences.


Limitations
===========

Tokenizing JavaScript using regexes (in fact, _one single regex_) won’t be
perfect. But that’s not the point either.

You may compare jsTokens with [esprima] by using `esprima-compare.js`.
See `npm run esprima-compare`!

[esprima]: http://esprima.org/

### Template string interpolation ###

Template strings are matched as single tokens, from the starting `` ` `` to the
ending `` ` ``, including interpolations (whose tokens are not matched
individually).

Matching template string interpolations requires recursive balancing of `{` and
`}`, something that JavaScript regexes cannot do. Only one level of nesting is
supported.
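
To see why a fixed nesting depth falls out of a non-recursive regex, here is a simplified sketch. `toyInterpolation` is a pattern written for this example only (it is not taken from the `jsTokens` source); it spells out exactly one level of inner braces by hand, and anything deeper no longer matches.

```js
// Toy pattern (an assumption for illustration): `${`, then plain characters
// or ONE hand-written level of `{...}`, then the closing `}`.
var toyInterpolation = /^\$\{(?:[^{}]|\{[^{}]*\})*\}$/

function isBalancedInterpolation(text) {
  return toyInterpolation.test(text)
}

var flat = isBalancedInterpolation("${a}")
var oneLevel = isBalancedInterpolation("${ {a: 1} }")
var twoLevels = isBalancedInterpolation("${ {a: {b: 1}} }")
```

`flat` and `oneLevel` are `true`, while `twoLevels` is `false`. Supporting another level would mean writing out another layer of the pattern by hand.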

### Division and regex literals collision ###

Consider this example:

```js
var g = 9.82
var number = bar / 2/g

var regex = / 2/g
```

A human can easily understand that in the `number` line we’re dealing with
division, and in the `regex` line we’re dealing with a regex literal. How come?
Because humans can look at the whole code to put the `/` characters in context.
A JavaScript regex cannot. It only sees forwards.

When the `jsTokens` regex scans through the above, it will see the following
at the end of both the `number` and `regex` lines:

```js
/ 2/g
```

It is then impossible to know if that is a regex literal, or part of an
expression dealing with division.

Here is a similar case:

```js
foo /= 2/g
foo(/= 2/g)
```

The first line divides the `foo` variable by `2/g`. The second line calls the
`foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
sees forwards, it cannot tell the two cases apart.

There are some cases where we _can_ tell division and regex literals apart,
though.

First off, we have the simple cases where there’s only one slash in the line:

```js
var foo = 2/g
foo /= 2
```

Regex literals cannot contain newlines, so the above cases are correctly
identified as division. Things are only problematic when there is more than
one non-comment slash in a single line.

Secondly, not every character is a valid regex flag.

```js
var number = bar / 2/e
```

The above example is also correctly identified as division, because `e` is not a
valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
(any letter) as flags, but it is not worth it since it increases the number of
ambiguous cases. So only the standard `g`, `m`, `i`, `y` and `u` flags are
allowed. This means that the above example will be identified as division as
long as you don’t rename the `e` variable to some combination of the letters
`gmiyu`, 1 to 5 characters long.
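
If you want to check whether a variable name would run into this, the rule can be sketched as a one-line test. `looksLikeRegexFlags` is a hypothetical helper written for this example, built only from the rule stated above (1 to 5 letters out of `gmiyu`, repeats allowed).

```js
// Hypothetical helper: would this identifier be read as regex flags?
function looksLikeRegexFlags(name) {
  return /^[gmiyu]{1,5}$/.test(name)
}

var results = {
  e: looksLikeRegexFlags("e"),
  g: looksLikeRegexFlags("g"),
  gi: looksLikeRegexFlags("gi"),
  gg: looksLikeRegexFlags("gg"),
  gmiyug: looksLikeRegexFlags("gmiyug")
}
```

Only `g`, `gi` and `gg` come out as `true` here; `e` contains a letter that is not a flag, and `gmiyug` is too long.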

Lastly, we can look _forward_ for information.

- If the token following what looks like a regex literal is not valid after a
  regex literal, but is valid in a division expression, then the regex literal
  is treated as division instead. For example, a flagless regex cannot be
  followed by a string, number or name, but all of those three can be the
  denominator of a division.
- Generally, if what looks like a regex literal is followed by an operator, the
  regex literal is treated as division instead. This is because regexes are
  seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
  could likely be part of such an expression.
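
Both rules can be expressed with lookaheads. The sketch below is a loose illustration of the idea only: `toyRegexLiteral` is a much-reduced pattern written for this example, and the real `jsTokens` regex handles more cases and may draw the line differently.

```js
// Toy pattern (an assumption for illustration). After the closing slash:
// - with no flags, reject if a name, number or string could follow;
// - with flags (1 to 5 of `gmiyu`), reject if an operator follows.
var toyRegexLiteral = /^\/(?!\*)(?:[^\/\\\n]|\\.)+\/(?:(?!\s*(?:\b|['"]|\.?\d))|[gmiyu]{1,5}\b(?!\s*[+\-*%&|^<>!=?]))/

// Hypothetical helper: `text` is the rest of the line starting at a slash.
function looksLikeRegexLiteral(text) {
  return toyRegexLiteral.test(text)
}

var checks = {
  endOfStatement: looksLikeRegexLiteral("/ 2/g"),
  followedByName: looksLikeRegexLiteral("/ 2/ foo"),
  followedByOperator: looksLikeRegexLiteral("/ 2/g + 1"),
  invalidFlag: looksLikeRegexLiteral("/ 2/e"),
  methodCall: looksLikeRegexLiteral("/ab+c/.test(x)")
}
```

In this toy version, `endOfStatement` and `methodCall` are `true` and the other three are `false`, which mirrors the bullets above.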

Please consult the regex source and the test cases for precise information on
when regex or division is matched (should you need to know). In short, you
could sum it up as:

If the end of a statement looks like a regex literal (even if it isn’t), it
will be treated as one. Otherwise it should work as expected (if you write sane
code).


License
=======

[The X11 (“MIT”) License](LICENSE).