# Regular Expression Tokenizer

Tokenizes strings that represent regular expressions.

![Depfu](https://img.shields.io/depfu/fent/ret.js)
[![codecov](https://codecov.io/gh/fent/ret.js/branch/master/graph/badge.svg)](https://codecov.io/gh/fent/ret.js)

# Usage

```js
const ret = require('ret');

let tokens = ret(/foo|bar/.source);
```

`tokens` will contain the following object:

```js
{
  "type": ret.types.ROOT,
  "options": [
    [ { "type": ret.types.CHAR, "value": 102 },
      { "type": ret.types.CHAR, "value": 111 },
      { "type": ret.types.CHAR, "value": 111 } ],
    [ { "type": ret.types.CHAR, "value": 98 },
      { "type": ret.types.CHAR, "value": 97 },
      { "type": ret.types.CHAR, "value": 114 } ]
  ]
}
```

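Because each `CHAR` token stores a character code, the alternatives above can be turned back into readable strings with `String.fromCharCode`. The following is a minimal sketch, not part of ret's API: the `types` object is a local stand-in for `ret.types` so the snippet runs without the package installed, and `alternativeToText` is a hypothetical helper name.

```js
// Local stand-in for ret.types so this sketch runs on its own.
// The numeric values are illustrative; real code should use ret.types.
const types = { ROOT: 0, CHAR: 7 };

// The token object shown above for /foo|bar/, written out by hand.
const tokens = {
  type: types.ROOT,
  options: [
    [ { type: types.CHAR, value: 102 },
      { type: types.CHAR, value: 111 },
      { type: types.CHAR, value: 111 } ],
    [ { type: types.CHAR, value: 98 },
      { type: types.CHAR, value: 97 },
      { type: types.CHAR, value: 114 } ],
  ],
};

// Hypothetical helper: joins the character codes of one alternative.
// It assumes every token in the list is a CHAR token.
function alternativeToText(alternative) {
  return alternative.map((token) => String.fromCharCode(token.value)).join('');
}

const words = tokens.options.map(alternativeToText); // ['foo', 'bar']
```
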
# Reconstructing Regular Expressions from Tokens

The `reconstruct` function accepts *any* token and returns, as a string, the *component* of the regular expression that is associated with that token.

```ts
import ret, { reconstruct, types } from 'ret'
const tokens = ret(/foo|bar/.source)
const setToken = {
  "type": types.SET,
  "set": [
    { "type": types.CHAR, "value": 97 },
    { "type": types.CHAR, "value": 98 },
    { "type": types.CHAR, "value": 99 }
  ],
  "not": true
}
reconstruct(tokens) // 'foo|bar'
reconstruct({ "type": types.CHAR, "value": 102 }) // 'f'
reconstruct(setToken) // '^abc'
```

# Token Types

`ret.types` is a collection of the various token types exported by ret.

### ROOT

Only used in the root of the regexp. This is needed due to the possibility of the root containing a pipe `|` character. In that case, the token will have an `options` key that will be an array of arrays of tokens. If not, it will contain a `stack` key that is an array of tokens.

```js
{
  "type": ret.types.ROOT,
  "stack": [token1, token2...],
}
```

```js
{
  "type": ret.types.ROOT,
  "options": [
    [token1, token2...],
    [othertoken1, othertoken2...]
    ...
  ],
}
```

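Code that consumes a `ROOT` token has to handle both shapes. One way to do that, sketched below, is to normalize them into a single list of alternatives. `alternativesOf` is a hypothetical helper, and `types` is a local stand-in for `ret.types` so the snippet runs on its own.

```js
// Local stand-in for ret.types; the numeric values are illustrative only.
const types = { ROOT: 0, CHAR: 7 };

// Hypothetical helper: returns the token's alternatives as an array of
// token arrays, whether it was written with `options` or with `stack`.
function alternativesOf(token) {
  if (Array.isArray(token.options)) return token.options;
  if (Array.isArray(token.stack)) return [token.stack];
  return [];
}

const a = { type: types.CHAR, value: 97 };
const b = { type: types.CHAR, value: 98 };

const plain = { type: types.ROOT, stack: [a, b] };       // shape for /ab/
const piped = { type: types.ROOT, options: [[a], [b]] }; // shape for /a|b/

const plainCount = alternativesOf(plain).length; // 1
const pipedCount = alternativesOf(piped).length; // 2
```

The same helper applies to `GROUP` tokens, which use the same `stack` or `options` keys.
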
### GROUP

Groups contain tokens that are inside parentheses. If the group begins with `?` followed by another character, it's a special type of group. A `:` tells the group not to be remembered when `exec` is used. `=` means the previous token matches only if followed by this group, and `!` means the previous token matches only if NOT followed by it.

Like root, it can contain an `options` key instead of `stack` if there is a pipe.

```js
{
  "type": ret.types.GROUP,
  "remember": true,
  "followedBy": false,
  "notFollowedBy": false,
  "stack": [token1, token2...],
}
```

```js
{
  "type": ret.types.GROUP,
  "remember": true,
  "followedBy": false,
  "notFollowedBy": false,
  "options": [
    [token1, token2...],
    [othertoken1, othertoken2...]
    ...
  ],
}
```

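The three flags map back to the four ways a group can open. The sketch below illustrates that mapping; `groupPrefix` is a hypothetical helper and is not part of ret.

```js
// Hypothetical helper: maps the flags of a GROUP token to the syntax that
// opens the group in the regexp source.
function groupPrefix(group) {
  if (group.followedBy) return '(?=';
  if (group.notFollowedBy) return '(?!';
  return group.remember ? '(' : '(?:';
}

const capturing = groupPrefix({ remember: true });                           // '('
const nonCapturing = groupPrefix({ remember: false });                       // '(?:'
const lookahead = groupPrefix({ remember: false, followedBy: true });        // '(?='
const negLookahead = groupPrefix({ remember: false, notFollowedBy: true });  // '(?!'
```
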
### POSITION

`\b`, `\B`, `^`, and `$` specify positions in the regexp.

```js
{
  "type": ret.types.POSITION,
  "value": "^",
}
```

### SET

Contains a key `set` specifying what tokens are allowed and a key `not` specifying if the set should be negated. A set can contain other sets, ranges, and characters.

```js
{
  "type": ret.types.SET,
  "set": [token1, token2...],
  "not": false,
}
```

### RANGE

Used in set tokens to specify a character range. `from` and `to` are character codes.

```js
{
  "type": ret.types.RANGE,
  "from": 97,
  "to": 122,
}
```

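To illustrate how `SET`, `RANGE`, and `CHAR` tokens combine, here is a minimal sketch of testing a character code against a set, including nested sets and negation. `matchesCode` is a hypothetical helper, `types` is a local stand-in for `ret.types`, and flags such as case-insensitivity are not considered.

```js
// Local stand-in for ret.types; the numeric values are illustrative only.
const types = { SET: 3, RANGE: 4, CHAR: 7 };

// Hypothetical helper: does the character code `code` satisfy `token`?
// Handles only the token kinds that may appear inside a set.
function matchesCode(token, code) {
  switch (token.type) {
    case types.CHAR:
      return token.value === code;
    case types.RANGE:
      return token.from <= code && code <= token.to;
    case types.SET: {
      const inside = token.set.some((member) => matchesCode(member, code));
      return token.not ? !inside : inside;
    }
    default:
      throw new Error('Unexpected token type inside a set: ' + token.type);
  }
}

// Shape of the token for [^a-z_]: anything except a lowercase letter or "_".
const notLowerOrUnderscore = {
  type: types.SET,
  set: [
    { type: types.RANGE, from: 97, to: 122 },
    { type: types.CHAR, value: 95 },
  ],
  not: true,
};

const rejectsLowerQ = matchesCode(notLowerOrUnderscore, 'q'.charCodeAt(0)); // false
const acceptsUpperQ = matchesCode(notLowerOrUnderscore, 'Q'.charCodeAt(0)); // true
```
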
### REPETITION

```js
{
  "type": ret.types.REPETITION,
  "min": 0,
  "max": Infinity,
  "value": token,
}
```

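`min` and `max` bound how many times `value` may repeat, with `Infinity` standing for no upper limit. As a rough sketch of how those two numbers correspond to quantifier syntax, the hypothetical helper below (not part of ret) renders them back to text.

```js
// Hypothetical helper: renders the min/max of a REPETITION token as a
// quantifier, preferring the shorthand forms where they exist.
function quantifierOf({ min, max }) {
  if (min === 0 && max === Infinity) return '*';
  if (min === 1 && max === Infinity) return '+';
  if (min === 0 && max === 1) return '?';
  if (max === Infinity) return `{${min},}`;
  if (min === max) return `{${min}}`;
  return `{${min},${max}}`;
}

const star = quantifierOf({ min: 0, max: Infinity });    // '*'
const exact = quantifierOf({ min: 3, max: 3 });          // '{3}'
const bounded = quantifierOf({ min: 3, max: 5 });        // '{3,5}'
const openEnded = quantifierOf({ min: 3, max: Infinity }); // '{3,}'
```
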
### REFERENCE

References a group token. `value` is 1-9.

```js
{
  "type": ret.types.REFERENCE,
  "value": 1,
}
```

### CHAR

Represents a single character token. `value` is the character code. Emitting one token per character may seem more cluttered than concatenating characters together, but since repetition tokens only repeat the last token and not the last clause like the pipe, it's simpler to do it this way.

```js
{
  "type": ret.types.CHAR,
  "value": 123,
}
```

## Errors

ret.js will throw errors if given a string with an invalid regular expression. All possible errors are:

* Invalid group. When a group with an immediate `?` character is followed by an invalid character. It can only be followed by `!`, `=`, or `:`. Example: `/(?_abc)/`
* Nothing to repeat. Thrown when a repetition token is used as the first token in the current clause, that is, at the very beginning of the regexp or group, or right after a pipe. Examples: `/foo|?bar/`, `/{1,3}foo|bar/`, `/foo(+bar)/`
* Unmatched ). A group was not opened, but was closed. Example: `/hello)2u/`
* Unterminated group. A group was not closed. Example: `/(1(23)4/`
* Unterminated character class. A custom character set was not closed. Example: `/[abc/`

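When the pattern comes from user input, it can be convenient to turn these exceptions into a value. The sketch below shows one way to do so. `tryTokenize` is a hypothetical wrapper, and `fakeTokenize` is a stand-in that only exists so the snippet runs without ret; it does not reproduce ret's actual error type or message text.

```js
// Hypothetical wrapper: runs a tokenizer and reports failure as data
// instead of letting the exception propagate.
function tryTokenize(tokenize, source) {
  try {
    return { ok: true, tokens: tokenize(source) };
  } catch (err) {
    return { ok: false, message: err.message };
  }
}

// Stand-in tokenizer used only to make this sketch runnable without ret.
// It rejects a leading quantifier, loosely imitating "Nothing to repeat".
function fakeTokenize(source) {
  if (/^[*+?]/.test(source)) {
    throw new SyntaxError('Nothing to repeat');
  }
  return { stack: [] };
}

const good = tryTokenize(fakeTokenize, 'abc');  // { ok: true, tokens: { stack: [] } }
const bad = tryTokenize(fakeTokenize, '+abc');  // { ok: false, message: 'Nothing to repeat' }
```

With the package installed, passing `ret` itself as the first argument would be the equivalent call.
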
# Regular Expression Syntax

Regular expressions follow the [JavaScript syntax](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp).

The following recent JavaScript additions are not yet supported:
* `\p` and `\P`: [Unicode property escapes](https://github.com/tc39/proposal-regexp-unicode-property-escapes)
* `(?<group>)` and `\k<group>`: [Named groups](https://github.com/tc39/proposal-regexp-named-groups)
* `(?<=)` and `(?<!)`: [Lookbehind assertions](https://github.com/tc39/proposal-regexp-lookbehind)

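A caller that wants to warn about these constructs before tokenizing could scan the source text first. The following is a deliberately naive sketch under that assumption: `findUnsupported` and the `unsupported` table are hypothetical, and the checks are plain pattern matches that ignore context such as escaped backslashes or character classes.

```js
// Hypothetical, deliberately naive checks: plain pattern matches on the
// source text. They do not account for escaped backslashes or for these
// sequences appearing inside a character class.
const unsupported = [
  { name: 'Unicode property escape', pattern: /\\[pP]\{/ },
  { name: 'named group', pattern: /\(\?<[^=!]/ },
  { name: 'named backreference', pattern: /\\k</ },
  { name: 'lookbehind assertion', pattern: /\(\?<[=!]/ },
];

// Returns the names of the unsupported constructs that appear in `source`.
function findUnsupported(source) {
  return unsupported
    .filter(({ pattern }) => pattern.test(source))
    .map(({ name }) => name);
}

const clean = findUnsupported(/foo|bar/.source); // []
const flagged = findUnsupported('(?<year>\\d{4})-\\k<year>');
// flagged: ['named group', 'named backreference']
```
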
# Examples

`/abc/`

```js
{
  "type": ret.types.ROOT,
  "stack": [
    { "type": ret.types.CHAR, "value": 97 },
    { "type": ret.types.CHAR, "value": 98 },
    { "type": ret.types.CHAR, "value": 99 }
  ]
}
```

`/[abc]/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.SET,
    "set": [
      { "type": ret.types.CHAR, "value": 97 },
      { "type": ret.types.CHAR, "value": 98 },
      { "type": ret.types.CHAR, "value": 99 }
    ],
    "not": false
  }]
}
```

`/[^abc]/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.SET,
    "set": [
      { "type": ret.types.CHAR, "value": 97 },
      { "type": ret.types.CHAR, "value": 98 },
      { "type": ret.types.CHAR, "value": 99 }
    ],
    "not": true
  }]
}
```

`/[a-z]/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.SET,
    "set": [
      { "type": ret.types.RANGE, "from": 97, "to": 122 }
    ],
    "not": false
  }]
}
```

`/\w/`

```js
// Similar logic for `\W`, `\d`, `\D`, `\s` and `\S`
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.SET,
    "set": [
      { "type": ret.types.CHAR, "value": 95 },
      { "type": ret.types.RANGE, "from": 97, "to": 122 },
      { "type": ret.types.RANGE, "from": 65, "to": 90 },
      { "type": ret.types.RANGE, "from": 48, "to": 57 }
    ],
    "not": false
  }]
}
```

`/./`

```js
// any character but CR, LF, U+2028 or U+2029
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.SET,
    "set": [
      { "type": ret.types.CHAR, "value": 10 },
      { "type": ret.types.CHAR, "value": 13 },
      { "type": ret.types.CHAR, "value": 8232 },
      { "type": ret.types.CHAR, "value": 8233 }
    ],
    "not": true
  }]
}
```

`/a*/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.REPETITION,
    "min": 0,
    "max": Infinity,
    "value": { "type": ret.types.CHAR, "value": 97 }
  }]
}
```

`/a+/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.REPETITION,
    "min": 1,
    "max": Infinity,
    "value": { "type": ret.types.CHAR, "value": 97 }
  }]
}
```

`/a?/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.REPETITION,
    "min": 0,
    "max": 1,
    "value": { "type": ret.types.CHAR, "value": 97 }
  }]
}
```

`/a{3}/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.REPETITION,
    "min": 3,
    "max": 3,
    "value": { "type": ret.types.CHAR, "value": 97 }
  }]
}
```

`/a{3,5}/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.REPETITION,
    "min": 3,
    "max": 5,
    "value": { "type": ret.types.CHAR, "value": 97 }
  }]
}
```

`/a{3,}/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.REPETITION,
    "min": 3,
    "max": Infinity,
    "value": { "type": ret.types.CHAR, "value": 97 }
  }]
}
```

`/(a)/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.GROUP,
    "stack": [{ "type": ret.types.CHAR, "value": 97 }],
    "remember": true
  }]
}
```

`/(?:a)/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.GROUP,
    "stack": [{ "type": ret.types.CHAR, "value": 97 }],
    "remember": false
  }]
}
```

`/(?=a)/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.GROUP,
    "stack": [{ "type": ret.types.CHAR, "value": 97 }],
    "remember": false,
    "followedBy": true
  }]
}
```

`/(?!a)/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.GROUP,
    "stack": [{ "type": ret.types.CHAR, "value": 97 }],
    "remember": false,
    "notFollowedBy": true
  }]
}
```

`/a|b/`

```js
{
  "type": ret.types.ROOT,
  "options": [
    [{ "type": ret.types.CHAR, "value": 97 }],
    [{ "type": ret.types.CHAR, "value": 98 }]
  ]
}
```

`/(a|b)/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.GROUP,
    "remember": true,
    "options": [
      [{ "type": ret.types.CHAR, "value": 97 }],
      [{ "type": ret.types.CHAR, "value": 98 }]
    ]
  }]
}
```

`/^/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.POSITION,
    "value": "^"
  }]
}
```

`/$/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.POSITION,
    "value": "$"
  }]
}
```

`/\b/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.POSITION,
    "value": "b"
  }]
}
```

`/\B/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.POSITION,
    "value": "B"
  }]
}
```

`/\1/`

```js
{
  "type": ret.types.ROOT,
  "stack": [{
    "type": ret.types.REFERENCE,
    "value": 1
  }]
}
```

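The examples above all share one nesting scheme: child tokens live under `stack`, `options`, `set`, or (for repetitions) `value`. A depth-first traversal over that scheme might look like the sketch below. `walk` is a hypothetical helper, not a ret export, and `types` is a local stand-in for `ret.types`.

```js
// Local stand-in for ret.types; the numeric values are illustrative only.
const types = { ROOT: 0, GROUP: 1, SET: 3, REPETITION: 5, CHAR: 7 };

// Hypothetical visitor: calls `visit` for `token` and every token nested in it.
function walk(token, visit) {
  visit(token);
  if (token.stack) token.stack.forEach((child) => walk(child, visit));
  if (token.options) {
    token.options.forEach((alt) => alt.forEach((child) => walk(child, visit)));
  }
  if (token.set) token.set.forEach((child) => walk(child, visit));
  // For REPETITION tokens, `value` holds the repeated token (an object);
  // for CHAR and other leaf tokens it is a number or a string.
  if (token.type === types.REPETITION) walk(token.value, visit);
}

// Shape of the tree for /(a|b)*/, assembled from the examples above.
const tree = {
  type: types.ROOT,
  stack: [{
    type: types.REPETITION,
    min: 0,
    max: Infinity,
    value: {
      type: types.GROUP,
      remember: true,
      options: [
        [{ type: types.CHAR, value: 97 }],
        [{ type: types.CHAR, value: 98 }],
      ],
    },
  }],
};

let total = 0;
let charCount = 0;
walk(tree, (token) => {
  total += 1;
  if (token.type === types.CHAR) charCount += 1;
});
// total is 5 (ROOT, REPETITION, GROUP, and two CHAR tokens); charCount is 2.
```
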
# Install

    npm install ret


# Tests

Tests are written with [vows](http://vowsjs.org/).

```bash
npm test
```

# Security

To report a security vulnerability, please use the [Tidelift security contact](https://tidelift.com/security). Tidelift will coordinate the fix and disclosure.