UNPKG

re-build

Version:

Building regular expressions with natural language

220 lines (130 loc) 9.1 kB
RE-Build reference ================== # `RegExp` builders The object obtained from building a regular expressions *builders*. Builders are augmented with members and methods to build the regex further, but they're basically immutable objects as every call to extend the builder returns a *new* builder instance. ## Properties All the following properties are read-only. Type | Name | Description -------:|--------------|------------- string | `regex` | The regular expression defined by the builder. It's compiled the first time the property is requested, then cached string | `source` | The source of the underlying regular expression. Used to compile it string | `flags` | A string comprising the regex' flags. It may include one or more of the letters `"g"`, `"m"`, `"i"`, `"u"` or `"y"` boolean | `global` | The regex' `global` flag boolean | `ignoreCase` | The regex' `ignoreCase` flag boolean | `multiline` | The regex' `multiline` flag boolean | `unicode` | The regex' `unicode` flag boolean | `sticky` | The regex' `sticky` flag ## Methods Returns | Name | Description --------:|------------------|------------------------- `RegExp` | `toRegExp()` | Basically, returns the `regex` property `RegExp` | `valueOf()` | See above string | `toString()` | Returns a string representation boolean | `test(string)` | Uses the underlying regex to test a string. Short for `.regex.test(...)` array | `exec(string)` | Executes the underlying regex on a string. Short for `.regex.exec(...)` string | `replace(string, string/function)` | Uses the underlying regex to perform a regex-based replacement. Short for `string.replace(regex, ...)` array | `split(string)` | Uses the underlying regex to perform a regex-based string split. Short for `string.split(regex)` number | `search(string)` | Uses the underlying regex to perform a string search. Short for `string.search(regex)` # Building a regex Regex building begins from the he `RE` object returned by the module. You can obtain a *builder* every time you use "words" like `digit`, `then` and such. Some of these words act like functions (like `atLeast` and `codePoint`), some like properties (like `digit` and `theEnd`), some work as both. In this last case, if the word is not used as a function, additional words are expected to obtain a builder: ```js var foo = RE.matching.digit.then.alphaNumeric; ``` Many words that can (or must) be used as functions accept a variable number of arguments, that can be either strings, or regular expressions, or builders, which are all appended to the source. Strings are backslash-escaped, while in the other cases the `source` property is then added *unescaped*: ```js var amount = RE.oneOrMore.digit.then(".").then.digit.then.digit, currency = /[$€£]/; var builder = RE.matching.theStart .then("Total: ", amount, currency) .then.theEnd; ``` Other words that work as functions only usually accept other types of arguments. ## Flags The flags of a builder (and its underlying regular expression) can be set using words starting from the `RE` object. After one of these words, another flag word or `matching` must follow, with the exception of `withFlags` that must be followed by `matching` only. * **`globally`** Set the `global` flag on. * **`anyCase`** Set the `ignoreCase` flag on. * **`fullText`** Set the `multiline` flag on. * **`stickily`** Set the `sticky` flag on. * **`withUnicode`** Set the `unicode` flag on. * **`withFlags(flags)`** Set multiple flags. `flags` is expected to be a string containing letters in the set `"g"`, `"m"`, `"i"` and `"y"`. ## Conjunctions Conjunctions append additional blocks to the current source. They can follow any open or set block. * **`then`** Appends a block to the current source. * **`or`** Adds an alternative block (prefixed by the pipe `|` character in regular expressions). ## Open and set blocks These words can be used in both "open" sequences or inside character sets. They can be used after conjunction words, or a quantifier, or the `matching` word, or the `RE` object itself, or the `and` word joining blocks in character sets. * **`digit` / `not.digit`** A digit character (`\d`) or its negation (`\D`). * **`alphaNumeric` / `not.alphaNumeric`** An alphanumeric character plus the undescore (`\w`) or its negation (`\W`). * **`whiteSpace` / `not.whisteSpace`** A whitespace (`\s`) or its negation (`\W`). * **`cReturn`** `\r` * **`newLine`** `\n` * **`tab`** `\t` * **`vTab`** `\v` * **`formFeed`** `\f` * **`null`** `\0` * **`slash`** `\/` * **`backslash`** `\\` * **`ascii(code)`** An ASCII escape sequence (`\xhh`). `code` must be an integer between 0 and 255. It it then converted as two hexadecimal digits in the sequence. * **`codePoint(code, ...)`** An Unicode escape sequence (`\uhhhh`, or `\u{hhhhh}` with the `unicode` flag set and with a code not from the [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode))). `code` must be an integer between 0 and 1114111 (`0x10ffff`) or a `RangeError` will be thrown; or it can be a string, whose code points will be converted in the corresponding Unicode escape sequence. Keep in mind that code points from astral planes, when the `unicode` flag is *not* set, are encoded in the corresponding surrogate code point pairs (e.g.: `"🍰"` will become `"\ud83c\udf70"`): *it is your duty* to wrap the pairs in a group if needed or, when it's not possible (for example, in a character range) using an adequate regex structure. * **`control(letter)`** A control sequence (`\cx`). `letter` must be a string of a single letter. It is then converted to uppercase in the sequence. ## Open-only blocks These words can be used in open block sequences only (which means, not inside character sets). They can be used after conjunction words, or a quantifier, or the `matching` word, or the `RE` object itself. * **`anyChar`** The universal character (`.`). * **`theStart` / `theEnd`** The string-start and string-end boundaries (`^` and `$`, respectively). * **`wordBoundary` / `not.wordBoundary`** A word boundary (`\b`) or its negation (`\B`). * **`oneOf` / `not.oneOf`** Appends a character set (`[...]` or `[^...]`, respectively). See the paragraph about [character sets](#character-sets). * **`group(...)`** Non-capturing group - `(?:...)`. Used as functions only. Arguments can be strings, regexes or builders. * **`capture(...)`** Capturing group - `(...)`. Used as functions only. Arguments can be strings, regexes or builders. * **`reference(number)`** Group backreference (`\number`). `number` should be a positive integer. ## Character sets Character sets are introduced by the `oneOf` word, and may include one or more blocks separated by the `and` word (e.g.: `RE.oneOf.digit.and("abcdef")`). These words can be used in character sets only: * **`range(start, end)`** Adds a character interval into the character set (`[...start-end...]`). `start` and `end` are supposed to be strings of single characters defining the boundaries of the character range; or they can be builders that define one single character, or character class usable in character ranges (which include: `ascii`, `unicode`, `control`, `newLine`, `cReturn`, `tab`, `vTab`, `formFeed`, `null`). * **`backspace`** The backspace character, `\b` (U+0008). Not to be confused with the word boundary, which can be used as an "open" block only. ## Quantifiers Quantifiers can follow conjunction words, or the `matching` word, or the `RE` object itself, and can precede any "open" block, with the exception of `wordBoundary`, `not.wordBoundary`, `theStart` and `theEnd`. They can be prefixed by `lazily` to define a lazy quantifier, instead of a greedy one. Quantifiers can be used as functions, and accept strings, regexes or builders as arguments. A convenient group wrap will be used if necessary: ```js var foo = RE.oneOrMore("a"); // /a+/ var bar = RE.oneOrMore("abc"); // /(?:abc)+/ ``` * **`anyAmountOf`** `*` * **`oneOrMore`** `+` * **`noneOrOne`** `?` * **`atLeast(n)`** `n` must be a non-negative integer. If `n` is 0, a `*` is produced; if `n` is 1, then `+` is produced; else, the quantifier is `{n,}`. * **`atMost(n)`** `n` must be a non-negative integer. If `n` is 1, then `?` is produced; else, the quantifier is `{,n}`. * **`exactly(n)`** `n` must be a non-negative integer. If `n` is 1, then no quantifier is defined; else, the quantifier is `{n}`. * **`between(n, m)`** `n` and `m` must be non-negative integers. If the the values are adequate, the produced quantifier can be one of the above; otherwise, the quantifier is `{n,m}`. ## Look-aheads * **`followedBy(...)` / `not.followedBy(...)`** Appends a look-ahead (`(?=...)` or `(?!...)`, respectively). Used as functions only. Arguments can be strings, regexes or builders. Can follow any open block, or the `matching` word, or the `RE` object itself, or the `or` conjunction.