Building regular expressions with natural language
RE-Build reference
==================
# `RegExp` builders
The objects obtained from building regular expressions are called *builders*. Builders are augmented with members and methods to extend the regex further, but they are effectively immutable: every call that extends a builder returns a *new* builder instance.
## Properties
All the following properties are read-only.
Type | Name | Description
-------:|--------------|-------------
`RegExp` | `regex` | The regular expression defined by the builder. It is compiled the first time the property is requested, then cached
string | `source` | The source of the underlying regular expression. Used to compile it
string | `flags` | A string comprising the regex' flags. It may include one or more of the letters `"g"`, `"m"`, `"i"`, `"u"` or `"y"`
boolean | `global` | The regex' `global` flag
boolean | `ignoreCase` | The regex' `ignoreCase` flag
boolean | `multiline` | The regex' `multiline` flag
boolean | `unicode` | The regex' `unicode` flag
boolean | `sticky` | The regex' `sticky` flag
## Methods
Returns | Name | Description
--------:|------------------|-------------------------
`RegExp` | `toRegExp()` | Basically, returns the `regex` property
`RegExp` | `valueOf()` | See above
string | `toString()` | Returns a string representation
boolean | `test(string)` | Uses the underlying regex to test a string. Short for `.regex.test(...)`
array | `exec(string)` | Executes the underlying regex on a string. Short for `.regex.exec(...)`
string | `replace(string, string/function)` | Uses the underlying regex to perform a regex-based replacement. Short for `string.replace(regex, ...)`
array | `split(string)` | Uses the underlying regex to perform a regex-based string split. Short for `string.split(regex)`
number | `search(string)` | Uses the underlying regex to perform a string search. Short for `string.search(regex)`
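
The behaviour described in the two tables above (immutable builders, a lazily compiled and cached `regex`, and shorthand methods) can be pictured with a minimal sketch. `MiniBuilder` is a hypothetical name invented for this illustration and is not RE-Build's implementation; in particular, its `then` takes a raw regex source and does no escaping.

```js
// Hypothetical sketch, not RE-Build's actual code.
function MiniBuilder(source, flags) {
    var cached = null; // compiled on first access, then reused

    return Object.freeze({
        source: source,
        flags: flags || "",
        get regex() {
            if (!cached) cached = new RegExp(source, flags || "");
            return cached;
        },
        // Extending never mutates: it returns a new builder.
        then: function (more) { return MiniBuilder(source + more, flags); },
        // Shorthands that delegate to the underlying regex.
        test: function (string) { return this.regex.test(string); },
        replace: function (string, sub) { return string.replace(this.regex, sub); },
        split: function (string) { return string.split(this.regex); },
        search: function (string) { return string.search(this.regex); }
    });
}

var digits = MiniBuilder("\\d+");
var price = digits.then("\\.\\d\\d"); // `digits` itself is left untouched
```

Here `digits.source` is still `\d+` after `price` has been created, and reading `digits.regex` twice yields the same `RegExp` object.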
# Building a regex
Regex building begins from the `RE` object returned by the module. You obtain a *builder* every time you use "words" like `digit`, `then` and such. Some of these words act like functions (like `atLeast` and `codePoint`), some like properties (like `digit` and `theEnd`), and some work as both.
In this last case, if the word is not used as a function, additional words are expected to obtain a builder:
```js
var foo = RE.matching.digit.then.alphaNumeric;
```
Many words that can (or must) be used as functions accept a variable number of arguments, each of which can be a string, a regular expression or a builder; all of them are appended to the source. Strings are backslash-escaped, while for regular expressions and builders the `source` property is added *unescaped*:
```js
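// amount is a builder, currency is a native regex
var amount = RE.oneOrMore.digit.then(".").then.digit.then.digit,
    currency = /[$€£]/;

// "Total: " is escaped as a literal string; amount and currency
// contribute their `source` unescaped
var builder = RE.matching.theStart
    .then("Total: ", amount, currency)
    .then.theEnd;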
```
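
The escaping rule above can be pictured with a small sketch. The helpers `escapeLiteral` and `appendSources` are hypothetical names made up for this illustration, and the exact set of escaped characters is an assumption; RE-Build's own escaping may differ.

```js
// Hypothetical helpers, not part of RE-Build's API.
// Backslash-escape the characters that have a special meaning in a regex.
function escapeLiteral(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Strings are escaped; regexes (or anything else exposing `source`)
// contribute their source untouched.
function appendSources() {
    return Array.prototype.map.call(arguments, function (part) {
        return typeof part === "string" ? escapeLiteral(part) : part.source;
    }).join("");
}

var literal = appendSources("1.5");                           // the dot gets escaped
var source = appendSources("Total: ", /\d+\.\d\d/, /[$€£]/); // regex sources are kept as-is
var total = new RegExp("^" + source + "$");
```

With these definitions, `total` matches `"Total: 12.50€"`, while the regex built from `literal` matches `"1.5"` but not `"1x5"`.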
Words that work as functions only usually accept other types of arguments.
## Flags
The flags of a builder (and its underlying regular expression) can be set using words starting from the `RE` object. After one of these words, another flag word or `matching` must follow, with the exception of `withFlags` that must be followed by `matching` only.
* **`globally`**
Set the `global` flag on.
* **`anyCase`**
Set the `ignoreCase` flag on.
* **`fullText`**
Set the `multiline` flag on.
* **`stickily`**
Set the `sticky` flag on.
* **`withUnicode`**
Set the `unicode` flag on.
* **`withFlags(flags)`**
Set multiple flags. `flags` is expected to be a string containing letters in the set `"g"`, `"m"`, `"i"` and `"y"`.
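
For reference, this is how the corresponding flags behave on a native `RegExp`. The snippet uses plain JavaScript rather than RE-Build, so it only illustrates the flag letters and properties that the words above are described as switching on.

```js
// Native regex carrying the flags that `globally`, `anyCase`,
// `fullText` and `stickily` are described as setting.
var re = new RegExp("total", "gimy");

var flagState = {
    flags: re.flags,           // flag letters in canonical order
    global: re.global,
    ignoreCase: re.ignoreCase,
    multiline: re.multiline,
    sticky: re.sticky,
    unicode: re.unicode        // not requested, so false
};

// The multiline flag lets ^ and $ match at line breaks.
var withMultiline = /^b$/m.test("a\nb");   // true
var withoutMultiline = /^b$/.test("a\nb"); // false
```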
## Conjunctions
Conjunctions append additional blocks to the current source. They can follow any open or set block.
* **`then`**
Appends a block to the current source.
* **`or`**
Adds an alternative block (prefixed by the pipe `|` character in regular expressions).
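
Keep in mind that in regular expressions the pipe has the lowest precedence, so an alternative extends to the boundaries of the enclosing group (or of the whole pattern). The hand-written native regexes below illustrate the difference; with RE-Build, the `group(...)` word described later would presumably be the way to confine an alternative.

```js
// Native regexes written by hand to show how `|` splits a pattern.
var loose = /^cat|dog$/;      // "starts with cat" OR "ends with dog"
var strict = /^(?:cat|dog)$/; // exactly "cat" or exactly "dog"

var looseMatchesCatalog = loose.test("catalog");   // true
var strictMatchesCatalog = strict.test("catalog"); // false
var strictMatchesDog = strict.test("dog");         // true
```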
## Open and set blocks
These words can be used both in "open" sequences and inside character sets. They can follow a conjunction word, a quantifier, the `matching` word, the `RE` object itself, or the `and` word joining blocks in character sets.
* **`digit` / `not.digit`**
A digit character (`\d`) or its negation (`\D`).
* **`alphaNumeric` / `not.alphaNumeric`**
An alphanumeric character or the underscore (`\w`), or its negation (`\W`).
* **`whiteSpace` / `not.whiteSpace`**
A whitespace character (`\s`) or its negation (`\S`).
* **`cReturn`** `\r`
* **`newLine`** `\n`
* **`tab`** `\t`
* **`vTab`** `\v`
* **`formFeed`** `\f`
* **`null`** `\0`
* **`slash`** `\/`
* **`backslash`** `\\`
* **`ascii(code)`**
An ASCII escape sequence (`\xhh`). `code` must be an integer between 0 and 255. It is then converted to two hexadecimal digits in the sequence.
* **`codePoint(code, ...)`**
A Unicode escape sequence (`\uhhhh`, or `\u{hhhhh}` when the `unicode` flag is set and the code is not from the [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode))). `code` must be an integer between 0 and 1114111 (`0x10ffff`), or a `RangeError` will be thrown; alternatively it can be a string, whose code points are converted into the corresponding Unicode escape sequences. Keep in mind that code points from astral planes, when the `unicode` flag is *not* set, are encoded as the corresponding surrogate pairs (e.g.: `"🍰"` becomes `"\ud83c\udf70"`): *it is your duty* to wrap the pairs in a group if needed or, when that is not possible (for example, in a character range), to use an adequate regex structure.
* **`control(letter)`**
A control sequence (`\cx`). `letter` must be a string of a single letter. It is then converted to uppercase in the sequence.
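
The caveat about surrogate pairs under `codePoint` can be seen with native regexes alone. The patterns below are written by hand and do not call RE-Build; they only show why a surrogate pair needs a group when the `unicode` flag is not set.

```js
var cake = String.fromCodePoint(0x1f370); // "🍰", two UTF-16 code units
var twoCakes = cake + cake;

// Without the unicode flag the quantifier binds to the low surrogate only.
var ungrouped = /^\ud83c\udf70+$/;
// Wrapping the pair in a group makes the quantifier cover the whole character.
var grouped = /^(?:\ud83c\udf70)+$/;
// With the unicode flag a single escape denotes the whole code point.
var withUnicode = /^\u{1f370}+$/u;

var results = [
    ungrouped.test(twoCakes),   // false
    grouped.test(twoCakes),     // true
    withUnicode.test(twoCakes)  // true
];
```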
## Open-only blocks
These words can be used in open block sequences only (which means, not inside character sets). They can be used after conjunction words, or a quantifier, or the `matching` word, or the `RE` object itself.
* **`anyChar`**
The universal character (`.`).
* **`theStart` / `theEnd`**
The string-start and string-end boundaries (`^` and `$`, respectively).
* **`wordBoundary` / `not.wordBoundary`**
A word boundary (`\b`) or its negation (`\B`).
* **`oneOf` / `not.oneOf`**
Appends a character set (`[...]` or `[^...]`, respectively). See the paragraph about [character sets](#character-sets).
* **`group(...)`**
Non-capturing group - `(?:...)`. Can be used as a function only. Arguments can be strings, regexes or builders.
* **`capture(...)`**
Capturing group - `(...)`. Can be used as a function only. Arguments can be strings, regexes or builders.
* **`reference(number)`**
Group backreference (`\number`). `number` should be a positive integer.
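
As a reminder of what the group and backreference constructs above do, here are the equivalent native patterns. They are written by hand for illustration and do not use RE-Build.

```js
// A capturing group followed by a backreference to it:
// matches any word character that appears twice in a row.
var doubled = /(\w)\1/;
var doubledMatch = doubled.exec("letter"); // ["tt", "t"]

// A non-capturing group only groups; it does not produce an entry
// in the match array and does not shift backreference numbers.
var repeated = /(?:ab)+(c)/;
var repeatedMatch = repeated.exec("ababc"); // ["ababc", "c"]
```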
## Character sets
Character sets are introduced by the `oneOf` word, and may include one or more blocks separated by the `and` word (e.g.: `RE.oneOf.digit.and("abcdef")`).
These words can be used in character sets only:
* **`range(start, end)`**
Adds a character interval into the character set (`[...start-end...]`). `start` and `end` should be single-character strings defining the boundaries of the character range; or they can be builders that define one single character, or a character class usable in character ranges (which include: `ascii`, `codePoint`, `control`, `newLine`, `cReturn`, `tab`, `vTab`, `formFeed`, `null`).
* **`backspace`**
The backspace character, `\b` (U+0008). Not to be confused with the word boundary, which can be used as an "open" block only.
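
The native patterns below show the constructs this section refers to. They are hand-written equivalents, not RE-Build output; in particular, the first set is only assumed to be equivalent to what `RE.oneOf.digit.and("abcdef")` builds.

```js
// A set mixing a character class with literal characters.
var hexDigits = /^[\dabcdef]+$/;
// A range whose boundaries are escape sequences.
var controlChars = /[\x00-\x1f]/;
// Inside a set, \b is the backspace character (U+0008)...
var backspaceInSet = /[\b]/;
// ...while outside a set it is a word boundary.
var wholeWord = /\bcat\b/;

var checks = {
    hex: hexDigits.test("3fa9"),                // true
    notHex: hexDigits.test("3g"),               // false
    tabIsControl: controlChars.test("\t"),      // true
    backspace: backspaceInSet.test("\u0008"),   // true
    letterB: backspaceInSet.test("b"),          // false
    boundary: wholeWord.test("a cat sat"),      // true
    noBoundary: wholeWord.test("concat")        // false
};
```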
## Quantifiers
Quantifiers can follow conjunction words, or the `matching` word, or the `RE` object itself, and can precede any "open" block, with the exception of `wordBoundary`, `not.wordBoundary`, `theStart` and `theEnd`.
They can be prefixed by `lazily` to define a lazy quantifier, instead of a greedy one.
Quantifiers can be used as functions, and accept strings, regexes or builders as arguments. The argument is wrapped in a non-capturing group when necessary:
```js
var foo = RE.oneOrMore("a"); // /a+/
var bar = RE.oneOrMore("abc"); // /(?:abc)+/
```
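
The reason for the group wrap is that a quantifier applies only to the atom right before it, so `abc+` would repeat just the `c`. A simplified sketch of such a wrapping rule follows; `quantify` is a hypothetical helper invented here, and its notion of "single atom" (one character or one backslash escape) is a deliberate simplification, not RE-Build's actual logic.

```js
// Hypothetical helper, not RE-Build's implementation.
// A source counts as a single atom only if it is one character
// or one backslash escape; anything else gets wrapped.
function quantify(source, quantifier) {
    var isSingleAtom = /^(?:\\.|[^\\])$/.test(source);
    return (isSingleAtom ? source : "(?:" + source + ")") + quantifier;
}

var single = quantify("a", "+");     // "a+"
var wrapped = quantify("abc", "+");  // "(?:abc)+"
var escaped = quantify("\\d", "*");  // "\d*" (no wrap needed)

// Without the wrap, only the last character would be repeated.
var wrappedMatches = new RegExp("^" + wrapped + "$").test("abcabc"); // true
var unwrappedMatches = /^abc+$/.test("abcabc");                      // false
```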
* **`anyAmountOf`** `*`
* **`oneOrMore`** `+`
* **`noneOrOne`** `?`
* **`atLeast(n)`**
`n` must be a non-negative integer. If `n` is 0, a `*` is produced; if `n` is 1, then `+` is produced; else, the quantifier is `{n,}`.
* **`atMost(n)`**
`n` must be a non-negative integer. If `n` is 1, then `?` is produced; else, the quantifier is `{,n}`.
* **`exactly(n)`**
`n` must be a non-negative integer. If `n` is 1, then no quantifier is defined; else, the quantifier is `{n}`.
* **`between(n, m)`**
`n` and `m` must be non-negative integers. If the values are adequate, the produced quantifier can be one of the above; otherwise, the quantifier is `{n,m}`.
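
Two properties of native quantifiers explain the behaviour described in this section: a trailing `?` turns a greedy quantifier into a lazy one (which is what `lazily` is described as doing), and the braced forms `{0,}` and `{1,}` are equivalent to `*` and `+`. The patterns below are hand-written native regexes, shown only as an illustration.

```js
var text = "<a><b>";

// Greedy: takes as much as possible, so it spans both tags.
var greedy = /<.+>/.exec(text)[0];  // "<a><b>"
// Lazy: takes as little as possible, so it stops at the first ">".
var lazy = /<.+?>/.exec(text)[0];   // "<a>"

// {0,} behaves like *, and {1,} behaves like +.
var starAcceptsEmpty = /^a{0,}$/.test("") && /^a*$/.test("");      // true
var plusRejectsEmpty = !/^a{1,}$/.test("") && !/^a+$/.test("");    // true
var bothAcceptRun = /^a{1,}$/.test("aaa") && /^a+$/.test("aaa");   // true
```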
## Look-aheads
* **`followedBy(...)` / `not.followedBy(...)`**
Appends a look-ahead (`(?=...)` or `(?!...)`, respectively). Can be used as a function only. Arguments can be strings, regexes or builders.
Can follow any open block, or the `matching` word, or the `RE` object itself, or the `or` conjunction.
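
A look-ahead checks what comes next without making it part of the match. The hand-written native regexes below (not RE-Build output) show both the positive and the negative form.

```js
// Positive look-ahead: digits must be followed by " USD",
// but " USD" is not included in the match.
var amountInUsd = /\d+(?= USD)/;
var usdMatch = amountInUsd.exec("pay 25 USD"); // ["25"]
var eurMatch = amountInUsd.exec("pay 25 EUR"); // null

// Negative look-ahead: "foo" must NOT be followed by "bar".
var fooNotBar = /foo(?!bar)/;
var matchesFoobaz = fooNotBar.test("foobaz"); // true
var matchesFoobar = fooNotBar.test("foobar"); // false
```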