# Bloody Roots

A JavaScript engine for recursive descent parsing of context-free grammars, written in CoffeeScript.

## Usage

1. `{ Parser } = require('bloodyroots')`
2. Create a new class extending from `Parser`, defining one or more productions. The `Document` production is required, as it will be the parse engine's entry point into the grammar.
3. Instantiate the derived parser class.
4. Execute the parse method on this instance. It will return a DOM tree on successful parse, or undefined on failure.

## Terminology

Context free grammars use productions V of the form α→β, where α is a single non-terminal and β is a sequence of terminals and non-terminals; throughout the rest of this document, β refers to some tree of grammar operations, defined below. Please see http://en.wikipedia.org/wiki/Context-free_grammar for more detailed information on context-free grammars.

`vdata` refers to an object that tracks parametric information for parsing according to the β tree of a production V, typically containing the capture group information for previous regular expression matches.

## Grammar operations

It is useful to understand regular expressions, as the grammar operations in this parser correspond to regular expression operations.

-   `@alternation(β_1, β_2, ..., β_n)` or `@alternation( [β_1, β_2, ..., β_n], β_suffix)`: Equivalent to the `...|...|...` from regular expressions: given a sequence of βs, it produces a parse tree from the first `β_i` that parses. Unlike regular expressions, however, the default behavior is to commit to a particular `β_i` from the given sequence and not backtrack if what follows does not parse. So, for example, while `(aaa|aa)ab` will match the string `aaab`, the β tree

		@seq(@alternation(@re('aaa'), @re('aa')), @re('ab'))

    will not parse `aaab` because `@re('aaa')` will have matched and the alternation considered committed before checking whether `@re('ab')` matches. The regular expression-like behavior can be achieved by specifying the alternation's β sequence as an array in the first argument and the suffix β tree as the second argument:

		@alternation([@re('aaa'), @re('aa')], @re('ab'))

    Note, however, that this will result in backtracking and so is less efficient than writing a different set of productions for the same grammar that does not require backtracking.

    When `β_suffix` is specified, returns a DOM tree node of `type: 'seq'` with the output from parsing according to the matching `β_i` as the first element and the output of parsing according to `β_suffix` as the second element of an array under key `seq`; when `suffix` is not specified, returns precisely what parsing `β_i` returned.

-   `@range(β, min=0, max, greedy=true, β_suffix)`: Equivalent to `(...){n,m}` from regular expressions: it matches at least `n` and at most `m` repetitions of the given `β`. As with `@alternation`, one needs to specify a `β_suffix` to force backtracking if the range match should not commit after finding the minimum `>= min` (non-greedy) or the maximum `<= max` (greedy) number of consecutive matches. There are several shorthands based on `@range`:

    -   `@at_least_one(β, β_suffix)` is equivalent to `@range(β, 1, undefined, true, β_suffix)`: think `+` from regular expressions.
    -   `@zero_or_more(β, β_suffix)` is equivalent to `@range(β, 0, undefined, true, β_suffix)`: think `*` from regular expressions.
    -   `@zero_or_one(β, β_suffix)` is equivalent to `@range(β, 0, 1, true, β_suffix)`: think `?` from regular expressions (as applied to a terminal, not as applied to force non-greediness)

    Note that using non-greediness without a suffix always returns either `min` matches or fails to parse.

    Returns a DOM tree node of `type: 'seq'` with an array of the output from each `β` parsing iteration (plus the matched `β_suffix`, if any) under key `seq`.

-   `@re(re_str, match_name)`: Matches the given regular expression. If `match_name` is specified, then the resulting match array (index `0` being the full matched string, index `n>=1` being the `n`th capture group) is assigned the given name. See `@var_re` and `@backref` for the use of this named match array.

    Returns a DOM tree node of `type: 're'` with the entire string under key `match` and the named groups (including the entire string again under index 0) as an array under key `groups`.

-   `@seq(β_1, β_2, ..., β_n)`: Parses according to the given `β_i`s in order.

    Returns a DOM tree node of `type: 'seq'` with the array of matching βs under key `seq`.

-   `@transform(f, β)`: Transforms the DOM tree resulting from parsing according to `β` using the function `f`, which takes the DOM tree, `vdata`, and the string index of the parsed substring as arguments. `this` is set to the instance of the parser.

    The transform should not modify the input parse tree, as the original may be part of a cached entry (see [Optimizations](#optimizations) for more info) and so cause unintended behavior elsewhere in the parser.

    Returns whatever the transform returns; `undefined` is considered a failure to parse.

-   `@v(α_name, argf)`: Parses the production named `α_name`, as defined by `@define_production(α_name, ...)`.

    If `argf` is specified, calls `argf.call(this, vdata)` and assigns the return value to `vdata.arg` (which is passed to the grammar operations within the β tree for this production). The purpose of this second argument is to make the receiving productions variable functions of this parameter, specifically to pass backreferences from an earlier `@re` match to another production.

    If, for instance, one wishes to parse XML without needing to know all schema-valid tags in advance, one needs to match a close tag to the tag that opened it. This can be performed with a non-greedy alternation, but a more efficient way to accomplish this is to use greediness combined with a forward-looking negative regular expression match on the open tag, such as in the BBCode parser from the test code:

		@define_production('Element',
		  @seq(
		    @re('\\[([A-Za-z]*)(?:=([^\\]]*))?\\]', 'opentag'),
		    @zero_or_more(
		      @alternation(
		        @v('Element'),
		        @v('NotSpecificCloseTagOrText', @backref('opentag[1]'))))),
		    @var_re('\\[/\\=opentag[1]\\]'))))

		@define_production('NotSpecificCloseTagOrText',
		  @transform(text, @var_re('\\[/(?!\\=arg[0]\\])[^\\]]*\\]|\\[(?:[^/][^\\]]*)?\\]|[^\\[]+')))

    In the first step of the `@seq`, `Element` searches for an open tag and assigns the name of the tag to `vdata.opentag`. In the second step it then parses zero or more `Element`s or `NotSpecificCloseTagOrText`s: the first case parses a full subelement (such as `[b]...[/b]`); the second case, however, parses any text or close tag *except* the close tag that matches `vdata.opentag[1]` (as `vdata.arg[0]` from `NotSpecificCloseTag`'s `vdata`). In that case, the `@zero_or_more` is done and the `@var_re` matches the specific close tag in the final step of the `@seq`.

    Clearly, this could all be done within the `Element` production itself, but one can imagine another use of `NotSpecificCloseTagOrText` that might motivate breaking it out into its own production.

    See `@var_re` for information on how the argument is inserted into the regular expression.

    Returns precisely what the production for the given `α_name` would return.

-   `@var_re(re_str, match_name)`: Same as `@re`, but replaces all instances of `=arg[idx]` with `vdata.arg[idx]`. Note that this regular expression cannot be precompiled and so `@var_re` is less efficient than `@re`.

## Other methods

-   `@backref(match_name[idx])`: Returns the value of `vdata.match_name[idx]`, i.e., either:
    1. The capture group with index `idx` within the match array named `match_name` by an `@re` or `@var_re` call within the β tree for this production; or
    2. When `match_name == 'arg'`, the value `vdata.arg[idx]` where `vdata.arg` was set by the second argument to `@v`.

    Currently, the only use for this operation is as the second argument to `@v`, to pass a backreference into another production.

## The DOM tree

An object tree representing the parser output. In each node, standard keys always present are `pos` and `length`, indicating respectively the starting index of and the length of the input string whose parsed output is represented by this subtree; and `type`, indicating the type of element represented by this node. Other keys for a node depend on the output of the grammar operation used to parse it.

## Suggestions

Specify `@debug = true` in your class definition to turn on debug logging for instances of that parser: this should help with development of tricky grammars.

See `test/parser.coffee` for a good example of how to implement a grammar: `BBCodeParser` is an early version of what I wrote this code for. It parses a variant of BBCode that I use in my old Perl-based forum software that is now long in the tooth and probably full of security holes. (Sssh! Don't tell anyone.)

## <a id="optimizations"></a>Optimizations

If, as a result of backtracking, a particular production is parsed multiple times at the same string position with the same `vdata`, the second and subsequent attempts will return the previously cached result instead of re-parsing the string. This caching is evident in the full debug log.

## To do

1. Clean up the code and structure as I learn Node.js better. Please provide feedback if something is not working as it should within the ecosystem: it has been difficult to piece together the "right" way to do things through the results of Google searches.

## License

The MIT License

Copyright (c) 2013 Kyle Rose

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


