node-jl
=====

([Russian translation](README-RU.md))

Utility kit for working with JSON files/streams divided by `\n` symbol.
JSON-representation of each record shouldn't contain line breaking symbol.

Install
-----

```
npm install -g jl
```

Examples
-----

We have a log file `/tmp/test.json`, that contains next records
```json
{"ts": 1416595508, "type": "click"}
{"ts": 1416478467, "type": "buy", "price": 10}
{"ts": 1416466930, "type": "click"}
{"ts": 1416622653, "type": "buy", "price": 20}
{"ts": 1416699396, "type": "click"}
{"ts": 1416624334, "type": "click"}
{"ts": 1416518859, "type": "click"}
{"ts": 1416569870, "type": "click"}
{"ts": 1416573325, "type": "click"}
{"ts": 1416682270, "type": "click"}
```

### SQL-interface

`jl-sql` utility supports query language very similar to SQL
```sh
cat /tmp/test.json \
    | jl-sql 'SELECT type, COUNT(*) AS c, SUM(price) AS sum GROUP BY type ORDER BY c NUMERIC DESC'
```
```json
{"type":"click","c":8,"sum":0}
{"type":"buy","c":2,"sum":30}
```

As you can see we receive that we expect.

Please note non-standard keyword `NUMERIC` after `ORDER BY`:
without this keyword, sorting will be by string representation, not by numerical value.

#### Constants

- `NULL`
- `TRUE`
- `FALSE`

#### Operators

##### Arithmetic

- `+`
- `-`
- `*`
- `/`
- `%`

##### Comparison

- `=`, `==`
- `!=`
- `===` - Comparison with type check
- `!==` - Negative comparison with type check
- `>`
- `<`
- `>=`
- `<=`

##### Logical

- `AND`, `&&`
- `OR`, `||`
- `!`

#### Functions

##### Functions for working with data

- `FROM_UNIXTIME(unixTimestamp)` - converts unix timestamp to Date
- `UNIX_TIMESTAMP(date)` - converts Date to unix timestamp, date should be string in `'YYYY-MM-DD hh:mm:ss'` format
- `DATE(date)` - returns date in `'YYYY-MM-DD'` format

##### Misc

- `IF(expression, ifTrue, ifFalse)`
- `COALESCE(expression1[, expression2[, ...]])`

##### Aggregating

Right now, only very basic aggregating functions supported:

 - `SUM(expression)` - counts summary of field's values. Non-numerical values will be ignored
 - `MIN(expression)` - returns minimal field's value. Non-numerical values will be ignored
 - `MAX(expression)` - returns maximal field's value. Non-numerical values will be ignored
 - `COUNT(expression)` - counts amount of elements for which comparison with `expression` will be !== null or !== undefined
 - `COUNT(*)` - counts amount of elements
 - `HLL_COUNT_DISTINCT(arg1[, args2[, ...]])` - counts amount of unique argument combinations. Uses `O(1)` of memory and `O(N)` of CPU, but returns result with 0.1% error. For more information read [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algorithm.

#### Restrictions

 - `JOIN` is not supported(and probably never will be)
 - Sorting by multiple fields is not supported
 - `LIMIT` is not supported, but could be replaced with `head` and `tail`

### Low-level interface

SQL-interface only wrapper for utility kit described down below.
By combining those utilities you can do much more than with SQL.

#### Filtering and aggregation

Let's filter only events with type `buy`
```sh
cat /tmp/test.json \
    | jl-filter 'type == "buy"'
```
```json
{"ts": 1416478467, "type": "buy", "price": 10}
{"ts": 1416622653, "type": "buy", "price": 20}
```

Add sum calculation for all `price` fields
```sh
cat /tmp/test.json \
    | jl-filter 'type == "buy"'
    | jl-sum price
```
```json
{"value":30}
```

To get just `30` you can add `jl-extract value` at the end - this command extract value of the field.

#### Sorting

Let sort log file by `ts` field
```sh
cat /tmp/test.json \
    | jl-sort ts
```
```json
{"ts": 1416466930, "type": "click"}
{"ts": 1416478467, "type": "buy", "price": 10}
{"ts": 1416518859, "type": "click"}
{"ts": 1416569870, "type": "click"}
{"ts": 1416573325, "type": "click"}
{"ts": 1416595508, "type": "click"}
{"ts": 1416622653, "type": "buy", "price": 20}
{"ts": 1416624334, "type": "click"}
{"ts": 1416682270, "type": "click"}
{"ts": 1416699396, "type": "click"}
```

The standard arguments of `sort` utility is also supported: `-r`, `-n`, `-u`, `-m`, `-s`, `-T`, `-S`

#### Modification

Let's add to each object field `date`, that contains event date in UTC format
```sh
cat /tmp/test.json \
    | jl-transform '{r.date = (new Date(r.ts * 1000)).toUTCString()}'
```
```json
{"ts":1416595508,"type":"click","date":"Fri, 21 Nov 2014 18:45:08 GMT"}
{"ts":1416478467,"type":"buy","price":10,"date":"Thu, 20 Nov 2014 10:14:27 GMT"}
{"ts":1416466930,"type":"click","date":"Thu, 20 Nov 2014 07:02:10 GMT"}
{"ts":1416622653,"type":"buy","price":20,"date":"Sat, 22 Nov 2014 02:17:33 GMT"}
{"ts":1416699396,"type":"click","date":"Sat, 22 Nov 2014 23:36:36 GMT"}
{"ts":1416624334,"type":"click","date":"Sat, 22 Nov 2014 02:45:34 GMT"}
{"ts":1416518859,"type":"click","date":"Thu, 20 Nov 2014 21:27:39 GMT"}
{"ts":1416569870,"type":"click","date":"Fri, 21 Nov 2014 11:37:50 GMT"}
{"ts":1416573325,"type":"click","date":"Fri, 21 Nov 2014 12:35:25 GMT"}
{"ts":1416682270,"type":"click","date":"Sat, 22 Nov 2014 18:51:10 GMT"}
```

Internal pipe
-----

The system pipes are inefficient because of serialization/deserialization of JSON each time when working with multiple `jl-`utilities because each utility receives and returns JSON using stdin/stdout.

To solve this problem `jl-`utilities supports internal piping which allows to multiple utilities interact with each other in the context of a single process without spending resources for parsing. To use internal pipes replace `|`  to `\|` between `jl-`commands in the command line. For example:
```sh
cat /tmp/test.json \
    | jl-filter 'type == "buy"'
    | jl-sum price
```
you can change to:
```sh
cat /tmp/test.json \
    | jl-filter 'type == "buy"'
    \| jl-sum price
```

Performance
-----

CPU consumption of all utilities except for `jl-sort` is `O(n)`, by memory is `O(1)`.

The performance of `jl-sort` completely depends on system's `sort` realization, additional CPU consumptions also `O(n)`, and by memory `O(1)`.

By using internal pipes significantly increases performance because JSON parsing is very expensive operation.
But in another hand, all operations in pipes is computing in a single thread, so with very long pipes you should keep the balance.

Utilities
-----

#### `jl-sql` - SQL

Requires single argument - SQL and has few options:

- `-T DIR` - temporary directory which will be used for sorting if it didn't use RAM. `$TMPDIR` or `/tmp` by default. Pay attention when sorting big amount of data, by having `/tmp` on the RAM disk.
- `-S BUF` - size of sorting buffer in bytes. On buffer overflow, filesystem will be used(see `-T` option)


#### `jl-sort` - sorting

Wrapper for GNU Sort, allowing to sort JSON. Uses all advantages of `sort`, also, supports sorting in filesystem, merge-sort, stable sorting, limit buffer size.

#### `jl-filter` - filtering

Alternative to `grep`, but for JSON.

#### `jl-reduce` - aggregation and grouping by key

General utility that produces welding elements into a single value for each group.
Expects 3 JS-functions arguments:

- `-i FUNC` - accumulator initialization: gets called for each group. The function executes in a group context.
- `-u FUNC` - accumulator update: the function that will be executed for each element of a group.
The function executes in a group context and receives argument `r`, containing each element of a thread.
- `-r FUNC` - received accumulator's value: will be executed at the end of group with the passed result. The function executes in a group context.

and one required argument

- `-k KEYDEF` - key which will be used for grouping, could be a function. For correct execution thread's input should be sorted by this key in any direction.

Here is simple example of using `jl-sum` utility, which counts sum for `amount` field for each user that gets identified by `uid` field:

```sh
jl-reduce -k uid -i '{this.sum = 0}' -u '{this.sum += r.amount}' -r '{return this.sum}'
```

For thread's input containing:
```json
{"uid": 1, "amount": 10}
{"uid": 1, "amount": 11}
{"uid": 2, "amount": 12}
{"uid": 3, "amount": 13}
```

Output will be:
```json
{"key":1,"value":21}
{"key":2,"value":12}
{"key":3,"value":13}
```

If `-k` argument will be missing, convolution will go over whole thread and will return:
```json
{"value":46}
```

#### `jl-sum` - summation

Predefined `jl-reduce`, which counts sum of parameter in group

#### `jl-count` - подсчёт количества

Predefined `jl-reduce`, which counts amount of elements in group

#### `jl-transform` - modification
#### `jl-extract` - field's value extraction
#### `jl-from-csv` - CSV to JSON conversation
#### `jl-plainify` - Flattens object to a single level deep k-v
