When you've got a hosepipe, what you need is a Creek!
=====================================================

Creek is a dead simple way to run performant summary analysis over time series data. Creek is written in CoffeeScript and has only been tested with node 0.2.5.

Current Status
--------------

It's early days for the project and there may be some warts; however, Creek is currently used in production at Forward to analyse over 35 million messages per day.

The command line tool and configuration interface are fairly stable; the code-level interfaces are less so.

Hello World Example
-------------------

Get the code with `npm install creek`, or clone this repo and either add creek to your path or use `./bin/creek` in place.

Then create a file somewhere called `hello.creek` containing the following...

    parser 'words'
    interface 'rest'
    track 'unique-words', aggregator: distinct.alltime
    track 'count', aggregator: count.alltime

Then run `echo 'hello world from creek' | creek hello.creek`.

To see Creek in action, run `curl "http://localhost:8080/"` from another window.

Twitter Example
---------------

As with the hello world example above, but using this config instead...

    parser 'json', seperatedBy: '\r'
    interface 'rest'

    track 'languages-seen',
      aggregator: distinct.alltime
      field: (o) -> if o.user then o.user.lang else undefined

    track 'popular-words',
      aggregator: popular.timeboxed
      field: (o) -> if o.text then o.text.toLowerCase().split(' ') else undefined
      period: 60
      precision: 5
      top: 10
      before: (v) -> if v and v.length > 4 then v else undefined

    track 'popular-urls',
      aggregator: popular.timeboxed
      field: (o) -> if o.text then o.text.split(' ') else undefined
      period: 60*30
      precision: 60
      top: 5
      before: (v) -> if v and v.indexOf('http://') is 0 then v else undefined

Then run `curl http://stream.twitter.com/1/statuses/sample.json -u USERNAME:PASSWORD | creek twitter.creek` and visit `http://localhost:8080/`.

Parsers
-------
You must choose a parser. The currently available options are...

* words - this pushes each word to the aggregator using the current timestamp
* json - expects line-separated JSON objects
* chunked - this is a generic parser which can chunk a stream based on any string or regex
* zeromq - allows subscription to a zeromq channel rather than input on stdin

Currently all the parsers expect UTF-8 or a subset thereof.

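
For example, a config that counts chunks from a newline-delimited stream might look like the sketch below. Note this is only a sketch: the option name is an assumption, borrowing `seperatedBy` from the json example above, which the chunked parser may or may not share, and it assumes a regex is accepted as a separator.

    parser 'chunked', seperatedBy: /\r?\n/
    interface 'rest'

    track 'chunk-count', aggregator: count.alltime
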
Interfaces
----------
There is currently only one interface available...

* rest - this makes a JSON REST API available, running on localhost:8080 by default.

Aggregators
-----------
The currently available aggregators are...

* count.alltime
* count.timeboxed
* distinct.alltime
* distinct.timeboxed
* max.alltime
* max.timeboxed
* mean.alltime
* mean.timeboxed
* min.alltime
* min.timeboxed
* popular.timeboxed
* recent.limited
* sum.alltime
* sum.timeboxed

All aggregators support the `field` and `before` options, and timeboxed ones also support `period` and `precision` settings.

* field - defaults to the whole object. It can be a string, in which case that key on the object is used; an integer, in which case it is used as a constant value; or a function which takes an object and returns a value to be used.
* before - a function that takes each chunk's field right before it is pushed to the aggregator. Return the original value, a modified value, or undefined if you would like the chunk to be skipped.
* period - the period in seconds over which you would like the timeboxed aggregation to run. A value of 60 will keep track of the value over the last rolling 1 minute window. The default value is 60 seconds.
* precision - the accuracy of the rolling time window, specified in seconds. In general, the lower the value used here, the more memory will be required. The default value is 1 second.
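
As an illustration of these options, the hypothetical tracks below show the three forms `field` can take, plus a `before` filter (the `response_time` key and the threshold are made up for this sketch):

    track 'average-response-time',
      aggregator: mean.timeboxed
      field: 'response_time'
      period: 60*5
      precision: 10

    track 'slow-requests',
      aggregator: count.alltime
      field: (o) -> o.response_time
      before: (v) -> if v and v > 1000 then v else undefined

    track 'messages', aggregator: count.alltime, field: 1

The first uses `field` as a string key, the second as a function whose result is filtered by `before`, and the third as a constant value.
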

Notes
-----
Please note that the current aggregator implementations are fairly immature and may not be optimal in terms of RAM or CPU usage at this point.
What I can say, though, is that they are efficient enough for most use cases; Creek can comfortably handle dozens of aggregators running across thousands of records per second.

If anyone would like to contribute interfaces, parsers or aggregators, they would be gladly received.