When you've got a hosepipe, what you need is a Creek!
=====================================================
Creek is a dead simple way to run performant summary analysis over time series data. Creek is written in CoffeeScript and has only been tested with node 0.2.5.

Current Status
--------------
It's early days for the project and there may be some warts; however, Creek is currently used in production at Forward to analyse over 35 million messages per day.

The command line tool and configuration interface are fairly stable; the code-level interfaces are less so.

Hello World Example
-------------------

Get the code with `npm install creek`, or clone this repo and either add creek to your path or use ./bin/creek in place.

Then create a file somewhere called `hello.creek` containing the following...

    parser 'words'
    interface 'rest'
    track 'unique-words', aggregator: distinct.alltime
    track 'count', aggregator: count.alltime

Then run `echo 'hello world from creek' | creek hello.creek`

To see Creek in action run `curl "http://localhost:8080/"` from another window.
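
Creek reads from stdin, so the same config can watch any live stream. As a purely illustrative sketch (the log file path below is hypothetical), you could keep it running against a growing file:

    # /var/log/app.log is a hypothetical path; substitute any stream you like
    tail -f /var/log/app.log | creek hello.creek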

Twitter Example
---------------

As with the hello world example above, but using this config instead...

    parser 'json', seperatedBy: '\r'
    interface 'rest'

    track 'languages-seen',
      aggregator: distinct.alltime
      field: (o) -> if o.user then o.user.lang else undefined

    track 'popular-words',
      aggregator: popular.timeboxed
      field: (o) -> if o.text then o.text.toLowerCase().split(' ') else undefined
      period: 60
      precision: 5
      top: 10
      before: (v) -> if v and v.length > 4 then v else undefined

    track 'popular-urls',
      aggregator: popular.timeboxed
      field: (o) -> if o.text then o.text.split(' ') else undefined
      period: 60*30
      precision: 60
      top: 5
      before: (v) -> if v and v.indexOf('http://') is 0 then v else undefined

Then run `curl http://stream.twitter.com/1/statuses/sample.json -u USERNAME:PASSWORD | creek twitter.creek` and visit `http://localhost:8080/`.

Parsers
-------
You must choose a parser; the currently available options are...

* words - this pushes each word to the aggregator using the current timestamp
* json - expects line-separated JSON objects
* chunked - this is a generic parser which can chunk a stream based on any string or regex (see the sketch below)
* zeromq - allows subscription to a ZeroMQ channel rather than input on stdin.

Currently all the parsers expect UTF-8 or a subset thereof.
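
As a minimal sketch of the chunked parser, and assuming it accepts the same `seperatedBy` option shown for the json parser above (an assumption worth checking against the source), a config that counts blank-line-separated records might look like...

    # Assumption: chunked takes the same seperatedBy option (a string or
    # regex to split the stream on) as the json parser shown earlier.
    parser 'chunked', seperatedBy: /\n\n+/
    interface 'rest'
    track 'records', aggregator: count.alltime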

Interfaces
----------
There is currently only one interface available...

* rest - this makes a JSON REST API available, running on localhost:8080 by default.

Aggregators
-----------
The currently available aggregators are...

* count.alltime
* count.timeboxed
* distinct.alltime
* distinct.timeboxed
* max.alltime
* max.timeboxed
* mean.alltime
* mean.timeboxed
* min.alltime
* min.timeboxed
* popular.timeboxed
* recent.limited
* sum.alltime
* sum.timeboxed

All aggregators support the `field` and `before` options, and timeboxed ones also support `period` and `precision` settings (all four are shown in the sketch after this list).

* field - defaults to the whole object. It can be a string, in which case that key is read from each object; an integer, in which case it is used as a constant value; or a function which takes the object and returns the value to be used.
* before - a function that receives each chunk's field value just before it is pushed to the aggregator. Return the original value, a modified value, or undefined to skip the chunk.
* period - the period in seconds over which the timeboxed aggregation runs. A value of 60 tracks the value over a rolling one-minute window. The default is 60 seconds.
* precision - the accuracy of the rolling time window, in seconds. In general, the lower the value, the more memory is required. The default is 1 second.
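
As a minimal sketch combining all four options (the track name and the `retweet_count` field are purely illustrative, assuming JSON input like the Twitter example's), a rolling five-minute average might be configured like this...

    # Hypothetical track: rolling 5-minute mean of retweet_count,
    # accurate to 10 seconds, skipping missing or zero values.
    track 'mean-retweets',
      aggregator: mean.timeboxed
      field: 'retweet_count'                              # a string field reads this key off each object
      period: 60*5                                        # rolling window of five minutes
      precision: 10                                       # 10-second accuracy
      before: (v) -> if v and v > 0 then v else undefined # return undefined to skip a chunk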

Notes
-----
Please note that the current aggregator implementations are fairly immature and may not be optimal in terms of RAM or CPU usage at this point.
What I can say, though, is that they are efficient enough for most use cases, and Creek can comfortably handle dozens of aggregators running across thousands of records per second.

If anyone would like to contribute interfaces, parsers or aggregators, they would be gladly received.