dyngodb2 [![Stories in Ready](https://badge.waffle.io/aaaristo/dyngodb.png)](http://waffle.io/aaaristo/dyngodb)
========

An **experiment** ([alpha](http://en.wikipedia.org/wiki/Software_release_life_cycle#Alpha)) to get a [MongoDB](http://www.mongodb.org/) *like* interface in front of [DynamoDB](http://aws.amazon.com/dynamodb/)
and [CloudSearch](http://aws.amazon.com/cloudsearch/). Now supporting transactions as described by the [DynamoDB Transactions](https://github.com/awslabs/dynamodb-transactions/blob/master/DESIGN.md) protocol.

In dyngodb2 we dropped the $ sign in favor of _. Also, $version is now called _rev. The old branch is available [here](https://github.com/aaaristo/dyngodb/tree/dollar). Fixes to the old version will be released under the dyngodb npm package, while new releases are under the dyngodb2 npm package.

## Why?

DynamoDB is *elastic*, *cheap* and greatly integrated with many AWS products (e.g. [Elastic MapReduce](http://aws.amazon.com/elasticmapreduce/),
[Redshift](http://aws.amazon.com/redshift/), [Data Pipeline](http://aws.amazon.com/datapipeline/), [S3](http://aws.amazon.com/s3/)),
while MongoDB has a wonderful interface. Using node.js on [Elastic Beanstalk](http://aws.amazon.com/elasticbeanstalk/)
and DynamoDB as your backend, you could end up with a very scalable, cheap and highly available webapp architecture.
For many developers the main obstacle is being able to use DynamoDB productively, hence this project.

## Getting started
Playing around:
<pre>
$ npm install -g dyngodb2
</pre>
<pre>
$ export AWS_ACCESS_KEY_ID=......
$ export AWS_SECRET_ACCESS_KEY=......
$ export AWS_REGION=eu-west-1
$ dyngodb2
> db.createCollection('test')
> db.test.save({ name: 'John', lname: 'Smith' })
> db.test.save({ name: 'Jane', lname: 'Burden' })
> db.test.findOne({ name: 'John' })
> john= last
> john.city= 'London'
> db.test.save(john)
> db.test.find({ name: 'John' })
> db.test.ensureIndex({ name: 'S' })
> db.test.findOne({ name: 'John' })
> db.test.ensureIndex({ $search: { domain: 'mycstestdomain', lang: 'en' } }); /* some CloudSearch */
> db.test.update({ name: 'John' },{ $set: { city: 'Boston' } });
> db.test.find({ $search: { q: 'Boston' } });
> db.test.findOne({ name: 'Jane' }) /* some graphs */
> jane= last
> jane.husband= john
> john.wife= jane
> john.himself= john
> db.test.save(john);
> db.test.save(jane);
> db.ensureTransactionTable(/*name*/) /* some transactions :) */
> db.transaction()
> tx.test.save({ name: 'i\'ll be rolled back :( ' })
> tx.rollback(); /* your index is rolled back too */
> db.transaction()
> tx.test.save({ name: 'i\'ll be committed together with something else' })
> tx.test.save({ name: 'something else' })
> tx.commit(); /* your index is committed too */
> db.test.remove()
> db.test.drop()
</pre>

## Goals

* support a MongoDB *like* query language

* support slice and dice, Amazon EMR and be friendly to tools that integrate with DynamoDB
  (so no compression of JSON objects for storage)

* support graphs, and respect object identity

* prevent lost-updates

* support transactions ([DynamoDB Transactions](https://github.com/awslabs/dynamodb-transactions))

* support fulltext search

## What dyngodb actually does

* Basic find() support (basic operators, no $all, $and, $or..., some projection capabilities):
  finds are implemented via 3 components:

  * parser: parses the query and produces a "query" object that is
    used to track the state of the query from its beginning to the end.

  * finder: tries to retrieve as little data as possible from DynamoDB in the fastest way

  * refiner: "completes" the query, doing all the operations that the finder was not able
    to perform (for lack of support in DynamoDB or because I simply
    haven't found a better way).
* Basic save() support: DynamoDB does not support sub-documents. So the approach here is to
  save sub-documents as documents of the table and link them to the parent object like this:

  <pre>
  db.test.save({ name: 'John', wife: { name: 'Jane' } })
  => 2 items inserted into the test table
  1: { _id: '50fb5b63-8061-4ccf-bbad-a77660101faa',
       name: 'John',
       __wife: '028e84d0-31a9-4f4c-abb6-c6177d85a7ff' }
  2: { _id: '028e84d0-31a9-4f4c-abb6-c6177d85a7ff',
       name: 'Jane' }
  </pre>

  where _id is the HASH of the DynamoDB table. This enables us to respect the javascript object
  identity as it was in memory, and you will get the same structure - even if it were a circular graph -
  (actually with some addons: _id, _rev...) when you query the data out:

  db.test.find({ name: 'John' }) => will SCAN for name: 'John', return the first object, detect __wife
  (__ for an object, ___ for an [array](#arrays)) and get (getItem) the second object. Those meta-attributes are kept
  in the result for later use in save().

* Basic update() support: $set, $unset (should add $push and $pull)

* Basic lost update prevention

### Finders

There are currently 3 types of finders (used in this order):

* Simple: manages _id queries, i.e. the ones where the user specifies the HASH of the DynamoDB table

* Indexed: tries to find an index that is able to find hashes for that query

* Scan: falls back to [Scan](http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html)ning
  the table :(, which you should try to avoid, probably by indexing fields
  or changing some design decisions.

### Indexing

Indexes in dyngodb are DynamoDB tables that have a different KeySchema and contain the data needed
to look up items based on some attributes. This means that typically an index will be used with a
[Query](http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html) operation.

There are currently 2 indexes (4, but only 2 are used):

* [fat](http://dictionary.reference.com/browse/fat).js: as the name suggests it is a pretty heavy
  "general purpose" index that will generate many additional writes: 1 for every field indexed + 1.
  Let's see an example:

  Suppose you have a table like this:
  <pre>
  { type: 'person', category: 'hipster', name: 'Jane', company: 'Acme' }
  { type: 'person', category: 'hacker', name: 'Jack', city: 'Boston' }
  { type: 'person', category: 'hustler', name: 'John', country: 'US' }
  { type: 'company', category: 'enterprise', name: 'Amazon', phone: '13242343' }
  { type: 'company', category: 'hipster', name: 'Plurimedia' }
  </pre>

  And an index like:
  <pre>
  db.test.ensureIndex({ type: 'S', category: 'S', name: 'S' });
  </pre>

  The index will be used in queries like:
  <pre>
  db.test.find({ type: 'person' }).sort({ name: -1 })
  db.test.find({ type: 'person', category: 'hipster' })
  </pre>

  and will NOT be used in queries like these:
  <pre>
  db.test.find({ name: 'Jane' })
  db.test.find({ category: 'hipster', name: 'Jane' })
  db.test.find().sort({ name: -1 })
  </pre>

  and will be used partially (filter on type only) for this query:
  <pre>
  db.test.find({ type: 'person', name: 'Jane' })
  </pre>

  So columns are ordered in the index: you can only use it starting from the first column,
  adding the following ones in the order you defined them in ensureIndex(), each matched with an EQ
  operator. The query (finder) will use the index up to the first non-EQ operator, and the refiner
  will filter/sort the rest. Local secondary indexes are created on all indexed attributes
  to support non-EQ operators, which means that currently you can index at most 5 attributes with
  this kind of index.

  There is also some support for the [$all](http://docs.mongodb.org/manual/reference/operator/query/all/) operator:
  the fat index can index set fields, actually saving the combinations in the index so that you can query them.
  ***NOTE***: if you have long set fields this will seriously impact write throughput.

  <pre>
  db.test.save({ type: 'person', name: 'Jane', tags: ['hipster','hacker'] })
  db.test.ensureIndex({ tags: 'SS' })
  db.test.find({ tags: 'hipster' })
  db.test.find({ tags: { $all: ['hipster','hacker'] } })
  </pre>

  Inspired by [Twitter Bloodhound](https://github.com/twitter/typeahead.js/blob/master/doc/bloodhound.md) and
  [Google Code Search](http://swtch.com/~rsc/regexp/regexp4.html), I recently added a completely experimental $text
  field to the fat.js index so you can do fulltext searches without using CloudSearch, which is currently too
  expensive for small apps.

  <pre>
  db.test.save({ type: 'person', name: 'Jane', about: 'she is the mom of two great childs', tags: ['hipster','hacker'] })
  db.test.ensureIndex({ name: 'S', $text: function (item) { return _.pick(item,['name','about']); } })
  db.test.find({ $text: 'mom childs' }) // should return Jane
  db.test.find({ name: 'Jane', $text: 'mom child' }) // should return Jane, $text is chainable with other fields
  db.test.find({ name: 'John', $text: 'mom child' }) // should return an empty resultset
  </pre>

  As you can see, you create a $text function in ensureIndex() to control which fields you want the index to
  make fulltext-searchable; then you can use a $text field in the query to specify your search terms. The values
  are whitespace-tokenized and [trigrams](http://swtch.com/~rsc/regexp/regexp4.html) are created for every token,
  so that each item that has a trigram is saved as a range of a hash for that trigram. When you query, the query
  string is whitespace-tokenized and for each token trigrams are computed; then any item having all those trigrams
  is returned. The order of words in the query string is ignored. NOTE: this will not scale very well for items with
  large text to index unless you scale the writes for that index. You should balance the cost of your writes compared
  to the same usage on CloudSearch and pick what is best for you; also keep in mind that CloudSearch offers [a lot
  more](http://docs.aws.amazon.com/cloudsearch/latest/developerguide/what-is-cloudsearch.html): stemming, i18n, stoplists, results management...

* cloud-search.js: is a fulltext index using AWS CloudSearch under the covers.

  Suppose you have the same table as before,
  and an index like:
  <pre>
  db.test.ensureIndex({ $search: { domain: 'test', lang: 'en' } });
  </pre>

  You can then search the table like this:
  <pre>
  db.test.find({ $search: { q: 'Acme' } });
  db.test.find({ $search: { bq: "type:'contact'", q: 'John' } });
  </pre>


* you could probably build your own specialized indexes too: just copy the fat.js index and
  add your new your.js index at the top of the indexes array in the indexed.js finder.
  (probably we should give this as a configuration option)

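The trigram matching described for the $text field above can be sketched as follows (an illustrative model with assumed helper names, not fat.js internals): tokenize on whitespace, lowercase, and emit every 3-character window of each token.

```javascript
// Build the set of trigrams for a piece of text.
function trigrams(text) {
  const out = new Set();
  for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    for (let i = 0; i + 3 <= token.length; i++) out.add(token.slice(i, i + 3));
  }
  return out;
}

// An item matches when it contains every trigram of the query,
// regardless of word order (as the text above explains).
function matches(itemText, query) {
  const item = trigrams(itemText);
  return [...trigrams(query)].every(t => item.has(t));
}

matches('she is the mom of two great childs', 'mom childs'); // → true
matches('she is the mom of two great childs', 'mom dad');    // → false
```

Note that tokens shorter than 3 characters produce no trigrams in this sketch, so they do not constrain the match; the real index may handle short tokens differently.
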
### Lost update prevention

Suppose you have two sessions going on.

Session 1 connects and reads John:
<pre>
$ dyngodb2
> db.test.find({ name: 'John' })
</pre>

Session 2 connects and reads John:
<pre>
$ dyngodb2
> db.test.find({ name: 'John' })
</pre>

Session 1 modifies and saves John:
<pre>
> last.city= 'San Francisco'
> db.test.save(last)
done!
</pre>

Session 2 modifies John, tries to save him and gets an error:
<pre>
> last.country= 'France'
> db.test.save(last)
The item was changed since you read it
</pre>

This is accomplished by a _rev attribute which is incremented
at save time if changes are detected in the object since it was read
(the _old attribute contains a clone of the item at read time).
So when Session 2 tries to save the object, it does the save
[expecting](http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_PutItem.html#DDB-PutItem-request-Expected) the item to have _old._rev in the table, and it fails
because Session 1 already incremented it.

***note:*** when you get the above error you should re-read the object you were trying to save,
and possibly retry your updates; any other save operation on this object will result in bogus
responses.

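A minimal in-memory sketch of this optimistic locking (illustrative only; dyngodb really does this with a conditional PutItem against DynamoDB):

```javascript
const table = new Map();

// Put succeeds only if the stored _rev still equals the _rev the
// session read (_old._rev), mimicking DynamoDB's Expected condition.
function save(item) {
  const current = table.get(item._id);
  const expected = item._old ? item._old._rev : undefined;
  if (current && current._rev !== expected) {
    throw new Error('The item was changed since you read it');
  }
  const saved = { ...item, _rev: (expected || 0) + 1 };
  delete saved._old;
  table.set(item._id, saved);
  return saved;
}

function read(id) {
  const item = { ...table.get(id) };
  return { ...item, _old: { _rev: item._rev } }; // remember the rev we read
}

save({ _id: 'john', name: 'John' });    // stored with _rev 1
const s1 = read('john'), s2 = read('john');
save({ ...s1, city: 'San Francisco' }); // ok, _rev becomes 2
// save({ ...s2, country: 'France' });  // would throw: changed since read
```
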
### Arrays

Currently dyngodb is pretty incoherent about arrays; in fact it has two kinds of array persistence:

* DynamoDB supports sets, which are basically javascript _unordered_ arrays of strings or numbers or binary data,
  so if dyngodb detects an array of one of those types it persists it as a set (hence losing its order):

  <pre>
  db.test.save({ name: 'John', tags: ['developer','hipster','hacker','cool'] })
  </pre>


* Object arrays _are kept in order_ (see [Schema](#schema)):

  <pre>
  db.test.save({ name: 'John', sons: [{ name: 'Konrad' },{ name: 'Sam' },{ name: 'Jill' }] })
  </pre>

  this is accomplished via the _pos RANGE attribute of the collection table. So saving the object above
  would result in 4 items inserted in the DynamoDB table, where 2 HASHes are generated (uuid):

  <pre>
  1. { _id: 'uuid1', _pos: 0, name: 'John', ___sons: 'uuid2' }
  2. { _id: 'uuid2', _pos: 0, name: 'Konrad' }
  3. { _id: 'uuid2', _pos: 1, name: 'Sam' }
  4. { _id: 'uuid2', _pos: 2, name: 'Jill' }
  </pre>
  Finding John would get you this structure:

  <pre>
  db.test.find({ name: 'John' })

  {
    _id: 'uuid1',
    _pos: 0,
    name: 'John',
    ___sons: 'uuid2',
    sons: [
      {
        _id: 'uuid2',
        _pos: 0,
        name: 'Konrad'
      },
      {
        _id: 'uuid2',
        _pos: 1,
        name: 'Sam'
      },
      {
        _id: 'uuid2',
        _pos: 2,
        name: 'Jill'
      }
    ]
  }
  </pre>

  This means that the array is stored within a single hash, with elements at different ranges,
  which may be convenient for retrieving those objects if they live together with the parent object,
  or as a list - which is probably not true for sons...
  So for the case where you "link" other objects inside the array, like:

  <pre>
  konrad= { name: 'Konrad' };
  sam= { name: 'Sam' };
  jill= { name: 'Jill' };
  db.test.save(konrad)
  db.test.save(sam)
  db.test.save(jill)
  db.test.save({ name: 'John', sons: [konrad,sam,jill,{ name: 'Edward' }] })
  </pre>

  here konrad, sam and jill are "standalone" objects with their own hashes, that will be linked from the array,
  while Edward will be contained in it. So in this case things are stored like this:

  <pre>
  1. { _id: 'konrad-uuid', _pos: 0, name: 'Konrad' }
  2. { _id: 'sam-uuid', _pos: 0, name: 'Sam' }
  3. { _id: 'jill-uuid', _pos: 0, name: 'Jill' }
  4. { _id: 'uuid1', _pos: 0, name: 'John', ___sons: 'uuid2' }
  5. { _id: 'uuid2', _pos: 0, _ref: 'konrad-uuid' }
  6. { _id: 'uuid2', _pos: 1, _ref: 'sam-uuid' }
  7. { _id: 'uuid2', _pos: 2, _ref: 'jill-uuid' }
  8. { _id: 'uuid2', _pos: 3, name: 'Edward' }
  </pre>

  Now you see the _ref here, and you probably understand what is going on: dyngodb stores array placeholders
  for objects that *live* in other hashes. Obviously, finding John you will get the right structure:

  <pre>
  db.test.find({ name: 'John' })

  {
    _id: 'uuid1',
    _pos: 0,
    name: 'John',
    ___sons: 'uuid2',
    sons: [
      {
        _id: 'konrad-uuid',
        _pos: 0, // dereferenced from _ref so you get the standalone object with 0 _pos
        name: 'Konrad'
      },
      {
        _id: 'sam-uuid',
        _pos: 0,
        name: 'Sam'
      },
      {
        _id: 'jill-uuid',
        _pos: 0,
        name: 'Jill'
      },
      {
        _id: 'uuid2',
        _pos: 3,
        name: 'Edward'
      }
    ]
  }
  </pre>

* Arrays of arrays or arrays of mixed types: they don't work (actually never tested):

  <pre>
  db.test.save({ name: 'John', xxx: [[{},{}],[{}],[{}]] })
  db.test.save({ name: 'John', xxx: [{},[{}],2] })
  </pre>

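Putting the two listings above together, the read path can be sketched like this (a hypothetical `loadArray` helper, not dyngodb's code): group items by hash, sort the array's hash by _pos, and swap _ref placeholders for the standalone items they point to.

```javascript
// Reassemble an object array from raw items, honoring _pos order and
// dereferencing _ref placeholders into their standalone items.
function loadArray(items, arrayId) {
  const byId = {};
  for (const it of items) (byId[it._id] = byId[it._id] || []).push(it);
  return (byId[arrayId] || [])
    .sort((a, b) => a._pos - b._pos)
    .map(it => it._ref ? byId[it._ref][0] : it); // dereference placeholders
}

const items = [
  { _id: 'konrad-uuid', _pos: 0, name: 'Konrad' },
  { _id: 'uuid2', _pos: 1, name: 'Edward' },
  { _id: 'uuid2', _pos: 0, _ref: 'konrad-uuid' },
];
loadArray(items, 'uuid2'); // → [Konrad item, Edward item]
```
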
### Schema

In dyngodb you have 3 DynamoDB table KeySchemas:

* the one used for collections, where you have _id (string) as the HASH attribute and _pos (number) as the RANGE attribute.
  _id, if not specified in the object, is autogenerated with a UUID V4. _pos is always 0 for objects not contained in
  an array, and is the position of the object in the array for objects contained in arrays (see [Arrays](#arrays)).

* the one used for indexes, where you have _hash (string) as the HASH attribute and _range (string) as the RANGE attribute.
  _hash typically represents the container of the results for a certain operator, and _range is used to keep the key
  attributes of the results (_id+':'+_pos).

* the one used for transaction tables, where you have _id (string) as the HASH attribute and _item (string) as the RANGE attribute. _id represents the transaction id. _item can be (_) the transaction header item, or some kind of item copy:

  1. the target item to put when the transaction should be applied (target::table::hash attr::hash value::range attr::range value)
  2. the copy of the item to rollback (copy::table::hash attr::hash value::range attr::range value)

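For illustration, those composite _item range keys can be built like this (a hypothetical sketch; the exact separator and field order are defined by dyngodb's transaction code):

```javascript
// Pack the kind of copy plus the target item's full primary key
// into a single range-key string, as described above.
function txItemKey(kind, table, hashAttr, hashValue, rangeAttr, rangeValue) {
  return [kind, table, hashAttr, hashValue, rangeAttr, rangeValue].join('::');
}

txItemKey('target', 'test', '_id', 'uuid1', '_pos', '0');
// → 'target::test::_id::uuid1::_pos::0'
txItemKey('copy', 'test', '_id', 'uuid1', '_pos', '0');
// → 'copy::test::_id::uuid1::_pos::0'
```
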
Some automatically generated attributes:

* _id: the object identifier (if not set manually)
* _pos: the object position in an array
* _rev: the revision of the object
* _refs: an array of _id/s referred to by the object (indexable as a string set, see fat.js)
* ___&lt;attr name&gt;: placeholders for arrays
* __&lt;attr name&gt;: placeholders for objects
* _tx: the transaction locking the item
* _txLocked: the time when the transaction locked the item
* _txApplied: the transaction has already modified this item, but is not committed
* _txTransient: the transaction inserted the item to lock it
* _txDeleted: the transaction is going to delete the item on commit

### Transactions

In dyngodb2 there is basic support for transactions; take a look at the [tests](https://github.com/aaaristo/dyngodb/blob/master/test/transaction.test.js).
It is an initial implementation of the protocol described [here](https://github.com/awslabs/dynamodb-transactions/blob/master/DESIGN.md).
All the db.* APIs are still *non-transactional*, while the tx.* APIs, which are really
the same as db.*, behave in a transactional way. Once you get a transaction by
calling db.transaction() you can operate on any number of tables/items, and any modification
you make is committed or rolled back together with the others performed in the same transaction
(this is true also for items generated by indexes like fat.js, while cloud-search.js
fulltext search is still non-transactional).

***Keep in mind that this is completely experimental at this stage.***

### Local

It is possible to use [DynamoDB Local](http://aws.typepad.com/aws/2013/09/dynamodb-local-for-desktop-development.html) by adding *--local* to the command line:
<pre>
dyngodb2 --local
</pre>

### .dyngorc

Using the *.dyngorc* file you can issue some commands before using the console (e.g. ensureIndex).

### standard input (argv by [optimist](https://github.com/substack/node-optimist))

*commands.txt*
<pre>
db.test.save([{ name: argv.somename },{ name: 'Jane' }])
db.test.save([{ name: 'John' },{ name: 'Jane' }])
db.test.save([{ name: 'John' },{ name: 'Jane' }])
</pre>

<pre>
dyngodb2 --somename Jake &lt; commands.txt
</pre>

### Streams (for raw dynamodb items)

Example of moving items between tables with streams (10 by 10):
<pre>
dyngodb2
> t1= db._dyn.stream('table1')
> t2= db._dyn.stream('table2')
> t1.scan({ limit: 10 }).pipe(t2.mput('put')).on('finish',function () { console.log('done'); })
</pre>

### basic CSV (todo: stream)

Example of loading a csv file (see [node-csv](https://github.com/wdavidw/node-csv) for options):
<pre>
dyngodb2
> csv('my/path/to.csv',{ delimiter: ';', escape: '"' },['id','name','mail'])
> last
> db.mytbl.save(last)
</pre>

### basic XLSX

Example of loading an xlsx file:
<pre>
dyngodb2
> workbook= xlsx('my/path/to.xlsx')
> contacts= workbook.sheet('Contacts').toJSON(['id','name','mail'])
> db.mytbl.save(contacts)
</pre>

### Provisioned Throughput

You can increase the throughput automatically (on tables and indexes):
dyngodb will go through the required steps until it reaches
the requested value.

<pre>
dyngodb2
> db.mytbl.modify(1024,1024)
> db.mytbl.indexes[0].modify(1024,1024)
</pre>

### Export / import (todo: stream)

Export:
<pre>
dyngodb2
> db.mytbl.find()
> db.cleanup(last).clean(function (d) { gson('export.gson',d); });
</pre>

Import:
<pre>
dyngodb2
> db.mytbl.save(gson('export.gson'));
</pre>

You can use either the json or the [gson](https://github.com/aaaristo/GSON) function; the only difference is that the gson function
is able to serialize circular object graphs in a non-recursive way.

### Q&D migration from dyngodb to dyngodb2

<pre>
dyngodb
> db.mytbl.find()
> db.cleanup(last).clean(function (d) { gson('export.gson',d); });
cat export.gson | sed 's/"$id"\:/"_id":/g' > export2.gson
dyngodb2
> db.mytbl.save(gson('export2.gson'));
</pre>

Things you may need to update:

* your .dyngorc
* your code: $id -> _id (e.g. sed -i '.old' 's/$id/_id/g' *)

### AngularJS and Express integration

Check: [https://github.com/aaaristo/angular-gson-express-dyngodb](https://github.com/aaaristo/angular-gson-express-dyngodb)

### Help wanted!

Your help is highly appreciated: we need to test / discuss / fix code, performance, roadmap.


[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/aaaristo/dyngodb/trend.png)](https://bitdeli.com/free "Bitdeli Badge")