dyngodb2 [![Stories in Ready](https://badge.waffle.io/aaaristo/dyngodb.png)](http://waffle.io/aaaristo/dyngodb)
========

An **experiment** ([alpha](http://en.wikipedia.org/wiki/Software_release_life_cycle#Alpha)) to get a [MongoDB](http://www.mongodb.org/) *like* interface in front of [DynamoDB](http://aws.amazon.com/dynamodb/)
and [CloudSearch](http://aws.amazon.com/cloudsearch/). Now supporting transactions as described by the [DynamoDB Transactions](https://github.com/awslabs/dynamodb-transactions/blob/master/DESIGN.md) protocol.

In dyngodb2 we dropped the $ sign in favor of _. Also, $version is now called _rev. The old branch is available [here](https://github.com/aaaristo/dyngodb/tree/dollar). Fixes to the old version will be released under the dyngodb npm package, while new releases are under the dyngodb2 npm package.

## Why?

DynamoDB is *elastic*, *cheap* and greatly integrated with many AWS products (e.g. [Elastic MapReduce](http://aws.amazon.com/elasticmapreduce/),
[Redshift](http://aws.amazon.com/redshift/), [Data Pipeline](http://aws.amazon.com/datapipeline/), [S3](http://aws.amazon.com/s3/)),
while MongoDB has a wonderful interface. Using node.js on [Elastic Beanstalk](http://aws.amazon.com/elasticbeanstalk/)
and DynamoDB as your backend, you could end up with a very scalable, cheap and highly available webapp architecture.
For many developers the main obstacle is being able to use DynamoDB productively, hence this project.

## Getting started
Playing around:
<pre>
$ npm install -g dyngodb2
</pre>
<pre>
$ export AWS_ACCESS_KEY_ID=......
$ export AWS_SECRET_ACCESS_KEY=......
$ export AWS_REGION=eu-west-1
$ dyngodb2
> db.createCollection('test')
> db.test.save({ name: 'John', lname: 'Smith' })
> db.test.save({ name: 'Jane', lname: 'Burden' })
> db.test.findOne({ name: 'John' })
> john= last
> john.city= 'London'
> db.test.save(john)
> db.test.find({ name: 'John' })
> db.test.ensureIndex({ name: 'S' })
> db.test.findOne({ name: 'John' })
> db.test.ensureIndex({ $search: { domain: 'mycstestdomain', lang: 'en' } }); /* some CloudSearch */
> db.test.update({ name: 'John' },{ $set: { city: 'Boston' } });
> db.test.find({ $search: { q: 'Boston' } });
> db.test.findOne({ name: 'Jane' }) /* some graphs */
> jane= last
> jane.husband= john
> john.wife= jane
> john.himself= john
> db.test.save(john);
> db.test.save(jane);
> db.ensureTransactionTable(/*name*/) /* some transactions :) */
> db.transaction()
> tx.test.save({ name: 'i\'ll be rolled back :( ' })
> tx.rollback(); /* your index is rolled back too */
> db.transaction()
> tx.test.save({ name: 'i\'ll be committed together with something else' })
> tx.test.save({ name: 'something else' })
> tx.commit(); /* your index is committed too */
> db.test.remove()
> db.test.drop()
</pre>

## Goals

* support a MongoDB *like* query language

* support slice and dice, Amazon EMR and be friendly to tools that integrate with DynamoDB
  (so no compression of JSON objects for storage)

* support graphs, and respect object identity

* prevent lost-updates

* support transactions ([DynamoDB Transactions](https://github.com/awslabs/dynamodb-transactions))

* support fulltext search

## What dyngodb actually does

* Basic find() support (basic operators, no $all, $and, $or..., some projection capabilities):
  finds are implemented via 3 components:

  * parser: parses the query and produces a "query" object that is
    used to track the state of the query from its beginning to the end.

  * finder: tries to retrieve as little data as possible from DynamoDB in the fastest way

  * refiner: "completes" the query, doing all the operations that the finder was not able
    to perform (for lack of support in DynamoDB or because I simply
    haven't found a better way).
* Basic save() support: DynamoDB does not support sub-documents. So the approach here is to
  save sub-documents as documents of the table and link them to the parent object like this:

  <pre>
  db.test.save({ name: 'John', wife: { name: 'Jane' } })
  => 2 items inserted into the test table
  1: { _id: '50fb5b63-8061-4ccf-bbad-a77660101faa',
       name: 'John',
       __wife: '028e84d0-31a9-4f4c-abb6-c6177d85a7ff' }
  2: { _id: '028e84d0-31a9-4f4c-abb6-c6177d85a7ff',
       name: 'Jane' }
  </pre>

  where _id is the HASH of the DynamoDB table. This enables us to respect the javascript object
  identity as it was in memory, and you will get the same structure - even if it were a circular graph -
  (actually with some addons: _id, _rev...) when you query the data out:

  db.test.find({ name: 'John' }) => will SCAN for name: 'John', return the first object, detect __wife
  (__ for an object, ___ for an [array](#arrays)) and get (getItem) the second object. Those meta-attributes are kept
  in the result for later use in save().

* Basic update() support: $set, $unset (should add $push and $pull)

* Basic lost update prevention

### Finders

There are currently 3 types of finders (used in this order):

* Simple: manages _id queries, i.e. the ones where the user specifies the HASH of the DynamoDB table

* Indexed: tries to find an index that is able to find hashes for that query

* Scan: falls back to [Scan](http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html)ning
  the table :(, which you should try to avoid, probably by indexing fields
  or changing some design decisions.

### Indexing

Indexes in dyngodb are DynamoDB tables that have a different KeySchema and contain the data needed
to look up items based on some attributes. This means that typically an index will be used with a
[Query](http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html) operation.

There are currently 2 indexes (4, but only 2 are used):

* [fat](http://dictionary.reference.com/browse/fat).js: as the name suggests it is a pretty heavy
  "general purpose" index that will generate many additional writes: 1 for every field indexed + 1.
  Let's see an example:

  Suppose you have a table like this:
  <pre>
  { type: 'person', category: 'hipster', name: 'Jane', company: 'Acme' }
  { type: 'person', category: 'hacker', name: 'Jack', city: 'Boston' }
  { type: 'person', category: 'hustler', name: 'John', country: 'US' }
  { type: 'company', category: 'enterprise', name: 'Amazon', phone: '13242343' }
  { type: 'company', category: 'hipster', name: 'Plurimedia' }
  </pre>

  And an index like:
  <pre>
  db.test.ensureIndex({ type: 'S', category: 'S', name: 'S' });
  </pre>

  The index will be used in queries like:
  <pre>
  db.test.find({ type: 'person' }).sort({ name: -1 })
  db.test.find({ type: 'person', category: 'hipster' })
  </pre>

  and will NOT be used in queries like these:
  <pre>
  db.test.find({ name: 'Jane' })
  db.test.find({ category: 'hipster', name: 'Jane' })
  db.test.find().sort({ name: -1 })
  </pre>

  and will be used partially (filter on type only) for this query:
  <pre>
  db.test.find({ type: 'person', name: 'Jane' })
  </pre>

  So columns are ordered in the index: you can only use it starting from the first column,
  adding the following ones in the order you defined them in ensureIndex(), each matched with an EQ
  operator. The query (finder) will use the index up to the first non-EQ operator, and the refiner
  will filter/sort the rest. Local secondary indexes are created on all indexed attributes
  to support non-EQ operators, which means that currently you can index at most 5 attributes with
  this kind of index.

  There is also some support for the [$all](http://docs.mongodb.org/manual/reference/operator/query/all/) operator:
  the fat index can index set fields, actually saving the combinations in the index so that you can query them.
  ***NOTE***: if you have long set fields this will seriously impact write throughput.

  <pre>
  db.test.save({ type: 'person', name: 'Jane', tags: ['hipster','hacker'] })
  db.test.ensureIndex({ tags: 'SS' })
  db.test.find({ tags: 'hipster' })
  db.test.find({ tags: { $all: ['hipster','hacker'] } })
  </pre>

  Inspired by [Twitter Bloodhound](https://github.com/twitter/typeahead.js/blob/master/doc/bloodhound.md) and
  [Google Code Search](http://swtch.com/~rsc/regexp/regexp4.html), I recently added a completely experimental $text
  field to the fat.js index so you can do fulltext searches without using CloudSearch, which is currently too
  expensive for small apps.

  <pre>
  db.test.save({ type: 'person', name: 'Jane', about: 'she is the mom of two great childs', tags: ['hipster','hacker'] })
  db.test.ensureIndex({ name: 'S', $text: function (item) { return _.pick(item,['name','about']); } })
  db.test.find({ $text: 'mom childs' }) // should return Jane
  db.test.find({ name: 'Jane', $text: 'mom child' }) // should return Jane, $text is chainable with other fields
  db.test.find({ name: 'John', $text: 'mom child' }) // should return an empty resultset
  </pre>

  As you can see, you create a $text function in ensureIndex() to control which fields you want the index to
  make fulltext-searchable; then you can use a $text field in the query to specify your search terms. The values
  are whitespace-tokenized and [trigrams](http://swtch.com/~rsc/regexp/regexp4.html) are created for every token,
  so that each item that has a trigram is saved as a range of a hash for that trigram. When you query, the query
  string is whitespace-tokenized and for each token trigrams are computed; then any item having all those trigrams
  is returned. The order of words in the query string is ignored. NOTE: this will not scale very well for items with
  large text to index unless you scale the writes for that index. You should balance the cost of your writes compared
  to the same usage on CloudSearch and pick what is best for you; also keep in mind that CloudSearch offers [a lot
  more](http://docs.aws.amazon.com/cloudsearch/latest/developerguide/what-is-cloudsearch.html): stemming, i18n, stoplists, results management...

* cloud-search.js: is a fulltext index using AWS CloudSearch under the covers.

  Suppose you have the same table as before,
  and an index like:
  <pre>
  db.test.ensureIndex({ $search: { domain: 'test', lang: 'en' } });
  </pre>

  You can then search the table like this:
  <pre>
  db.test.find({ $search: { q: 'Acme' } });
  db.test.find({ $search: { bq: "type:'contact'", q: 'John' } });
  </pre>


* you could probably build your own specialized indexes too: just copy the fat.js index and
  add your new your.js index at the top of the indexes array in the indexed.js finder.
  (probably we should give this as a configuration option)

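The trigram matching described for the $text field above can be sketched as follows (an illustrative model with assumed helper names, not fat.js internals): tokenize on whitespace, lowercase, and emit every 3-character window of each token.

```javascript
// Build the set of trigrams for a piece of text.
function trigrams(text) {
  const out = new Set();
  for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    for (let i = 0; i + 3 <= token.length; i++) out.add(token.slice(i, i + 3));
  }
  return out;
}

// An item matches when it contains every trigram of the query,
// regardless of word order (as the text above explains).
function matches(itemText, query) {
  const item = trigrams(itemText);
  return [...trigrams(query)].every(t => item.has(t));
}

matches('she is the mom of two great childs', 'mom childs'); // → true
matches('she is the mom of two great childs', 'mom dad');    // → false
```

Note that tokens shorter than 3 characters produce no trigrams in this sketch, so they do not constrain the match; the real index may handle short tokens differently.
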
### Lost update prevention

Suppose you have two sessions going on.

Session 1 connects and reads John:
<pre>
$ dyngodb2
> db.test.find({ name: 'John' })
</pre>

Session 2 connects and reads John:
<pre>
$ dyngodb2
> db.test.find({ name: 'John' })
</pre>

Session 1 modifies and saves John:
<pre>
> last.city= 'San Francisco'
> db.test.save(last)
done!
</pre>

Session 2 modifies John, tries to save him and gets an error:
<pre>
> last.country= 'France'
> db.test.save(last)
The item was changed since you read it
</pre>

This is accomplished by a _rev attribute which is incremented
at save time if changes are detected in the object since it was read
(the _old attribute contains a clone of the item at read time).
So when Session 2 tries to save the object, it does the save
[expecting](http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_PutItem.html#DDB-PutItem-request-Expected) the item to have _old._rev in the table, and it fails
because Session 1 already incremented it.

***note:*** when you get the above error you should re-read the object you were trying to save,
and possibly retry your updates; any other save operation on this object will result in bogus
responses.

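A minimal in-memory sketch of this optimistic locking (illustrative only; dyngodb really does this with a conditional PutItem against DynamoDB):

```javascript
const table = new Map();

// Put succeeds only if the stored _rev still equals the _rev the
// session read (_old._rev), mimicking DynamoDB's Expected condition.
function save(item) {
  const current = table.get(item._id);
  const expected = item._old ? item._old._rev : undefined;
  if (current && current._rev !== expected) {
    throw new Error('The item was changed since you read it');
  }
  const saved = { ...item, _rev: (expected || 0) + 1 };
  delete saved._old;
  table.set(item._id, saved);
  return saved;
}

function read(id) {
  const item = { ...table.get(id) };
  return { ...item, _old: { _rev: item._rev } }; // remember the rev we read
}

save({ _id: 'john', name: 'John' });    // stored with _rev 1
const s1 = read('john'), s2 = read('john');
save({ ...s1, city: 'San Francisco' }); // ok, _rev becomes 2
// save({ ...s2, country: 'France' });  // would throw: changed since read
```
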
### Arrays

Currently dyngodb is pretty incoherent about arrays; in fact it has two kinds of array persistence:

* DynamoDB supports sets, which are basically javascript _unordered_ arrays of strings or numbers or binary data,
  so if dyngodb detects an array of one of those types it persists it as a set (hence losing its order):

  <pre>
  db.test.save({ name: 'John', tags: ['developer','hipster','hacker','cool'] })
  </pre>


* Object arrays _are kept in order_ (see [Schema](#schema)):

  <pre>
  db.test.save({ name: 'John', sons: [{ name: 'Konrad' },{ name: 'Sam' },{ name: 'Jill' }] })
  </pre>

  this is accomplished via the _pos RANGE attribute of the collection table. So saving the object above
  would result in 4 items inserted in the DynamoDB table, where 2 HASHes are generated (uuid):

  <pre>
  1. { _id: 'uuid1', _pos: 0, name: 'John', ___sons: 'uuid2' }
  2. { _id: 'uuid2', _pos: 0, name: 'Konrad' }
  3. { _id: 'uuid2', _pos: 1, name: 'Sam' }
  4. { _id: 'uuid2', _pos: 2, name: 'Jill' }
  </pre>
  Finding John would get you this structure:

  <pre>
  db.test.find({ name: 'John' })

  {
    _id: 'uuid1',
    _pos: 0,
    name: 'John',
    ___sons: 'uuid2',
    sons: [
      {
        _id: 'uuid2',
        _pos: 0,
        name: 'Konrad'
      },
      {
        _id: 'uuid2',
        _pos: 1,
        name: 'Sam'
      },
      {
        _id: 'uuid2',
        _pos: 2,
        name: 'Jill'
      }
    ]
  }
  </pre>

  This means that the array is stored within a single hash, with elements at different ranges,
  which may be convenient for retrieving those objects if they live together with the parent object,
  or as a list - which is probably not true for sons...
  So for the case where you "link" other objects inside the array, like:

  <pre>
  konrad= { name: 'Konrad' };
  sam= { name: 'Sam' };
  jill= { name: 'Jill' };
  db.test.save(konrad)
  db.test.save(sam)
  db.test.save(jill)
  db.test.save({ name: 'John', sons: [konrad,sam,jill,{ name: 'Edward' }] })
  </pre>

  here konrad, sam and jill are "standalone" objects with their own hashes, that will be linked from the array,
  while Edward will be contained in it. So in this case things are stored like this:

  <pre>
  1. { _id: 'konrad-uuid', _pos: 0, name: 'Konrad' }
  2. { _id: 'sam-uuid', _pos: 0, name: 'Sam' }
  3. { _id: 'jill-uuid', _pos: 0, name: 'Jill' }
  4. { _id: 'uuid1', _pos: 0, name: 'John', ___sons: 'uuid2' }
  5. { _id: 'uuid2', _pos: 0, _ref: 'konrad-uuid' }
  6. { _id: 'uuid2', _pos: 1, _ref: 'sam-uuid' }
  7. { _id: 'uuid2', _pos: 2, _ref: 'jill-uuid' }
  8. { _id: 'uuid2', _pos: 3, name: 'Edward' }
  </pre>

  Now you see the _ref here, and you probably understand what is going on: dyngodb stores array placeholders
  for objects that *live* in other hashes. Obviously, finding John you will get the right structure:

  <pre>
  db.test.find({ name: 'John' })

  {
    _id: 'uuid1',
    _pos: 0,
    name: 'John',
    ___sons: 'uuid2',
    sons: [
      {
        _id: 'konrad-uuid',
        _pos: 0, // dereferenced from _ref so you get the standalone object with 0 _pos
        name: 'Konrad'
      },
      {
        _id: 'sam-uuid',
        _pos: 0,
        name: 'Sam'
      },
      {
        _id: 'jill-uuid',
        _pos: 0,
        name: 'Jill'
      },
      {
        _id: 'uuid2',
        _pos: 3,
        name: 'Edward'
      }
    ]
  }
  </pre>

* Arrays of arrays or arrays of mixed types: they don't work (actually never tested):

  <pre>
  db.test.save({ name: 'John', xxx: [[{},{}],[{}],[{}]] })
  db.test.save({ name: 'John', xxx: [{},[{}],2] })
  </pre>

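Putting the two listings above together, the read path can be sketched like this (a hypothetical `loadArray` helper, not dyngodb's code): group items by hash, sort the array's hash by _pos, and swap _ref placeholders for the standalone items they point to.

```javascript
// Reassemble an object array from raw items, honoring _pos order and
// dereferencing _ref placeholders into their standalone items.
function loadArray(items, arrayId) {
  const byId = {};
  for (const it of items) (byId[it._id] = byId[it._id] || []).push(it);
  return (byId[arrayId] || [])
    .sort((a, b) => a._pos - b._pos)
    .map(it => it._ref ? byId[it._ref][0] : it); // dereference placeholders
}

const items = [
  { _id: 'konrad-uuid', _pos: 0, name: 'Konrad' },
  { _id: 'uuid2', _pos: 1, name: 'Edward' },
  { _id: 'uuid2', _pos: 0, _ref: 'konrad-uuid' },
];
loadArray(items, 'uuid2'); // → [Konrad item, Edward item]
```
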
### Schema

In dyngodb you have 3 DynamoDB table KeySchemas:

* the one used for collections, where you have _id (string) as the HASH attribute and _pos (number) as the RANGE attribute.
  _id, if not specified in the object, is autogenerated with a UUID V4. _pos is always 0 for objects not contained in
  an array, and is the position of the object in the array for objects contained in arrays (see [Arrays](#arrays)).

* the one used for indexes, where you have _hash (string) as the HASH attribute and _range (string) as the RANGE attribute.
  _hash typically represents the container of the results for a certain operator, and _range is used to keep the key
  attributes of the results (_id+':'+_pos).

* the one used for transaction tables, where you have _id (string) as the HASH attribute and _item (string) as the RANGE attribute. _id represents the transaction id. _item can be (_) the transaction header item, or some kind of item copy:

  1. the target item to put when the transaction should be applied (target::table::hash attr::hash value::range attr::range value)
  2. the copy of the item to rollback (copy::table::hash attr::hash value::range attr::range value)

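For illustration, those composite _item range keys can be built like this (a hypothetical sketch; the exact separator and field order are defined by dyngodb's transaction code):

```javascript
// Pack the kind of copy plus the target item's full primary key
// into a single range-key string, as described above.
function txItemKey(kind, table, hashAttr, hashValue, rangeAttr, rangeValue) {
  return [kind, table, hashAttr, hashValue, rangeAttr, rangeValue].join('::');
}

txItemKey('target', 'test', '_id', 'uuid1', '_pos', '0');
// → 'target::test::_id::uuid1::_pos::0'
txItemKey('copy', 'test', '_id', 'uuid1', '_pos', '0');
// → 'copy::test::_id::uuid1::_pos::0'
```
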
Some automatically generated attributes:

* _id: the object identifier (if not set manually)
* _pos: the object position in an array
* _rev: the revision of the object
* _refs: an array of _id/s referred to by the object (indexable as a string set, see fat.js)
* ___&lt;attr name&gt;: placeholders for arrays
* __&lt;attr name&gt;: placeholders for objects
* _tx: the transaction locking the item
* _txLocked: the time when the transaction locked the item
* _txApplied: the transaction has already modified this item, but is not committed
* _txTransient: the transaction inserted the item to lock it
* _txDeleted: the transaction is going to delete the item on commit

### Transactions

In dyngodb2 there is basic support for transactions; take a look at the [tests](https://github.com/aaaristo/dyngodb/blob/master/test/transaction.test.js).
It is an initial implementation of the protocol described [here](https://github.com/awslabs/dynamodb-transactions/blob/master/DESIGN.md).
All the db.* APIs are still *non-transactional*, while the tx.* APIs, which are really
the same as db.*, behave in a transactional way. Once you get a transaction by
calling db.transaction() you can operate on any number of tables/items, and any modification
you make is committed or rolled back together with the others performed in the same transaction
(this is true also for items generated by indexes like fat.js, while cloud-search.js
fulltext search is still non-transactional).

***Keep in mind that this is completely experimental at this stage.***

### Local

It is possible to use [DynamoDB Local](http://aws.typepad.com/aws/2013/09/dynamodb-local-for-desktop-development.html) by adding *--local* to the command line:
<pre>
dyngodb2 --local
</pre>

### .dyngorc

Using the *.dyngorc* file you can issue some commands before using the console (e.g. ensureIndex).

### standard input (argv by [optimist](https://github.com/substack/node-optimist))

*commands.txt*
<pre>
db.test.save([{ name: argv.somename },{ name: 'Jane' }])
db.test.save([{ name: 'John' },{ name: 'Jane' }])
db.test.save([{ name: 'John' },{ name: 'Jane' }])
</pre>

<pre>
dyngodb2 --somename Jake &lt; commands.txt
</pre>

### Streams (for raw dynamodb items)

Example of moving items between tables with streams (10 by 10):
<pre>
dyngodb2
> t1= db._dyn.stream('table1')
> t2= db._dyn.stream('table2')
> t1.scan({ limit: 10 }).pipe(t2.mput('put')).on('finish',function () { console.log('done'); })
</pre>

### basic CSV (todo: stream)

Example of loading a csv file (see [node-csv](https://github.com/wdavidw/node-csv) for options):
<pre>
dyngodb2
> csv('my/path/to.csv',{ delimiter: ';', escape: '"' },['id','name','mail'])
> last
> db.mytbl.save(last)
</pre>

### basic XLSX

Example of loading an xlsx file:
<pre>
dyngodb2
> workbook= xlsx('my/path/to.xlsx')
> contacts= workbook.sheet('Contacts').toJSON(['id','name','mail'])
> db.mytbl.save(contacts)
</pre>

### Provisioned Throughput

You can increase the throughput automatically (on tables and indexes):
dyngodb will go through the required steps until it reaches
the requested value.

<pre>
dyngodb2
> db.mytbl.modify(1024,1024)
> db.mytbl.indexes[0].modify(1024,1024)
</pre>

### Export / import (todo: stream)

Export:
<pre>
dyngodb2
> db.mytbl.find()
> db.cleanup(last).clean(function (d) { gson('export.gson',d); });
</pre>

Import:
<pre>
dyngodb2
> db.mytbl.save(gson('export.gson'));
</pre>

You can use either the json or the [gson](https://github.com/aaaristo/GSON) function; the only difference is that the gson function
is able to serialize circular object graphs in a non-recursive way.

### Q&D migration from dyngodb to dyngodb2

<pre>
dyngodb
> db.mytbl.find()
> db.cleanup(last).clean(function (d) { gson('export.gson',d); });
cat export.gson | sed 's/"$id"\:/"_id":/g' > export2.gson
dyngodb2
> db.mytbl.save(gson('export2.gson'));
</pre>

Things you may need to update:

* your .dyngorc
* your code: $id -> _id (e.g. sed -i '.old' 's/$id/_id/g' *)

### AngularJS and Express integration

Check: [https://github.com/aaaristo/angular-gson-express-dyngodb](https://github.com/aaaristo/angular-gson-express-dyngodb)

### Help wanted!

Your help is highly appreciated: we need to test / discuss / fix code, performance, roadmap.


[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/aaaristo/dyngodb/trend.png)](https://bitdeli.com/free "Bitdeli Badge")