# mongodb-collection-sample [![][npm_img]][npm_url] [![][travis_img]][travis_url]

> Sample documents from a MongoDB collection.

## Install

```
npm install --save mongodb-collection-sample
```

## Example

```
npm install mongodb lodash mongodb-collection-sample
```

```javascript
var sample = require('mongodb-collection-sample');
var mongodb = require('mongodb');
var _ = require('lodash');

// Connect to mongodb
mongodb.connect('mongodb://localhost:27017', function(err, db){
  if(err){
    console.error('Could not connect to mongodb:', err);
    return process.exit(1);
  }

  // Generate 1000 documents
  var docs = _.range(0, 1000).map(function(i) {
    return {
      _id: 'needle_' + i,
      is_even: i % 2 === 0
    };
  });

  // Insert them into a collection
  db.collection('haystack').insert(docs, function(err){
    if(err){
      console.error('Could not insert example documents', err);
      return process.exit(1);
    }

    var options = {};
    // Size of the sample to capture [default: `5`].
    options.size = 5;

    // Query to restrict sample source [default: `{}`]
    options.query = {};

    // Get a stream of sample documents from the collection.
    var stream = sample(db, 'haystack', options);
    stream.on('error', function(err){
      console.error('Error in sample stream', err);
      return process.exit(1);
    });
    stream.on('data', function(doc){
      console.log('Got sampled document `%j`', doc);
    });
    stream.on('end', function(){
      console.log('Sampling complete! Goodbye!');
      db.close();
      process.exit(0);
    });
  });
});
```

## Options

The following options can be passed to `sample(db, coll, options)`:

- `query`: the filter to be used, default is `{}`
- `size`: the number of documents to sample, default is `5`
- `fields`: the fields you want returned (projection object), default is `null`
- `raw`: boolean to return documents as raw BSON buffers, default is `false`
- `sort`: the sort field and direction, default is `{_id: -1}`
- `maxTimeMS`: the time limit in milliseconds after which the operation is terminated, default is `undefined`
- `promoteValues`: boolean that controls whether certain BSON values are cast to native JavaScript values, default is `true`

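
As a rough illustration of how these options combine, the sketch below merges caller overrides onto the documented defaults. The `DEFAULTS` object and the `buildSampleOptions` helper are hypothetical names introduced for this example only; the module applies its own defaults internally, so a helper like this is not required.

```javascript
// Defaults copied from the option list above (illustrative only).
var DEFAULTS = {
  query: {},
  size: 5,
  fields: null,
  raw: false,
  sort: { _id: -1 },
  maxTimeMS: undefined,
  promoteValues: true
};

// Hypothetical helper: returns a new options object with overrides applied.
function buildSampleOptions(overrides) {
  return Object.assign({}, DEFAULTS, overrides || {});
}

var options = buildSampleOptions({
  query: { _id: { $gte: 'needle_500' } }, // restrict the sample source
  size: 20,                               // sample 20 documents
  fields: { _id: 1 }                      // only return the _id field
});

// `options` could then be passed as: sample(db, 'haystack', options)
console.log(options);
```

Options that are not overridden keep the documented defaults, so `options.sort` above is still `{_id: -1}`.
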
## How It Works

#### Native Sampler

On MongoDB version 3.1.6 and above, the `$sample` aggregation operator is generally used:

```
db.collectionName.aggregate([
  {$match: <query>},
  {$sample: {size: <size>}},
  {$project: <fields>},
  {$sort: <sort>}
])
```

However, if more documents are requested than are available, the `$sample` stage
is omitted as a performance optimization. If the sample size is above 5% of the
result set count (but less than 100%), the algorithm falls back to reservoir
sampling to avoid a blocking sort stage on the server.

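
The selection rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not the module's actual code: `versionAtLeast`, `chooseStrategy`, and the returned labels are hypothetical names, and the boundary case where the requested size exactly equals the count is not specified in the text, so the sketch groups it with the "omit `$sample`" case.

```javascript
// Compare dotted version strings numerically; true if a >= b.
function versionAtLeast(a, b) {
  var pa = a.split('.').map(Number);
  var pb = b.split('.').map(Number);
  for (var i = 0; i < Math.max(pa.length, pb.length); i++) {
    var x = pa[i] || 0;
    var y = pb[i] || 0;
    if (x !== y) return x > y;
  }
  return true;
}

// Hypothetical helper: pick a sampling strategy from the server version,
// the requested sample size, and the number of documents matching the query.
function chooseStrategy(serverVersion, size, count) {
  // Servers older than 3.1.6 have no $sample stage.
  if (!versionAtLeast(serverVersion, '3.1.6')) return 'reservoir';
  // Everything is requested: run the pipeline without $sample.
  // (Assumption: size === count is treated the same way.)
  if (size >= count) return 'native-without-$sample';
  // Above 5% of the result set: avoid the blocking sort on the server.
  if (size > 0.05 * count) return 'reservoir';
  return 'native';
}

console.log(chooseStrategy('3.2.0', 5, 1000));    // small sample
console.log(chooseStrategy('3.2.0', 100, 1000));  // 10% of the result set
console.log(chooseStrategy('3.2.0', 2000, 1000)); // more than available
console.log(chooseStrategy('3.0.4', 5, 1000));    // pre-3.1.6 server
```
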
#### Reservoir Sampling

For MongoDB version 3.1.5 and below, we use a client-side reservoir sampling algorithm:

- Query for a stream of `_id` values, limited to 10,000.
- Read the stream of `_id`s and save `sampleSize` randomly chosen values.
- Then query the selected random documents by `_id`.

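
The middle step can be sketched with the textbook reservoir sampling procedure (often called Algorithm R). This is a minimal in-memory illustration that works on an array rather than a stream, and it is not necessarily how the module implements it; `reservoirSample` and its optional `random` parameter are hypothetical names.

```javascript
// Algorithm R: keep the first `sampleSize` items, then let each later item
// replace a random slot with probability sampleSize / (itemsSeenSoFar).
function reservoirSample(ids, sampleSize, random) {
  random = random || Math.random;
  var reservoir = [];
  ids.forEach(function(id, i) {
    if (i < sampleSize) {
      reservoir.push(id);
      return;
    }
    // Pick an index in [0, i]; only indexes inside the reservoir cause a swap.
    var j = Math.floor(random() * (i + 1));
    if (j < sampleSize) {
      reservoir[j] = id;
    }
  });
  return reservoir;
}

// Stand-in for the stream of _id values (limit 10,000).
var ids = [];
for (var i = 0; i < 10000; i++) {
  ids.push('needle_' + i);
}

var chosen = reservoirSample(ids, 5);

// The selected documents can then be fetched with a single $in filter.
var fetchQuery = { _id: { $in: chosen } };
console.log(fetchQuery);
```

Because each `_id` is considered exactly once and the reservoir never grows beyond `sampleSize`, memory use stays constant no matter how many `_id` values are read.
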
The two modes, illustrated:

[![][sampling_post_316_png]][sampling_post_316_svg]
[![][sampling_pre_316_png]][sampling_pre_316_svg]

## Performance Notes

For peak performance of the client-side reservoir sampler, keep the following guidelines in mind.

- The initial query for a stream of `_id` values must be limited to some finite value (default: 10,000).
- This query should be covered by an index.
- Since there's a limit, you may wish to bias for recent documents via a sort (default: `{_id: -1}`).
- [Don't sort on {$natural: -1}](https://docs.mongodb.org/manual/reference/operator/meta/natural): this forces a collection scan!

> Queries that include a sort by $natural order do not use indexes to fulfill the query predicate

- When retrieving docs, batch using one `$in` to reduce network chattiness.

## License

Apache 2

[travis_img]: https://secure.travis-ci.org/mongodb-js/collection-sample.svg?branch=master
[travis_url]: https://travis-ci.org/mongodb-js/collection-sample
[npm_img]: https://img.shields.io/npm/v/mongodb-collection-sample.svg
[npm_url]: https://www.npmjs.org/package/mongodb-collection-sample
[gitter_img]: https://badges.gitter.im/Join%20Chat.svg
[sampling_post_316_png]: docs/sampling_analyzing_post_316.png?raw=true
[sampling_post_316_svg]: docs/sampling_analyzing_post_316.svg
[sampling_pre_316_png]: docs/sampling_analyzing_pre_316.png?raw=true
[sampling_pre_316_svg]: docs/sampling_analyzing_pre_316.svg