# mongodb-collection-sample [![][npm_img]][npm_url] [![][travis_img]][travis_url]

> Sample documents from a MongoDB collection.

## Install

```
npm install --save mongodb-collection-sample
```

## Example

```
npm install mongodb lodash mongodb-collection-sample
```

```javascript
var sample = require('mongodb-collection-sample');
var mongodb = require('mongodb');
var _ = require('lodash');

// Connect to mongodb
mongodb.connect('mongodb://localhost:27017', function(err, db){
  if(err){
    console.error('Could not connect to mongodb:', err);
    return process.exit(1);
  }

  // Generate 1000 documents
  var docs = _.range(0, 1000).map(function(i) {
    return {
      _id: 'needle_' + i,
      is_even: i % 2 === 0
    };
  });

  // Insert them into a collection
  db.collection('haystack').insert(docs, function(err){
    if(err){
      console.error('Could not insert example documents', err);
      return process.exit(1);
    }

    var options = {};
    // Size of the sample to capture [default: `5`].
    options.size = 5;

    // Query to restrict sample source [default: `{}`]
    options.query = {};

    // Get a stream of sample documents from the collection.
    var stream = sample(db, 'haystack', options);
    stream.on('error', function(err){
      console.error('Error in sample stream', err);
      return process.exit(1);
    });
    stream.on('data', function(doc){
      console.log('Got sampled document `%j`', doc);
    });
    stream.on('end', function(){
      console.log('Sampling complete! Goodbye!');
      db.close();
      process.exit(0);
    });
  });
});
```

## Options

The following options can be passed to `sample(db, coll, options)`:

- `query`: the filter to be used, default is `{}`
- `size`: the number of documents to sample, default is `5`
- `fields`: the fields you want returned (projection object), default is `null`
- `raw`: boolean to return documents as raw BSON buffers, default is `false`
- `sort`: the sort field and direction, default is `{_id: -1}`
- `maxTimeMS`: the maxTimeMS value after which the operation is terminated, default is `undefined`
- `promoteValues`: boolean controlling whether certain BSON values are cast to native JavaScript values, default is `true`
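To make the defaults above concrete, here is a minimal sketch of how caller-supplied options could be merged with them. The `withDefaults` helper is a hypothetical name used for illustration and is not part of this module's API; only the default values themselves come from the list above.

```javascript
// Hypothetical helper (not part of mongodb-collection-sample): merges
// caller-supplied options over the documented defaults.
function withDefaults(options) {
  // Defaults are rebuilt on every call so callers never share objects.
  var defaults = {
    query: {},
    size: 5,
    fields: null,
    raw: false,
    sort: {_id: -1},
    maxTimeMS: undefined,
    promoteValues: true
  };
  return Object.assign({}, defaults, options);
}

// Override only what differs from the defaults.
var merged = withDefaults({size: 100, query: {is_even: true}});
console.log(merged.size, merged.sort, merged.raw);
```

Anything not overridden keeps its documented default, so `sample(db, 'haystack', {size: 100})` only needs to name the options that differ.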


## How It Works

#### Native Sampler

On MongoDB version 3.1.6 and above, the sampler generally uses the `$sample` aggregation stage:

```
db.collectionName.aggregate([
  {$match: <query>},
  {$sample: {size: <size>}},
  {$project: <fields>},
  {$sort: <sort>}
])
```

However, if more documents are requested than are available, the `$sample` stage
is omitted as a performance optimization. If the sample size is above 5% of the
result set count (but less than 100%), the algorithm falls back to reservoir
sampling to avoid a blocking sort stage on the server.
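The decision described above can be sketched as two small functions. This is an illustration under stated assumptions, not the module's actual code: the names `chooseStrategy` and `buildPipeline` and the strategy labels are hypothetical, the `$project` stage is skipped when `fields` is `null` (an assumption), and a sample size of exactly 100% of the result set is treated as the native case because the text above does not specify it.

```javascript
// Hypothetical sketch of the strategy choice; names and labels are illustrative.
function chooseStrategy(size, count) {
  if (size > count) {
    return 'all'; // more requested than available: omit the $sample stage
  }
  // Integer form of "size is above 5% of count but less than 100%".
  if (size * 100 > count * 5 && size < count) {
    return 'reservoir'; // client-side fallback, avoids a blocking server sort
  }
  // Assumption: everything else (including exactly 100%) uses $sample.
  return 'native';
}

// Builds the aggregation pipeline shown above, with or without $sample.
function buildPipeline(opts, includeSample) {
  var pipeline = [{$match: opts.query}];
  if (includeSample) {
    pipeline.push({$sample: {size: opts.size}});
  }
  if (opts.fields) {
    // Assumption: no $project stage when no projection was requested.
    pipeline.push({$project: opts.fields});
  }
  pipeline.push({$sort: opts.sort});
  return pipeline;
}

var opts = {query: {}, size: 5, fields: null, sort: {_id: -1}};
var strategy = chooseStrategy(opts.size, 1000);
console.log(strategy, JSON.stringify(buildPipeline(opts, strategy === 'native')));
```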


#### Reservoir Sampling

For MongoDB version 3.1.5 and below, we use a client-side reservoir sampling algorithm:

- Query for a stream of `_id` values, limited to 10,000.
- Read the stream of `_id`s and save `sampleSize` randomly chosen values.
- Then query the selected random documents by `_id`.
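The second step above is where the reservoir comes in. As a minimal sketch, here is the classic reservoir sampling technique (Algorithm R) applied to an in-memory array of `_id` values. The module itself reads from a stream and its implementation may differ; `reservoirSample` and its optional `random` parameter are hypothetical names introduced for illustration.

```javascript
// Hypothetical sketch of Algorithm R; not the module's actual implementation.
// Keeps `sampleSize` items chosen uniformly at random from `ids` in one pass.
function reservoirSample(ids, sampleSize, random) {
  random = random || Math.random; // injectable for deterministic tests
  var reservoir = [];
  for (var i = 0; i < ids.length; i++) {
    if (i < sampleSize) {
      // Fill the reservoir with the first `sampleSize` items.
      reservoir.push(ids[i]);
    } else {
      // Pick a slot j uniformly from [0, i]; the new item replaces an
      // existing one with probability sampleSize / (i + 1).
      var j = Math.floor(random() * (i + 1));
      if (j < sampleSize) {
        reservoir[j] = ids[i];
      }
    }
  }
  return reservoir;
}

var ids = [];
for (var n = 0; n < 1000; n++) {
  ids.push('needle_' + n);
}
var picked = reservoirSample(ids, 5);
console.log(picked);
```

Because each incoming value is either kept or discarded immediately, memory use stays proportional to `sampleSize` rather than to the number of `_id` values read.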

The two modes, illustrated:

[![][sampling_post_316_png]][sampling_post_316_svg]
[![][sampling_pre_316_png]][sampling_pre_316_svg]

## Performance Notes

For peak performance of the client-side reservoir sampler, keep the following guidelines in mind.

- The initial query for a stream of `_id` values must be limited to some finite value (default: 10,000).
- This query should be covered by an index.
- Since there's a limit, you may wish to bias for recent documents via a sort (default: `{_id: -1}`).
- [Don't sort on `{$natural: -1}`](https://docs.mongodb.org/manual/reference/operator/meta/natural): this forces a collection scan!

> Queries that include a sort by $natural order do not use indexes to fulfill the query predicate

- When retrieving docs: batch using one `$in` to reduce network chattiness.
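As a rough sketch of the guidelines above, the two queries issued by the client-side sampler could be described with plain objects like these. The helper names `idQuerySpec` and `fetchFilter` and the shape of the returned objects are assumptions made for illustration; they are not part of this module or of the MongoDB driver.

```javascript
// Hypothetical helpers; the returned objects only describe the queries.
// Step 1: a bounded, sorted query that returns nothing but _id values.
function idQuerySpec(query, sort, limit) {
  return {
    filter: query || {},
    projection: {_id: 1},     // only _id, so an index can cover the query
    sort: sort || {_id: -1},  // bias toward recent documents
    limit: limit || 10000     // must be finite
  };
}

// Step 2: fetch all sampled documents in one round trip with a single $in.
function fetchFilter(sampledIds) {
  return {_id: {$in: sampledIds}};
}

var spec = idQuerySpec({}, null, null);
var filter = fetchFilter(['needle_3', 'needle_42']);
console.log(JSON.stringify(spec), JSON.stringify(filter));
```

Whether the first query is actually covered depends on the filter: with an empty filter the default `_id` index suffices, while a non-empty filter would need an index that includes the filtered fields.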

## License

Apache 2

[travis_img]: https://secure.travis-ci.org/mongodb-js/collection-sample.svg?branch=master
[travis_url]: https://travis-ci.org/mongodb-js/collection-sample
[npm_img]: https://img.shields.io/npm/v/mongodb-collection-sample.svg
[npm_url]: https://www.npmjs.org/package/mongodb-collection-sample
[gitter_img]: https://badges.gitter.im/Join%20Chat.svg
[sampling_post_316_png]: docs/sampling_analyzing_post_316.png?raw=true
[sampling_post_316_svg]: docs/sampling_analyzing_post_316.svg
[sampling_pre_316_png]: docs/sampling_analyzing_pre_316.png?raw=true
[sampling_pre_316_svg]: docs/sampling_analyzing_pre_316.svg