<p align="center">
  <a href="https://github.com/bda-research/node-crawler">
    <img alt="Node.js" src="https://raw.githubusercontent.com/bda-research/node-crawler/master/crawler_primary.png" width="400"/>
  </a>
</p>


#
[![npm package](https://nodei.co/npm/crawler.png?downloads=true&downloadRank=true&stars=true)](https://nodei.co/npm/crawler/)

[![build status](https://travis-ci.org/bda-research/node-crawler.svg?branch=master)](https://travis-ci.org/bda-research/node-crawler)
[![Coverage Status](https://coveralls.io/repos/github/bda-research/node-crawler/badge.svg?branch=master)](https://coveralls.io/github/bda-research/node-crawler?branch=master)
[![Dependency Status](https://david-dm.org/bda-research/node-crawler/status.svg)](https://david-dm.org/bda-research/node-crawler)
[![NPM download][download-image]][download-url]
[![NPM quality][quality-image]][quality-url]
[![Gitter](https://img.shields.io/badge/gitter-join_chat-blue.svg?style=flat-square)](https://gitter.im/node-crawler/discuss?utm_source=badge)

[quality-image]: http://npm.packagequality.com/shield/crawler.svg?style=flat-square
[quality-url]: http://packagequality.com/#?package=crawler
[download-image]: https://img.shields.io/npm/dm/crawler.svg?style=flat-square
[download-url]: https://npmjs.org/package/crawler

The most powerful, popular, production-ready crawling/scraping package for Node. Happy hacking :)

Features:

 * Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM
 * Configurable pool size and retries
 * Rate limit control
 * Priority queue of requests
 * forceUTF8 mode that lets the crawler handle charset detection and conversion for you
 * Compatible with Node.js 4.x or newer

Here is the [CHANGELOG](https://github.com/bda-research/node-crawler/blob/master/CHANGELOG.md)

Thanks to [Authuir](https://github.com/authuir), we have [Chinese](http://node-crawler.readthedocs.io/zh_CN/latest/) docs. Other languages are welcome!

# Table of Contents
- [Get started](#get-started)
  * [Install](#install)
  * [Basic usage](#basic-usage)
  * [Slow down](#slow-down)
  * [Custom parameters](#custom-parameters)
  * [Raw body](#raw-body)
  * [preRequest](#prerequest)
- [Advanced](#advanced)
  * [Send request directly](#send-request-directly)
  * [Work with bottleneck](#work-with-bottleneck)
  * [Class:Crawler](#classcrawler)
    + [Event: 'schedule'](#event-schedule)
    + [Event: 'limiterChange'](#event-limiterchange)
    + [Event: 'request'](#event-request)
    + [Event: 'drain'](#event-drain)
    + [crawler.queue(uri|options)](#crawlerqueueurioptions)
    + [crawler.queueSize](#crawlerqueuesize)
  * [Options reference](#options-reference)
    + [Basic request options](#basic-request-options)
    + [Callbacks](#callbacks)
    + [Schedule options](#schedule-options)
    + [Retry options](#retry-options)
    + [Server-side DOM options](#server-side-dom-options)
    + [Charset encoding](#charset-encoding)
    + [Cache](#cache)
    + [Http headers](#http-headers)
  * [Work with Cheerio or JSDOM](#work-with-cheerio-or-jsdom)
    + [Working with Cheerio](#working-with-cheerio)
    + [Work with JSDOM](#work-with-jsdom)
- [How to test](#how-to-test)
  * [Alternative: Docker](#alternative-docker)
- [Rough todolist](#rough-todolist)

# Get started

## Install

```sh
$ npm install crawler
```

## Basic usage

```js
var Crawler = require("crawler");

var c = new Crawler({
    maxConnections : 10,
    // This will be called for each crawled page
    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            var $ = res.$;
            // $ is Cheerio by default,
            // a lean implementation of core jQuery designed specifically for the server
            console.log($("title").text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('http://www.amazon.com');

// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);
```

## Slow down
Use `rateLimit` to slow down when you are visiting web sites.

```js
var Crawler = require("crawler");

var c = new Crawler({
    rateLimit: 1000, // `maxConnections` will be forced to 1
    callback: function(err, res, done){
        if(err){
            console.log(err);
        }else{
            console.log(res.$("title").text());
        }
        done();
    }
});

// `tasks` is an array of URLs or option objects, as accepted by `c.queue`
c.queue(tasks); // between two tasks, the minimum time gap is 1000 ms
```

## Custom parameters

Sometimes you need to access variables from a previous request/response session. To do so, pass them as extra parameters alongside the options:

```js
c.queue({
    uri:"http://www.google.com",
    parameter1:"value1",
    parameter2:"value2",
    parameter3:"value3"
})
```

then access them in the callback via `res.options`:

```js
console.log(res.options.parameter1);
```

Crawler passes on only the options that request needs, so don't worry about the extra ones.
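
As a rough sketch of how custom parameters can carry state from one request to the next, the hypothetical helper below builds the task for the following page from `res.options`. The `page` parameter, the `nextPageTask` name and the URL pattern are assumptions made for illustration; only `uri` and `res.options` come from the text above.

```js
// Hypothetical helper: build the options for the next page from the
// custom `page` parameter carried on the previous task.
function nextPageTask(res) {
    var page = res.options.page + 1;
    return {
        uri: 'http://www.example.com/list?page=' + page, // placeholder URL
        page: page // custom parameter, readable later as res.options.page
    };
}

// Inside a crawler callback one might then call:
//     c.queue(nextPageTask(res));
```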

## Raw body

If you are downloading files such as images, PDFs or Word documents, you have to save the raw response body, which means Crawler shouldn't convert it to a string. To make that happen, set `encoding` to `null`:

```js
var fs = require('fs');

var c = new Crawler({
    encoding:null,
    jQuery:false,// set false to suppress warning message.
    callback:function(err, res, done){
        if(err){
            console.error(err.stack);
        }else{
            fs.createWriteStream(res.options.filename).write(res.body);
        }

        done();
    }
});

c.queue({
    uri:"https://nodejs.org/static/images/logos/nodejs-1920x1200.png",
    filename:"nodejs-1920x1200.png"
});
```
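
When many files are queued, the `filename` parameter could be derived from the URL instead of being typed by hand. The sketch below is one possible way to do that with Node's built-in `path` and `url` modules; `downloadTask` is a hypothetical helper, not part of Crawler.

```js
var path = require('path');
var url = require('url');

// Hypothetical helper: derive `filename` from the last path segment of the URL.
function downloadTask(uri) {
    var pathname = new url.URL(uri).pathname;
    return {
        uri: uri,
        filename: path.basename(pathname)
    };
}

console.log(downloadTask('https://nodejs.org/static/images/logos/nodejs-1920x1200.png').filename);
// prints "nodejs-1920x1200.png"
```

A URL whose path has no file-like last segment will not yield a usable name this way, so real code would need a fallback.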

## preRequest

If you want to do something either synchronously or asynchronously before each request, you can try the code below. Note that direct requests won't trigger preRequest.

```js
var c = new Crawler({
    preRequest: function(options, done) {
        // 'options' here is not the 'options' you pass to 'c.queue'; it is the options that are going to be passed to the 'request' module
        console.log(options);
        // when done is called, the request will start
        done();
    },
    callback: function(err, res, done) {
        if(err) {
            console.log(err);
        } else {
            console.log(res.statusCode);
        }
        done();
    }
});

c.queue({
    uri: 'http://www.google.com',
    // this will override the 'preRequest' defined in crawler
    preRequest: function(options, done) {
        setTimeout(function() {
            console.log(options);
            done();
        }, 1000);
    }
});
```

# Advanced

## Send request directly

In case you want to send a request directly without going through the scheduler in Crawler, try the code below. `direct` takes the same options as `queue`; please refer to [options](#options-reference) for details. The difference is that when calling `direct`, `callback` must be defined explicitly, with two arguments, `error` and `response`, which are the same as those of the `callback` of method `queue`.

```js
crawler.direct({
    uri: 'http://www.google.com',
    skipEventRequest: false, // defaults to true, i.e. direct requests won't trigger Event:'request'
    callback: function(error, response) {
        if(error) {
            console.log(error)
        } else {
            console.log(response.statusCode);
        }
    }
});
```
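
Because `direct` reports its result through a two-argument callback, it can be wrapped in a Promise. The following is a minimal sketch, not something Crawler provides: `directAsync` is a hypothetical name, and the only Crawler API it relies on is `crawler.direct` with the `callback(error, response)` option described above.

```js
// Hypothetical wrapper: turn `crawler.direct` into a Promise-returning call.
function directAsync(crawler, options) {
    return new Promise(function (resolve, reject) {
        // Copy the caller's options and supply the mandatory callback.
        var opts = Object.assign({}, options, {
            callback: function (error, response) {
                if (error) {
                    reject(error);
                } else {
                    resolve(response);
                }
            }
        });
        crawler.direct(opts);
    });
}

// Usage, assuming `crawler` is an existing Crawler instance:
//     directAsync(crawler, { uri: 'http://www.google.com' })
//         .then(function (response) { console.log(response.statusCode); });
```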

## Work with bottleneck

Control the rate limit with limiters. All tasks submitted to a limiter will abide by the `rateLimit` and `maxConnections` restrictions of that limiter. `rateLimit` is the minimum time gap between two tasks. `maxConnections` is the maximum number of tasks that can be running at the same time. Limiters are independent of each other. One common use case is setting different limiters for different proxies. Note that when `rateLimit` is set to a non-zero value, `maxConnections` will be forced to 1.

```js
var Crawler = require('crawler');

var c = new Crawler({
    rateLimit: 2000,
    maxConnections: 1,
    callback: function(error, res, done) {
        if(error) {
            console.log(error)
        } else {
            var $ = res.$;
            console.log($('title').text())
        }
        done();
    }
})

// if you want to crawl some website with 2000ms gap between requests
c.queue('http://www.somewebsite.com/page/1')
c.queue('http://www.somewebsite.com/page/2')
c.queue('http://www.somewebsite.com/page/3')

// if you want to crawl some website using proxy with 2000ms gap between requests for each proxy
c.queue({
    uri:'http://www.somewebsite.com/page/1',
    limiter:'proxy_1',
    proxy:'proxy_1'
})
c.queue({
    uri:'http://www.somewebsite.com/page/2',
    limiter:'proxy_2',
    proxy:'proxy_2'
})
c.queue({
    uri:'http://www.somewebsite.com/page/3',
    limiter:'proxy_3',
    proxy:'proxy_3'
})
c.queue({
    uri:'http://www.somewebsite.com/page/4',
    limiter:'proxy_1',
    proxy:'proxy_1'
})
```
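
For longer URL lists, the per-proxy tasks above could be generated rather than written out. The sketch below assigns URLs to proxies round-robin and reuses each proxy string as its limiter name; `tasksForProxies` and the proxy addresses are hypothetical, while `uri`, `proxy` and `limiter` are the options used in the example above.

```js
// Hypothetical helper: spread URLs over proxies round-robin and give each
// proxy its own limiter by reusing the proxy string as the limiter name.
function tasksForProxies(uris, proxies) {
    return uris.map(function (uri, i) {
        var proxy = proxies[i % proxies.length];
        return { uri: uri, proxy: proxy, limiter: proxy };
    });
}

var proxyTasks = tasksForProxies(
    [
        'http://www.somewebsite.com/page/1',
        'http://www.somewebsite.com/page/2',
        'http://www.somewebsite.com/page/3'
    ],
    ['http://proxy-a:8080', 'http://proxy-b:8080'] // placeholder proxy addresses
);

console.log(proxyTasks.map(function (t) { return t.limiter; }));
// Usage with a Crawler instance `c`: c.queue(proxyTasks);
```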

Normally, all limiter instances in the crawler's limiter cluster are instantiated with the options specified in the crawler constructor. You can change a property of any limiter by calling the code below. Currently, only the `rateLimit` property of a limiter can be changed. Note that the default limiter is addressed by the name `default`, e.g. `c.setLimiterProperty('default', 'rateLimit', 3000)`. We strongly recommend that you leave limiters unchanged after their instantiation unless you know exactly what you are doing.
```js
var c = new Crawler({});
c.setLimiterProperty('limiterName', 'propertyName', value)
```


## Class:Crawler

### Event: 'schedule'
 * `options` [Options](#options-reference)

Emitted when a task is being added to the scheduler.

```js
crawler.on('schedule',function(options){
    options.proxy = "http://proxy:port";
});
```

### Event: 'limiterChange'
 * `options` [Options](#options-reference)
 * `limiter` [String](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#String_type)

Emitted when the limiter has been changed.

### Event: 'request'
 * `options` [Options](#options-reference)

Emitted when the crawler is ready to send a request.

If you need to modify options at the last stage before the request is sent, listen for this event.

```js
crawler.on('request',function(options){
    options.qs = options.qs || {}; // make sure a query-string object exists
    options.qs.timestamp = new Date().getTime();
});
```

### Event: 'drain'

Emitted when the queue is empty.

```js
crawler.on('drain',function(){
    // For example, release a connection to database.
    db.end();// close connection to MySQL
});
```
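
The `'drain'` event can also be turned into a Promise, which may be convenient when the crawl is one step in a larger asynchronous flow. This is only a sketch: `whenDrained` is a hypothetical helper, and it assumes the crawler offers the standard Node.js `EventEmitter` method `once` in addition to the `on` method shown above.

```js
// Hypothetical helper: resolve once the crawler emits 'drain'.
// Assumes `crawler.once` behaves like EventEmitter#once.
function whenDrained(crawler) {
    return new Promise(function (resolve) {
        crawler.once('drain', resolve);
    });
}

// Usage, assuming `crawler` is a Crawler instance with tasks already queued
// and `db` is an open database connection:
//     whenDrained(crawler).then(function () { db.end(); });
```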

### crawler.queue(uri|options)
 * `uri` [String](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#String_type)
 * `options` [Options](#options-reference)

Enqueue a task and wait for it to be executed.

### crawler.queueSize
 * [Number](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type)

Size of the queue, read-only.


## Options reference


You can pass these options to the Crawler() constructor if you want them to be global, or as
items in the queue() calls if you want them to be specific to that item (overwriting global options).

This options list is a strict superset of [mikeal's request options](https://github.com/mikeal/request#requestoptions-callback) and will be directly passed to
the request() method.
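
The precedence described above can be pictured as a shallow merge in which per-item options win over global ones. The snippet is only an illustration of that documented behaviour using `Object.assign`, not Crawler's actual merging code, and the option values are made up.

```js
// Options given to the Crawler() constructor (global).
var globalOptions = { timeout: 15000, retries: 3, jQuery: true };
// Options given to a single queue() call (item-specific).
var itemOptions = { uri: 'http://www.example.com/', jQuery: false };

// Later sources overwrite earlier ones, so item-level settings take priority.
var effective = Object.assign({}, globalOptions, itemOptions);

console.log(effective.jQuery);  // false (from the queue() item)
console.log(effective.timeout); // 15000 (inherited from the constructor)
```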

### Basic request options

 * `options.uri`: [String](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#String_type) The URL you want to crawl.
 * `options.timeout` : [Number](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type) In milliseconds (Default 15000).
 * [All mikeal's request options are accepted](https://github.com/mikeal/request#requestoptions-callback).

### Callbacks

 * `callback(error, res, done)`: Function that will be called after a request has completed
     * `error`: [Error](https://nodejs.org/api/errors.html)
     * `res`: [http.IncomingMessage](https://nodejs.org/api/http.html#http_class_http_incomingmessage) A standard IncomingMessage response that also includes `$` and `options`
         * `res.statusCode`: [Number](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type) HTTP status code, e.g. `200`
         * `res.body`: [Buffer](https://nodejs.org/api/buffer.html) | [String](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#String_type) HTTP response content, which could be e.g. an HTML page, plain text or an XML document.
         * `res.headers`: [Object](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object) HTTP response headers
         * `res.request`: [Request](https://github.com/request/request) An instance of Mikeal's `Request` instead of [http.ClientRequest](https://nodejs.org/api/http.html#http_class_http_clientrequest)
             * `res.request.uri`: [urlObject](https://nodejs.org/api/url.html#url_url_strings_and_url_objects) The parsed URL of the HTTP request
             * `res.request.method`: [String](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#String_type) HTTP request method, e.g. `GET`
             * `res.request.headers`: [Object](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object) HTTP request headers
         * `res.options`: [Options](#options-reference) of this task
         * `res.$`: [jQuery Selector](https://api.jquery.com/category/selectors/) A selector for the HTML or XML document.
     * `done`: [Function](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Function) It must be called when you have finished your work in the callback.

### Schedule options

 * `options.maxConnections`: [Number](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type) Size of the worker pool (Default 10).
 * `options.rateLimit`: [Number](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type) Number of milliseconds to delay between requests (Default 0).
 * `options.priorityRange`: [Number](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type) Range of acceptable priorities starting from 0 (Default 10).
 * `options.priority`: [Number](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type) Priority of this request (Default 5). Low values have higher priority.
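
To illustrate "low values have higher priority", the sketch below sorts some pending tasks the way a priority queue would be expected to hand them out. It is a plain-JavaScript illustration rather than Crawler's scheduler, it assumes all three tasks are waiting in the same queue at the same time, and the URLs are placeholders.

```js
var pending = [
    { uri: 'http://www.example.com/archive', priority: 9 },
    { uri: 'http://www.example.com/', priority: 0 },
    { uri: 'http://www.example.com/news', priority: 5 } // 5 is the documented default
];

// Ascending sort: the smallest priority number is served first.
var order = pending.slice().sort(function (a, b) {
    return a.priority - b.priority;
}).map(function (task) {
    return task.uri;
});

console.log(order);
```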

### Retry options

 * `options.retries`: [Number](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type) Number of retries if the request fails (Default 3).
 * `options.retryTimeout`: [Number](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type) Number of milliseconds to wait before retrying (Default 10000).
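
As a back-of-the-envelope estimate of how long a persistently failing request can keep a worker busy, the sketch below combines `timeout`, `retries` and `retryTimeout`. It assumes every attempt runs into the full timeout and that the retry delay is applied once before each retry; actual behaviour may differ, and `worstCaseMs` is a hypothetical name.

```js
// Rough upper bound under the assumptions stated above.
function worstCaseMs(options) {
    var attempts = options.retries + 1; // first try plus the retries
    return attempts * options.timeout + options.retries * options.retryTimeout;
}

// With the documented defaults: 4 * 15000 + 3 * 10000
console.log(worstCaseMs({ timeout: 15000, retries: 3, retryTimeout: 10000 })); // 90000
```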

### Server-side DOM options

 * `options.jQuery`: [Boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type)|[String](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#String_type)|[Object](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object) Injects a `cheerio` selector with the default configuration if `true` or `"cheerio"`, or a customized `cheerio` if given an object with [Parser options](https://github.com/fb55/htmlparser2/wiki/Parser-options). Disables jQuery selector injection if `false`. If you have a memory leak issue in your project, use `"whacko"`, an alternative parser, to avoid it. (Default true)

### Charset encoding

 * `options.forceUTF8`: [Boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) If true, crawler will get the charset from the HTTP headers or the meta tag in the HTML and convert the content to UTF-8 if necessary. Never worry about encoding anymore! (Default true)
 * `options.incomingEncoding`: [String](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#String_type) Used with `forceUTF8: true` to set the encoding manually (Default null) so that crawler does not have to detect the charset itself. For example, `incomingEncoding : 'windows-1255'`. See [all supported encodings](https://github.com/ashtuchkin/iconv-lite/wiki/Supported-Encodings)

### Cache

 * `options.skipDuplicates`: [Boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) If true, skips URIs that were already crawled, without even calling callback() (Default false). __This is not recommended__; it's better to handle duplicates outside `Crawler`, e.g. with [seenreq](https://github.com/mike442144/seenreq)
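
A minimal sketch of handling duplicates outside `Crawler` follows. It uses a plain in-memory `Set` rather than seenreq (whose API is not shown here); `makeFilter` and the exact-string comparison are assumptions for illustration, and real crawls usually need URL normalization and persistent storage.

```js
// Returns a function that reports true only the first time it sees a URI.
function makeFilter() {
    var seen = new Set();
    return function (uri) {
        if (seen.has(uri)) {
            return false;
        }
        seen.add(uri);
        return true;
    };
}

var isNew = makeFilter();
var uris = [
    'http://www.example.com/a',
    'http://www.example.com/b',
    'http://www.example.com/a'
];
var unique = uris.filter(function (uri) { return isNew(uri); });

console.log(unique.length); // 2
// Usage with a Crawler instance `c`: c.queue(unique);
```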

### Http headers

 * `options.rotateUA`: [Boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) If true, `userAgent` should be an array, and crawler will rotate through it (Default false)
 * `options.userAgent`: [String](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#String_type)|[Array](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array) If `rotateUA` is false but `userAgent` is an array, crawler will use the first one.
 * `options.referer`: [String](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#String_type) If truthy, sets the HTTP referer header
 * `options.headers`: [Object](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object) Raw key-value pairs of HTTP headers


## Work with Cheerio or JSDOM


Crawler uses [Cheerio](https://github.com/cheeriojs/cheerio) by default instead of [JSDOM](https://github.com/tmpvar/jsdom). JSDOM is more robust; if you want to use JSDOM, you will have to require it (`require('jsdom')`) in your own script before passing it to crawler.

### Working with Cheerio
```js
jQuery: true //(default)
//OR
jQuery: 'cheerio'
//OR
jQuery: {
    name: 'cheerio',
    options: {
        normalizeWhitespace: true,
        xmlMode: true
    }
}
```
These parsing options are taken directly from [htmlparser2](https://github.com/fb55/htmlparser2/wiki/Parser-options), so any option that can be used in `htmlparser2` is valid in cheerio as well. The default options are:

```js
{
    normalizeWhitespace: false,
    xmlMode: false,
    decodeEntities: true
}
```

For a full list of options and their effects, see [this](https://github.com/fb55/DomHandler) and
[htmlparser2's options](https://github.com/fb55/htmlparser2/wiki/Parser-options).
[source](https://github.com/cheeriojs/cheerio#loading)

### Work with JSDOM

In order to work with JSDOM, you will have to install it in your project folder (`npm install jsdom`) and pass it to crawler.

```js
var jsdom = require('jsdom');
var Crawler = require('crawler');

var c = new Crawler({
    jQuery: jsdom
});
```

# How to test

Crawler uses `nock` to mock HTTP requests, so testing no longer relies on an HTTP server.

```bash
$ npm install
$ npm test
$ npm run cover # code coverage
```

## Alternative: Docker

After [installing Docker](http://docs.docker.com/), you can run:

```bash
# Builds the local test environment
$ docker build -t node-crawler .

# Runs tests
$ docker run node-crawler sh -c "npm install && npm test"

# You can also open a shell in the container for easier debugging
$ docker run -i -t node-crawler bash
```


# Rough todolist

 * Introduce zombie to deal with pages that use complex AJAX
 * Refactor the code to be more maintainable
 * Make Sizzle tests pass (JSDOM bug? https://github.com/tmpvar/jsdom/issues#issue/81)
 * Promise support
 * Commander support
 * Middleware support