Crawler Ninja
-------------

This crawler aims to help you build custom solutions for crawling/scraping sites.
For example, it can help to audit a site, find expired domains, build a corpus, scrape texts, find netlinking spots, retrieve site rankings, check if web pages are correctly indexed, ...

This is just a matter of plugins ! :-) We plan to build generic & simple plugins, but you are free to create your own.

The best environment to run Crawler Ninja is a Linux server.


Help & forks welcome ! Or please wait ... work in progress !
How to install
--------------

    $ npm install crawler-ninja --save


Crash course
------------
### How to use an existing plugin?

```javascript
var crawler = require("crawler-ninja");
var logger  = require("crawler-ninja/plugins/log-plugin");

var c = new crawler.Crawler();
var log = new logger.Plugin(c);

c.on("end", function() {

    var end = new Date();
    console.log("End of crawl !, done in : " + (end - start));

});

var start = new Date();
c.queue({url : "http://www.mysite.com/"});
```
This script logs all crawled pages to the console thanks to the log-plugin component.

The Crawler component emits different kinds of events that plugins can use (see below).
When the crawl ends, the 'end' event is emitted.

### Create a new plugin

The following script shows you the event callbacks that you have to implement in order to create a new plugin.

It is not mandatory to implement all crawler events. You can also reduce the scope of the crawl by using the different crawl options (see the section "Option references" below).


```javascript

// useful lib for managing URIs
var URI = require('crawler/lib/uri');


function Plugin(crawler) {

    this.crawler = crawler;

    /**
     * Emitted when the crawler finds an error
     *
     * @param the usual error object
     * @param the result of the request (contains uri, headers, ...)
     */
    this.crawler.on("error", function(error, result) {

    });

    /**
     * Emitted when the crawler crawls a resource (html, js, css, pdf, ...)
     *
     * @param result : the result of the crawled resource
     * @param the jquery-like object for accessing the HTML tags. Null if the resource is not HTML.
     *        See the cheerio project : https://github.com/cheeriojs/cheerio
     */
    this.crawler.on("crawl", function(result, $) {

    });

    /**
     * Emitted when the crawler finds a link in a page
     *
     * @param the page that contains the link
     * @param the link uri
     * @param the anchor text
     * @param true if the link is dofollow
     */
    this.crawler.on("crawlLink", function(page, link, anchor, isDoFollow) {

    });


    /**
     * Emitted when the crawler finds an image
     *
     * @param the page that contains the image
     * @param the image uri
     * @param the alt text
     */
    this.crawler.on("crawlImage", function(page, link, alt) {

    });

    /**
     * Emitted when the crawler finds a 3** redirect
     *
     * @param the from url
     * @param the to url
     * @param statusCode : the exact status code : 301, 302, ...
     */
    this.crawler.on("crawlRedirect", function(from, to, statusCode) {

    });

}

module.exports.Plugin = Plugin;

```


Option references
-----------------


### The main crawler config options

You can pass these options to the Crawler() constructor like :

```javascript

var c = new crawler.Crawler({
    externalLinks : true,
    scripts       : false,
    images        : false
});

```

- maxConnections : the number of connections used to crawl, default is 10.
- externalLinks : if true, crawl external links, default = false.
- externalDomains : if true, crawl the external domains. This option can crawl a lot of different linked domains, default = false.
- externalHosts : if true, crawl the other hosts on the same domain, default = false.
- scripts : if true, crawl script tags, default = true.
- links : if true, crawl link tags, default = true.
- linkTypes : the types of link tags to crawl (matched against the rel attribute), default = ["canonical", "stylesheet"].
- images : if true, crawl images, default = true.
- protocols : list of the protocols to crawl, default = ["http", "https"].
- timeout : timeout per request in milliseconds, default = 20000.
- retries : number of retries if the request times out, default = 3.
- retryTimeout : number of milliseconds to wait before retrying, default = 10000.
- maxErrors : number of timeout errors before forcing the crawler to decrease the crawl rate, default is 5. If the value is -1, there is no check.
- errorRates : an array of crawl rates to apply if there are too many errors, default : [200, 350, 500] (in ms).
- skipDuplicates : if true, skip URIs that were already crawled, default is true.
- rateLimits : number of milliseconds to wait between each request, default = 0.
- depthLimit : the depth limit for the crawl, default is no limit.
- followRedirect : if true, the crawler will not return the 301, it will directly follow the redirection, default is false.
- userAgent : String, defaults to "node-crawler/[version]".
- referer : String, if truthy sets the HTTP referer header.
- domainBlackList : the list of domain names (without tld) to avoid crawling (an array of String). The default list is in the file /default-lists/domain-black-list.js.
- proxyList : the list of proxies to use for each crawler request (see below).


You can also use [mikeal's request options](https://github.com/mikeal/request#requestoptions-callback); they will be passed directly to the request() method.

You can pass these options to the Crawler() constructor if you want them to be global, or as
items in the queue() calls if you want them to be specific to that item (overwriting the global options).
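
As a minimal sketch (assuming per-item options are simply added to the same object as the url passed to queue(), as described above), global and per-item options can be combined like this :

```javascript
var crawler = require("crawler-ninja");

// Global options : apply to every queued url
var c = new crawler.Crawler({
    maxConnections : 5,
    images         : false
});

// Per-item options : override the global ones for this item only
c.queue({
    url     : "http://www.mysite.com/",
    scripts : false,
    timeout : 30000
});
```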



### Add your own crawl rules

If the predefined options are not sufficient, you can customize which kinds of links to crawl by implementing a callback function in the crawler config object. This is a nice way to limit the crawl scope according to your needs. The following example crawls only dofollow links.

```javascript

var c = new crawler.Crawler({
    // add here the predefined options you want to override

    /**
     * This callback is called for each link found in an html page
     * @param : the uri of the page that contains the link
     * @param : the uri of the link to check
     * @param : the anchor text of the link
     * @param : true if the link is dofollow
     * @return : true if the crawler can crawl the link on this html page
     */
    canCrawl : function(htmlPage, link, anchor, isDoFollow) {
        return isDoFollow;
    }

});

```


Using proxies
-------------

Crawler Ninja can be configured to execute each http request through a proxy.
It uses the npm package [simple-proxies](https://github.com/christophebe/simple-proxies).

You have to install it in your project with the command :

    $ npm install simple-proxies --save


Here is a code sample that uses proxies loaded from a file :

```javascript
var proxyLoader = require("simple-proxies/lib/proxyfileloader");
var crawler     = require("crawler-ninja");
var logger      = require("crawler-ninja/plugins/log-plugin");


var proxyFile = "proxies.txt";

// Load proxies
var config = proxyLoader.config()
                        .setProxyFile(proxyFile)
                        .setCheckProxies(false)
                        .setRemoveInvalidProxies(false);

proxyLoader.loadProxyFile(config, function(error, proxyList) {
    if (error) {
        console.log(error);
    }
    else {
        crawl(proxyList);
    }
});


function crawl(proxyList) {
    var c = new crawler.Crawler({
        externalLinks  : true,
        images         : false,
        scripts        : false,
        links          : false, // link tags used for css, canonical, ...
        followRedirect : true,
        proxyList      : proxyList
    });

    var log = new logger.Plugin(c);

    c.on("end", function() {

        var end = new Date();
        console.log("Well done Sir !, done in : " + (end - start));

    });

    var start = new Date();
    c.queue({url : "http://www.site.com"});
}

```

Using the crawl logger in your own plugin
-----------------------------------------

The current crawl logger is based on [Bunyan](https://github.com/trentm/node-bunyan).
- It logs all crawl actions & errors in the file ./logs/crawler.log.
- The list of errors can be found in ./logs/errors.log.

You can query the log files after the crawl (see the Bunyan doc for more information) in order to filter errors or other info.

You can also use the current logger or create a new one in your own plugin.

*Use the default logger*


```javascript

var log = require("crawler-ninja/lib/logger.js").Logger;

log.info("log info");   // Log into crawler.log
log.debug("log debug"); // Log into crawler.log
log.error("log error"); // Log into crawler.log & errors.log
log.info({statusCode : 200, url : "http://www.google.com"}); // Log a json object
```

*Create a new logger for your plugin*

```javascript
var log = require("crawler-ninja/lib/logger.js");

// Log into ./logs/myplugin.log
var myLog = log.createLogger("myLoggerName", "./logs/myplugin.log");

myLog.log({url : "http://www.google.com", pageRank : 10});

```

Please feel free to read the code in the log-plugin to get more info on how to log from your own plugin.
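
As an illustration only, here is a minimal sketch of a plugin that logs each crawled resource with the default logger. It just combines the plugin skeleton and the logger calls shown above; the result.uri field comes from the event documentation above, and the logged field names are arbitrary.

```javascript
var log = require("crawler-ninja/lib/logger.js").Logger;

function Plugin(crawler) {

    this.crawler = crawler;

    // Log the uri of each crawled resource into crawler.log
    this.crawler.on("crawl", function(result, $) {
        log.info({crawled : result.uri});
    });

    // Log errors into crawler.log & errors.log
    this.crawler.on("error", function(error, result) {
        log.error({error : error.message, uri : result.uri});
    });
}

module.exports.Plugin = Plugin;
```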

More features & flexibility will be added in the upcoming releases.


Control the crawl rate
----------------------
Not all sites can support an intensive crawl. This crawler provides 2 solutions to control the crawl rate :
- implicit : the crawler decreases the crawl rate if there are too many timeouts on a host. The crawl rate is controlled for each crawled host separately.
- explicit : you can specify the crawl rate in the crawler config. This setting is the same for all hosts.


**Implicit setting**

Without changing the crawler config, the crawler will decrease the crawl rate after 5 timeout errors on a host. It will force a rate of 200ms between each request. If 5 new timeout errors still occur, it will use a rate of 350ms and after that a rate of 500ms between all requests for this host. If the timeouts persist, the crawler will cancel the crawl on that host.

You can change the default values for this implicit setting (5 timeout errors & rates = 200, 350, 500ms). Here is an example :

```javascript
var crawler = require("crawler-ninja");
var logger  = require("crawler-ninja/plugins/log-plugin");

var c = new crawler.Crawler({
    // new values for the implicit setting
    maxErrors  : 5,
    errorRates : [300, 600, 900]
});

var log = new logger.Plugin(c);

c.on("end", function() {

    var end = new Date();
    console.log("End of crawl !, done in : " + (end - start));

});

var start = new Date();
c.queue({url : "http://www.mysite.com/"});
```
Note that a higher value for maxErrors can decrease the number of analyzed pages. You can assign the value -1 to maxErrors in order to deactivate the implicit setting.
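
For example (a minimal sketch based only on the maxErrors option described above), the implicit setting can be deactivated like this :

```javascript
var crawler = require("crawler-ninja");

// maxErrors = -1 : never force a lower crawl rate, whatever the number of timeouts
var c = new crawler.Crawler({
    maxErrors : -1
});
```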

**Explicit setting**

In this configuration, you apply the same crawl rate to all requests on all hosts.

```javascript
var crawler = require("crawler-ninja");
var logger  = require("crawler-ninja/plugins/log-plugin");

var c = new crawler.Crawler({
    rateLimits : 200 // 200ms between each request
});

var log = new logger.Plugin(c);

c.on("end", function() {

    var end = new Date();
    console.log("End of crawl !, done in : " + (end - start));

});

var start = new Date();
c.queue({url : "http://www.mysite.com/"});
```


If both settings are applied to one crawl, the implicit setting will be enforced by the crawler once "maxErrors" is reached.
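
In other words (a sketch combining the options above; the values are arbitrary), you can set a base rate explicitly and still let the crawler slow down further after too many timeouts :

```javascript
var crawler = require("crawler-ninja");

var c = new crawler.Crawler({
    rateLimits : 200,             // explicit : 200ms between each request
    maxErrors  : 5,               // implicit : after 5 timeouts on a host ...
    errorRates : [400, 600, 800]  // ... these rates are forced for that host
});
```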

Current Plugins
---------------

- Log
- Stat
- Audit


Rough todolist
--------------

 * More & more plugins (in progress)
 * Use Riak as the default persistence layer/Crawler Store
 * Multicore architecture and/or micro service architecture for plugins that require a lot of CPU usage
 * CLI for extracting data from the Crawl DB
 * Build UI : dashboards, view project data, ...


ChangeLog
---------

0.1.0
 - Crawler engine that supports navigation through a.href, detects images, link tags & scripts.
 - Add flexible crawl parameters (see the crawl options section above) like the crawl depth, crawl rates, crawling external links, ...
 - Implement a basic log plugin & an SEO audit plugin.
 - Unit tests.

0.1.1
 - Add proxy support.
 - Give the possibility to crawl (or not) the external domains, which is different from crawling only the external links. Crawling external links means checking the http status & content of the linked external resources, which is different from expanding the crawl through the entire external domains.

0.1.2
 - Review the Log component.
 - Set the default userAgent to NinjaBot.
 - Update README.

0.1.3
 - Avoid crashes on long crawls.

0.1.4
 - Code refactoring in order to attempt a multicore process for making http requests.

0.1.5
 - Remove the multicore support for making http requests due to the significant overhead. Plan to use multicore for some CPU-intensive plugins.
 - Refactor the rate limit and the http request retries in case of errors.

0.1.6
 - Review the logger : use winston, with different log files : the full crawl, errors and urls. Give the possibility to create a specific logger for a plugin.

0.1.7
 - Too many issues with winston, use Bunyan for the logs.
 - Refactor how to set the urls in the crawl options : a simple url, an array of urls or an array of json option objects.
 - Review the doc aka README.
 - Review how to manage the timeouts depending on the site to crawl. If there are too many timeouts for one domain, the crawler will change the settings in order to decrease request concurrency. If errors persist, the crawler will stop crawling this domain.
 - Add support for a blacklist of domains.

0.1.8
 - Add options to limit the crawl to one host or one entire domain.
 - Add an option to follow robots.txt rules.