Crawler Ninja
-------------

This crawler aims to build custom solutions for crawling/scraping sites.
For example, it can help to audit a site, find expired domains, build corpora, scrape texts, find netlinking spots, retrieve site rankings, check if web pages are correctly indexed, ...

This is just a matter of plugins! :-) We plan to build generic & simple plugins, but you are free to create your own.

The best environment to run Crawler Ninja is a Linux server.


Help & forks welcome! Or please wait ... work in progress!

How to install
--------------

    $ npm install crawler-ninja --save


Crash course
------------

### How to use an existing plugin?

```javascript
var crawler = require("crawler-ninja");
var cs = require("crawler-ninja/plugins/console-plugin");

var c = new crawler.Crawler();
var consolePlugin = new cs.Plugin();
c.registerPlugin(consolePlugin);

c.on("end", function() {
    console.log("End of crawl!");
});

c.queue({url : "http://www.mysite.com/"});
```
This script logs all crawled pages to the console, thanks to the console-plugin component.

The crawler calls the plugin functions depending on the kind of resource being crawled (HTML pages, CSS, scripts, links, redirections, ...).
When the crawl ends, the 'end' event is emitted.

### Create a new plugin

The following script shows the event callbacks that you have to implement to create a new plugin.

It is not mandatory to implement all plugin functions. You can also reduce the scope of the crawl by using the different crawl options (see the section "Option references" below).


```javascript
function Plugin() {

}

/**
 * Triggered when an HTTP error occurs for a request made by the crawler
 *
 * @param error : the HTTP error
 * @param result : the HTTP resource object (contains the uri of the resource)
 * @param callback(error)
 */
Plugin.prototype.error = function(error, result, callback) {

};

/**
 * Triggered when an HTML resource is crawled
 *
 * @param result : the result of the resource crawl
 * @param $ : the jQuery-like object used to access the HTML tags. Null if the resource
 *        is not an HTML page
 * @param callback(error)
 */
Plugin.prototype.crawl = function(result, $, callback) {

};

/**
 * Triggered when the crawler finds a link on a page
 *
 * @param page : the url of the page that contains the link
 * @param link : the link found in the page
 * @param anchor : the link anchor text
 * @param isDoFollow : true if the link is dofollow
 * @param callback(error)
 */
Plugin.prototype.crawlLink = function(page, link, anchor, isDoFollow, callback) {

};

/**
 * Triggered when the crawler finds an image on a page
 *
 * @param page : the url of the page that contains the image
 * @param link : the image link found in the page
 * @param alt : the image alt text
 * @param callback(error)
 */
Plugin.prototype.crawlImage = function(page, link, alt, callback) {

};

/**
 * Triggered when the crawler finds an HTTP redirect
 *
 * @param from : the from url
 * @param to : the to url
 * @param statusCode : the redirect code (301, 302, ...)
 * @param callback(error)
 */
Plugin.prototype.crawlRedirect = function(from, to, statusCode, callback) {

};

/**
 * Triggered when a link is not crawled (depending on the crawler settings)
 *
 * @param page : the url of the page that contains the link
 * @param link : the link found in the page
 * @param anchor : the link anchor text
 * @param isDoFollow : true if the link is dofollow
 * @param callback(error)
 */
Plugin.prototype.unCrawl = function(page, link, anchor, isDoFollow, endCallback) {

};

module.exports.Plugin = Plugin;
```

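For illustration, here is a minimal sketch of a concrete plugin built from the callbacks above: it counts crawled HTML pages and remembers the resources that ended in an HTTP error. It assumes that each callback has to invoke its `callback` so that the crawl can continue; the `StatPlugin` name and the counters are only illustrative.

```javascript
var crawler = require("crawler-ninja");

// Illustrative plugin : count crawled HTML pages and remember failed resources
var crawledPages = 0;
var failedResources = [];

function StatPlugin() {
}

StatPlugin.prototype.crawl = function(result, $, callback) {
    if ($) {              // $ is null when the resource is not an HTML page
        crawledPages++;
    }
    callback();           // invoke the callback so that the crawl can continue
};

StatPlugin.prototype.error = function(error, result, callback) {
    failedResources.push(result); // the resource object contains the uri of the resource
    callback();
};

var c = new crawler.Crawler();
c.registerPlugin(new StatPlugin());

c.on("end", function() {
    console.log(crawledPages + " pages crawled, " + failedResources.length + " resources in error");
});

c.queue({url : "http://www.mysite.com/"});
```

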
Option references
-----------------


### The main crawler config options

You can pass these options to the Crawler() constructor, for example:

```javascript
var c = new crawler.Crawler({
    externalDomains : true,
    scripts : false,
    images : false
});
```
- skipDuplicates : if true, skip URIs that were already crawled, default is true.
- userAgent : String, defaults to "NinjaBot".
- maxConnections : the number of connections used to crawl, default is 30.
- externalDomains : if true, crawl the external domains. This option can crawl a lot of different linked domains, default = false.
- externalHosts : if true, crawl the other hosts on the same domain, default = false.
- firstExternalLinkOnly : crawl only the first link found for external domains/hosts. externalHosts and/or externalDomains should be = true.
- scripts : if true, crawl script tags, default = true.
- links : if true, crawl link tags, default = true.
- linkTypes : the types of link tags to crawl (matched against the rel attribute), default = ["canonical", "stylesheet"].
- images : if true, crawl images, default = true.
- depthLimit : the depth limit for the crawl, default is no limit.
- protocols : list of the protocols to crawl, default = ["http", "https"].
- timeout : max timeout per request in milliseconds, default = 20000.
- retries : number of retries if the request times out, default = 3.
- retryTimeout : number of milliseconds to wait before retrying a request that timed out, default = 10000.
- maxErrors : number of timeout errors before forcing the crawler to decrease the crawl rate, default is 5. If the value is -1, there is no check.
- errorRates : an array of crawl rates (in ms) to apply when there are too many errors on one host, default : [200, 350, 500].
- rateLimits : number of milliseconds to wait between each request, default = 0.
- followRedirect : if true, the crawl will not return the 301, it will directly follow the redirection, default is false.
- referer : String, if truthy sets the HTTP referer header.
- domainBlackList : the list of domain names (without tld) not to crawl (an array of Strings). The default list is in the file /default-lists/domain-black-list.js.
- suffixBlackList : the list of URL suffixes not to crawl (an array of Strings). The default list is in the file /default-lists/domain-black-list.js.
- proxyList : the list of proxies to use for each crawler request (see below).


You can also use [mikeal's request options](https://github.com/mikeal/request#requestoptions-callback); they will be passed directly to the request() method.

You can pass these options to the Crawler() constructor if you want them to be global, or as items in the queue() calls if you want them to be specific to that item (overwriting the global options).
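
For example (a sketch using only option names from the list above), you could disable images and scripts globally and override the settings for one specific queued url:

```javascript
var crawler = require("crawler-ninja");

// Global options : apply to every queued url
var c = new crawler.Crawler({
    images  : false,
    scripts : false
});

// Per-item options : override the global settings for this url only
c.queue({
    url        : "http://www.mysite.com/",
    images     : true,
    depthLimit : 2
});
```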



### Add your own crawl rules

If the predefined options are not sufficient, you can customize which kinds of links to crawl by implementing a callback function in the crawler config object. This is a nice way to limit the crawl scope to your needs. The following example crawls only dofollow links.


```javascript
var c = new crawler.Crawler({
    // add here the predefined options you want to override

    /**
     * This callback is called for each link found in an html page
     * @param htmlPage : the uri of the page that contains the link
     * @param link : the uri of the link to check
     * @param anchor : the anchor text of the link
     * @param isDoFollow : true if the link is dofollow
     * @return true if the crawler can crawl the link on this html page
     */
    canCrawl : function(htmlPage, link, anchor, isDoFollow) {
        return isDoFollow;
    }

});
```


Using proxies
-------------

Crawler Ninja can be configured to execute each http request through a proxy.
It uses the npm package [simple-proxies](https://github.com/christophebe/simple-proxies).

You have to install it in your project with the command:

    $ npm install simple-proxies --save


Here is a code sample that uses proxies loaded from a file:

```javascript
var proxyLoader = require("simple-proxies/lib/proxyfileloader");
var crawler = require("crawler-ninja");

var proxyFile = "proxies.txt";

// Load the proxies
var config = proxyLoader.config()
                        .setProxyFile(proxyFile)
                        .setCheckProxies(false)
                        .setRemoveInvalidProxies(false);

proxyLoader.loadProxyFile(config, function(error, proxyList) {
    if (error) {
        console.log(error);
    }
    else {
        crawl(proxyList);
    }
});

function crawl(proxyList) {
    var c = new crawler.Crawler({
        proxyList : proxyList
    });

    // Register the desired plugins here

    c.on("end", function() {
        var end = new Date();
        console.log("Well done Sir!, done in : " + (end - start));
    });

    var start = new Date();
    c.queue({url : "http://www.site.com"});
}
```

Using the crawl logger in your own plugin
-----------------------------------------

The current crawl logger is based on [Bunyan](https://github.com/trentm/node-bunyan).
It logs all crawl actions & errors in the file ./logs/crawler.log, and a more detailed log into debug.log.

You can query the log files after the crawl (see the Bunyan doc for more information) in order to filter errors or other info.
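
For example, assuming the Bunyan CLI is installed in your environment (npm install -g bunyan), you can pretty-print the crawl log and keep only the error entries with something like:

```
$ bunyan -l error ./logs/crawler.log
```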

You can also use the current logger module in your own plugin.

*Use the default loggers*

You have to install the logger module into your own project:

```
npm install crawler-ninja-logger --save
```

Then, in your own plugin code:

```javascript
var log = require("crawler-ninja-logger").Logger;

log.info("log info");   // Log into crawler.log
log.debug("log debug"); // Log into crawler.log
log.error("log error"); // Log into crawler.log & errors.log
log.info({statusCode : 200, url : "http://www.google.com"}); // log a json object
```

The crawler logs with the following structure:

```
log.info({"url" : "url", "step" : "step", "message" : "message", "options" : "options"});
```

*Create a new logger for your plugin*

```javascript
// Log into ./log-file-name.log
var log = require("crawler-ninja-logger");
var myLog = log.createLogger("myLoggerName", {path : "./log-file-name.log"});

myLog.info({url : "http://www.google.com", pageRank : 10});
```

Please feel free to read the code in log-plugin to get more info on how to log from your own plugin.

More features & flexibility will be added in the upcoming releases.


Control the crawl rate
----------------------

Not all sites can support an intensive crawl. This crawler provides 2 solutions to control the crawl rate:
- implicit : the crawler decreases the crawl rate if there are too many timeouts on a host. The crawl rate is controlled for each crawled host separately.
- explicit : you can specify the crawl rate in the crawler config. This setting is the same for all hosts.


**Implicit setting**

Without changing the crawler config, the crawler will decrease the crawl rate after 5 timeout errors on a host. It will force a rate of 200ms between requests. If 5 new timeout errors still occur, it will use a rate of 350ms, and after that a rate of 500ms between all requests for this host. If the timeouts persist, the crawler will cancel the crawl on that host.

You can change the default values for this implicit setting (5 timeout errors & rates = 200, 350, 500ms). Here is an example:

```javascript
var crawler = require("crawler-ninja");

var c = new crawler.Crawler({
    // new values for the implicit setting
    maxErrors : 5,
    errorRates : [300, 600, 900]
});

// Register your plugins here

c.on("end", function() {
    var end = new Date();
    console.log("End of crawl!, done in : " + (end - start));
});

var start = new Date();
c.queue({url : "http://www.mysite.com/"});
```

Note that a higher value for maxErrors can decrease the number of analyzed pages. You can assign the value -1 to maxErrors in order to deactivate the implicit setting.

**Explicit setting**

In this configuration, the same crawl rate is applied to all requests on all hosts (even for successful requests).

```javascript
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");

var c = new crawler.Crawler({
    rateLimits : 200 // 200ms between each request
});

var log = new logger.Plugin(c);

c.on("end", function() {
    var end = new Date();
    console.log("End of crawl!, done in : " + (end - start));
});

var start = new Date();
c.queue({url : "http://www.mysite.com/"});
```


If both settings are applied to one crawl, the implicit setting will be enforced by the crawler after "maxErrors" timeouts.

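As a sketch (using only option names from the reference above), a config combining both settings could look like this: the explicit rate applies to every request, and the implicit rates take over on a host once maxErrors timeouts occur.

```javascript
var crawler = require("crawler-ninja");

var c = new crawler.Crawler({
    rateLimits : 200,             // explicit : 200ms between each request, on every host
    maxErrors  : 5,               // implicit : after 5 timeouts on one host ...
    errorRates : [300, 600, 900]  // ... slow that host down to 300, then 600, then 900ms
});
```
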
Current Plugins
---------------

- Log
- Stat
- Audit


Rough todolist
--------------

 * More & more plugins (in progress)
 * Use Riak as the default persistence layer/Crawler Store
 * Multicore architecture and/or micro-service architecture for plugins that require a lot of CPU
 * CLI for extracting data from the Crawl DB
 * Build a UI : dashboards, view project data, ...


ChangeLog
---------

0.1.0
 - Crawler engine that supports navigation through a.href, and detects images, link tags & scripts.
 - Add flexible crawl parameters (see the section "Option references" above) like the crawl depth, crawl rates, crawling external links, ...
 - Implement a basic log plugin & an SEO audit plugin.
 - Unit tests.

0.1.1
 - Add proxy support.
 - Give the possibility to crawl (or not) external domains, which is different from crawling only the external links. Crawling external links means checking the http status & content of the linked external resources, which is different from expanding the crawl through the entire external domains.

0.1.2
 - Review the Log component.
 - Set the default userAgent to NinjaBot.
 - Update the README.

0.1.3
 - Avoid a crash on long crawls.

0.1.4
 - Code refactoring in order to attempt a multicore process for making http requests.

0.1.5
 - Remove the multicore support for making http requests due to its important overhead. Plan to use multicore for some CPU-intensive plugins.
 - Refactor the rate limit and the http request retries in case of errors.

0.1.6
 - Review the logger : use winston, with different log files : the full crawl, the errors and the urls. Give the possibility to create a specific logger for a plugin.

0.1.7
 - Too many issues with winston, use Bunyan for the logs.
 - Refactor how to set the urls in the crawl options : a simple url, an array of urls or of json option objects.
 - Review the doc aka README.
 - Review how to manage the timeouts depending on the site to crawl. If there are too many timeouts for one domain, the crawler will change the settings in order to decrease request concurrency. If errors persist, the crawler will stop crawling this domain.
 - Add support for a blacklist of domains.

0.1.8

- Add options to limit the crawl to one host or one entire domain.

0.1.9

- Bug fix : the newest Bunyan version doesn't create the log dir.
- Manage more request error types (with or without retries).
- Add a suffix black list in order to exclude urls with a specific suffix (extension) like .pdf, .docx, ...

0.1.10
- Use callbacks instead of events for the plugin management.

0.1.11
- Externalize the log mechanism into the module crawler-ninja-logger.

0.1.12
- Review the black lists (domains & suffixes).
- Review the README.
- Bug fixes.
- Add an empty plugin sample. See the js file : /plugins/empty-plugin.js