Crawler Ninja
-------------

This crawler aims to build custom solutions for crawling/scraping sites.
For example, it can help to audit a site, find expired domains, build a corpus, scrape text, find netlinking spots, retrieve site rankings, check if web pages are correctly indexed, ...

This is just a matter of plugins! :-) We plan to build generic & simple plugins, but you are free to create your own.

The best environment to run Crawler Ninja is a Linux server.


Help & forks welcome! Or please wait ... work in progress!
How to install
--------------

    $ npm install crawler-ninja --save

Crash course
------------

### How to use an existing plugin?
```javascript
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");

var c = new crawler.Crawler();
var log = new logger.Plugin(c);

c.on("end", function() {

    var end = new Date();
    console.log("End of crawl! Done in " + (end - start) + " ms");

});

var start = new Date();
c.queue({url : "http://www.mysite.com/"});
```
This script logs to the console all the crawled pages, thanks to the log-plugin component.

The Crawler component emits different kinds of events that plugins can use (see below).
When the crawl ends, the 'end' event is emitted.
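
You can also listen to an event directly, without writing a plugin. Here is a minimal sketch reusing the crawler `c` from the script above; it assumes only the `uri` field of the result object described in the next section:

```javascript
// print the uri of each crawled resource
// (result.uri is assumed from the event description below)
c.on("crawl", function(result, $) {
    console.log("Crawled : " + result.uri);
});
```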

### Create a new plugin

The following script shows the event callbacks that you have to implement to create a new plugin.

It is not mandatory to implement all the crawler events. You can also reduce the scope of the crawl by using the different crawl options (see the section "Option references" below).

```javascript
// useful lib for managing URIs
var URI = require('crawler/lib/uri');

function Plugin(crawler) {

    this.crawler = crawler;

    /**
     * Emitted when the crawler hits an error
     *
     * @param error : the usual error object
     * @param result : the result of the request (contains uri, headers, ...)
     */
    this.crawler.on("error", function(error, result) {

    });

    /**
     * Emitted when the crawler crawls a resource (html, js, css, pdf, ...)
     *
     * @param result : the result of the crawled resource
     * @param $ : the jQuery-like object for accessing the HTML tags. Null if the resource is not an HTML page.
     *            See the cheerio project : https://github.com/cheeriojs/cheerio
     */
    this.crawler.on("crawl", function(result, $) {

    });

    /**
     * Emitted when the crawler finds a link in a page
     *
     * @param page : the page that contains the link
     * @param link : the link uri
     * @param anchor : the anchor text
     * @param isDoFollow : true if the link is dofollow
     */
    this.crawler.on("crawlLink", function(page, link, anchor, isDoFollow) {

    });

    /**
     * Emitted when the crawler finds an image
     *
     * @param page : the page that contains the image
     * @param link : the image uri
     * @param alt : the alt text
     */
    this.crawler.on("crawlImage", function(page, link, alt) {

    });

    /**
     * Emitted when the crawler finds a 3** redirect
     *
     * @param from : the from url
     * @param to : the to url
     * @param statusCode : the exact status code : 301, 302, ...
     */
    this.crawler.on("crawlRedirect", function(from, to, statusCode) {

    });

}

module.exports.Plugin = Plugin;
```
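
To make this concrete, here is a minimal sketch of a complete plugin built on this skeleton. The file name count-plugin.js and the counting logic are hypothetical; only the Plugin/events pattern comes from the crawler:

```javascript
// count-plugin.js : minimal sketch of a custom plugin (hypothetical example)
function Plugin(crawler) {

    this.crawler = crawler;
    var count = 0;

    // increment the counter for each crawled resource
    this.crawler.on("crawl", function(result, $) {
        count++;
    });

    // print the total when the crawl ends
    this.crawler.on("end", function() {
        console.log("Crawled " + count + " resources");
    });
}

module.exports.Plugin = Plugin;
```

You attach it to a crawler exactly like the log-plugin:

```javascript
var countPlugin = require("./count-plugin");
var counter = new countPlugin.Plugin(c);
```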


Option references
-----------------


### The main crawler config options

You can pass these options to the Crawler() constructor, like this:

```javascript
var c = new crawler.Crawler({
    externalLinks : true,
    scripts : false,
    images : false
});
```

- maxConnections : the number of connections used to crawl, default is 10.
- externalLinks : if true, crawl external links, default = false.
- externalDomains : if true, crawl the external domains. Beware, this option can lead the crawler into a lot of different linked domains.
- scripts : if true, crawl script tags, default = true.
- links : if true, crawl link tags, default = true.
- linkTypes : the types of link tags to crawl (matched against the rel attribute), default = ["canonical", "stylesheet"].
- images : if true, crawl images, default = true.
- protocols : list of the protocols to crawl, default = ["http", "https"].
- timeout : timeout per request in milliseconds, default = 20000.
- retries : number of retries if the request times out, default = 3.
- retryTimeout : number of milliseconds to wait before retrying, default = 10000.
- maxErrors : number of timeout errors before forcing the crawler to decrease the crawl rate, default is 5. If the value is -1, there is no check.
- errorRates : an array of crawl rates (in ms) to apply when too many errors occur, default : [200, 350, 500].
- skipDuplicates : if true, skip URIs that were already crawled, default is true.
- rateLimits : number of milliseconds to wait between each request, default = 0.
- depthLimit : the depth limit for the crawl, default is no limit.
- followRedirect : if true, the crawler will not return the 301; it will directly follow the redirect. Default is false.
- userAgent : String, defaults to "node-crawler/[version]".
- referer : String, if truthy sets the HTTP referer header.
- domainBlackList : the list of domain names (without tld) to avoid crawling (an array of Strings). The default list is in the file /default-lists/domain-black-list.js.
- proxyList : the list of proxies to use for each crawler request (see below).


You can also use [mikeal's request options](https://github.com/mikeal/request#requestoptions-callback); they will be passed directly to the request() method.

You can pass these options to the Crawler() constructor if you want them to be global, or as items in the queue() calls if you want them to be specific to that item (overriding the global options), as sketched below.
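
Here is a minimal sketch of both styles; the urls and values are placeholders:

```javascript
var crawler = require("crawler-ninja");

// global options : apply to every queued url
var c = new crawler.Crawler({
    timeout : 20000,
    retries : 3
});

// per-item option : overrides the global timeout for this url only
c.queue({url : "http://www.mysite.com/slow-page.html", timeout : 60000});
```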


### Add your own crawl rules

If the predefined options are not sufficient, you can customize which kinds of links to crawl by implementing a callback function in the crawler config object. This is a nice way to limit the crawl scope depending on your needs. The following example crawls only dofollow links.

```javascript
var c = new crawler.Crawler({
    // add here the predefined options you want to override

    /**
     * this callback is called for each link found in an html page
     * @param htmlPage : the uri of the page that contains the link
     * @param link : the uri of the link to check
     * @param anchor : the anchor text of the link
     * @param isDoFollow : true if the link is dofollow
     * @return true if the crawler can crawl the link on this html page
     */
    canCrawl : function(htmlPage, link, anchor, isDoFollow) {
        return isDoFollow;
    }

});
```

Using proxies
-------------

Crawler Ninja can be configured to execute each http request through a proxy.
It uses the npm package [simple-proxies](https://github.com/christophebe/simple-proxies).

You have to install it in your project with the command:

    $ npm install simple-proxies --save

Here is a code sample that uses proxies loaded from a file:

```javascript
var proxyLoader = require("simple-proxies/lib/proxyfileloader");
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");

var proxyFile = "proxies.txt";

// Load the proxies
var config = proxyLoader.config()
                .setProxyFile(proxyFile)
                .setCheckProxies(false)
                .setRemoveInvalidProxies(false);

proxyLoader.loadProxyFile(config, function(error, proxyList) {
    if (error) {
        console.log(error);
    }
    else {
        crawl(proxyList);
    }
});

function crawl(proxyList) {
    var c = new crawler.Crawler({
        externalLinks : true,
        images : false,
        scripts : false,
        links : false, // link tags used for css, canonical, ...
        followRedirect : true,
        proxyList : proxyList
    });

    var log = new logger.Plugin(c);

    c.on("end", function() {

        var end = new Date();
        console.log("Well done, Sir! Done in " + (end - start) + " ms");

    });

    var start = new Date();
    c.queue({url : "http://www.site.com"});
}
```

Using the crawl logger in your own plugin
-----------------------------------------

The current crawl logger is based on [Bunyan](https://github.com/trentm/node-bunyan).
- It logs all the crawl actions & errors into the file ./logs/crawler.log.
- The list of errors can be found in ./logs/errors.log.

You can query the log files after the crawl (see the Bunyan doc for more information) in order to filter errors or other info, for example with the Bunyan CLI as shown below.
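
For instance, assuming the Bunyan CLI is installed globally (npm install -g bunyan), a command like this should pretty-print only the error-level records:

    $ bunyan ./logs/crawler.log -l error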

You can also use the current logger or create a new one in your own plugin.

*Use the default loggers*

```javascript
var log = require("crawler-ninja/lib/logger.js").Logger;

log.info("log info");   // Log into crawler.log
log.debug("log debug"); // Log into crawler.log
log.error("log error"); // Log into crawler.log & errors.log
log.info({statusCode : 200, url : "http://www.google.com"}); // Log a json object
```

*Create a new logger for your plugin*

```javascript
var log = require("crawler-ninja/lib/logger.js");

// Log into ./logs/myplugin.log
var myLog = log.createLogger("myLoggerName", "./logs/myplugin.log");

myLog.info({url : "http://www.google.com", pageRank : 10});
```

Please feel free to read the code of the log-plugin to get more info on how to log from your own plugin.

More features & flexibility will be added in the upcoming releases.


Control the crawl rate
----------------------
Not all sites can support an intensive crawl. This crawler provides 2 solutions to control the crawl rate:
- implicit : the crawler decreases the crawl rate if there are too many timeouts on a host. The crawl rate is controlled for each crawled host separately.
- explicit : you can specify the crawl rate in the crawler config. This setting is the same for all hosts.


**Implicit setting**

Without changing the crawler config, the crawler will decrease the crawl rate after 5 timeout errors on a host. It will force a rate of 200ms between each request. If 5 new timeout errors still occur, it will use a rate of 350ms, and after that a rate of 500ms between all requests for this host. If the timeouts persist, the crawler will cancel the crawl on that host.

You can change the default values of this implicit setting (5 timeout errors & rates = 200, 350, 500ms). Here is an example:
```javascript
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");

var c = new crawler.Crawler({
    // new values for the implicit setting
    maxErrors : 5,
    errorRates : [300, 600, 900]
});

var log = new logger.Plugin(c);

c.on("end", function() {

    var end = new Date();
    console.log("End of crawl! Done in " + (end - start) + " ms");

});

var start = new Date();
c.queue({url : "http://www.mysite.com/"});
```
Note that a higher value for maxErrors can decrease the number of analyzed pages. You can assign the value -1 to maxErrors in order to deactivate the implicit setting.

**Explicit setting**

In this configuration, you apply the same crawl rate to all requests on all hosts.

```javascript
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");

var c = new crawler.Crawler({
    rateLimits : 200 // 200ms between each request
});

var log = new logger.Plugin(c);

c.on("end", function() {

    var end = new Date();
    console.log("End of crawl! Done in " + (end - start) + " ms");

});

var start = new Date();
c.queue({url : "http://www.mysite.com/"});
```

If both settings are applied to one crawl, the implicit setting will be forced by the crawler once "maxErrors" is reached.

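Here is a minimal sketch of such a combined config; the values are illustrative only:

```javascript
var c = new crawler.Crawler({
    // explicit : 200ms between each request on every host
    rateLimits : 200,

    // implicit : after 5 timeout errors on a host, apply these slower rates
    maxErrors : 5,
    errorRates : [400, 700, 1000]
});
```
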
Current Plugins
---------------

- Log
- Stat
- Audit

Rough todolist
--------------

 * More & more plugins (in progress)
 * Use Riak as the default persistence layer/Crawler Store
 * Multicore architecture and/or micro service architecture for plugins that require a lot of CPU
 * CLI for extracting data from the Crawl DB
 * Build UI : dashboards, view project data, ...


ChangeLog
---------

0.1.0
 - Crawler engine that supports navigation through a.href, detects images, link tags & scripts.
 - Add flexible crawl parameters (see the crawl options section above) like the crawl depth, crawl rates, crawling external links, ...
 - Implement a basic log plugin & an SEO audit plugin.
 - Unit tests.

0.1.1
 - Add proxy support.
 - Give the possibility to crawl (or not) the external domains, which is different from crawling only the external links. Crawling external links means checking the http status & content of the linked external resources, which is different from expanding the crawl through the entire external domains.

0.1.2
 - Review the Log component.
 - Set the default userAgent to NinjaBot.
 - Update the README.

0.1.3
 - Avoid crashes on long crawls.

0.1.4
 - Code refactoring in order to attempt a multicore process for making http requests.

0.1.5
 - Remove the multicore support for making http requests due to the important overhead. Plan to use multicore for some CPU-intensive plugins.
 - Refactor the rate limits and the http request retries in case of errors.

0.1.6
 - Review the logger : use winston, with different log files : the full crawl, the errors and the urls. Give the possibility to create a specific logger for a plugin.

0.1.7
 - Too many issues with winston; use Bunyan for the logs.
 - Refactor how to set the urls in the crawl options : a simple url, an array of urls or an array of json option objects.
 - Review the doc aka README.
 - Review how to manage the timeouts depending on the site to crawl. If there are too many timeouts for one domain, the crawler will change the settings in order to decrease the request concurrency. If the errors persist, the crawler will stop crawling this domain.
 - Add support for a blacklist of domains.