node-webcrawler ChangeLog
-------------------------

1.2.1
 * [#310](https://github.com/bda-research/node-crawler/issues/310) Upgrade dependencies' versions (@mike442144)
 * [#303](https://github.com/bda-research/node-crawler/issues/303) Update seenreq to v3 (@mike442144)
 * [#304](https://github.com/bda-research/node-crawler/pull/304) Replace istanbul with nyc (@kossidts)
 * [#300](https://github.com/bda-research/node-crawler/pull/300) Add `formData` arg to requestArgs (@humandevmode); see the sketch below
 * [#280](https://github.com/bda-research/node-crawler/pull/280) Update tests with `nock` (@Dong-Gao)

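Since #300 forwards a `formData` argument to the underlying `request` call, a multipart POST can be queued directly. A minimal sketch, assuming a hypothetical upload endpoint:

```js
const fs = require('fs');
const Crawler = require('crawler');

const c = new Crawler({
    jQuery: false, // the response is not HTML, so skip DOM parsing
    callback: (error, res, done) => {
        if (error) console.error(error);
        else console.log(res.statusCode);
        done();
    }
});

// `formData` is forwarded to request as a multipart/form-data body
c.queue({
    uri: 'http://example.com/upload', // hypothetical endpoint
    method: 'POST',
    formData: {
        file: fs.createReadStream('./report.pdf')
    }
});
```
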
1.2.0
 * [#278](https://github.com/bda-research/node-crawler/pull/278) Added the missing file-stream require to the download example (@swosko); see the sketch below
 * Use `nock` to mock HTTP requests in tests instead of calling httpbin
 * Replace jshint with eslint
 * Fix code to pass the eslint rules

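For reference, a download along those lines: set `encoding: null` so `res.body` stays a raw `Buffer`, then write it out with `fs` (the URL and filename below are placeholders):

```js
const fs = require('fs');
const Crawler = require('crawler');

const c = new Crawler({
    encoding: null, // keep res.body as a raw Buffer
    jQuery: false,  // no DOM parsing for binary content
    callback: (error, res, done) => {
        if (error) {
            console.error(error);
        } else {
            fs.createWriteStream('image.png').end(res.body);
        }
        done();
    }
});

c.queue('http://example.com/image.png');
```
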
1.1.4
 * Tolerate an incorrect `Content-Type` header [#270](https://github.com/bda-research/node-crawler/pull/270), [#193](https://github.com/bda-research/node-crawler/issues/193)
 * Added examples [#272](https://github.com/bda-research/node-crawler/pull/272), [#267](https://github.com/bda-research/node-crawler/issues/267)
 * Fixed a bug that made the `skipDuplicates` and `retries` options incompatible [#261](https://github.com/bda-research/node-crawler/issues/261); see the sketch below
 * Fix typo in README [#268](https://github.com/bda-research/node-crawler/pull/268)

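With that fix the two options can be combined again; a minimal sketch:

```js
const Crawler = require('crawler');

const c = new Crawler({
    skipDuplicates: true, // drop requests whose URL was already seen
    retries: 3,           // retry failed requests up to three times
    callback: (error, res, done) => {
        if (error) console.error(error);
        done();
    }
});

c.queue('http://example.com/');
c.queue('http://example.com/'); // skipped as a duplicate
```
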
1.1.3
 * Upgraded `request.js` and `lodash`

1.1.2
 * Recognize all XML MIME types when deciding to inject jQuery [#245](https://github.com/bda-research/node-crawler/pull/245)
 * Allow options to specify the Agent for request [#246](https://github.com/bda-research/node-crawler/pull/246); see the sketch below
 * Added logo

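A sketch of supplying a custom Agent, assuming the `agent` option is forwarded to `request` unchanged (the keep-alive settings are only illustrative):

```js
const http = require('http');
const Crawler = require('crawler');

// A keep-alive agent reuses sockets across queued requests
const keepAliveAgent = new http.Agent({ keepAlive: true, maxSockets: 10 });

const c = new Crawler({
    agent: keepAliveAgent, // forwarded to the underlying request call
    callback: (error, res, done) => {
        if (error) console.error(error);
        done();
    }
});

c.queue('http://example.com/');
```
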
1.1.1
 * Added a way to override the global `options.headers` keys by setting `options.headers` when queuing [#241](https://github.com/bda-research/node-crawler/issues/241); see the sketch below
 * Fixed a bug where the last `jar` object was reused when the current options didn't contain a `jar` option [#240](https://github.com/bda-research/node-crawler/issues/240)
 * Fixed an encoding bug [#233](https://github.com/bda-research/node-crawler/issues/233)
 * Added seenreq options [#208](https://github.com/bda-research/node-crawler/issues/208)
 * Added `preRequest`, `setLimiterProperty`, and direct request functions

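A sketch of the per-task header override, together with the new `preRequest` hook (called before each scheduled request fires) and `direct` (which bypasses the scheduler); the signatures follow the README, but treat the details as illustrative:

```js
const Crawler = require('crawler');

const c = new Crawler({
    headers: { 'Accept-Language': 'en-US' }, // global default
    preRequest: (options, done) => {
        // runs before every scheduled request is issued
        console.log('about to fetch', options.uri);
        done();
    },
    callback: (error, res, done) => {
        if (error) console.error(error);
        done();
    }
});

// Headers given here override the matching global keys for this task only
c.queue({
    uri: 'http://example.com/',
    headers: { 'Accept-Language': 'de-DE' }
});

// `direct` sends a request immediately, skipping the queue and rate limiter
c.direct({
    uri: 'http://example.com/ping',
    callback: (error, response) => {
        if (!error) console.log(response.statusCode);
    }
});
```
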
1.0.5
 * Fix missing debugging messages [#213](https://github.com/bda-research/node-crawler/issues/213)
 * Fix a bug where `drain` was never called [#210](https://github.com/bda-research/node-crawler/issues/210)

1.0.4
 * Fix a charset-detection bug [#203](https://github.com/bda-research/node-crawler/issues/203)
 * Keep the Node version up to date in the Travis scripts

1.0.3
 * Fix a bug where `skipDuplicates` and `rotateUA` did not work even when set to true; see the sketch below

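For reference, both options in use; `rotateUA` cycles through the `userAgent` array per request (a minimal sketch):

```js
const Crawler = require('crawler');

const c = new Crawler({
    skipDuplicates: true,
    rotateUA: true, // cycle through the user agents below
    userAgent: [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)'
    ],
    callback: (error, res, done) => {
        if (error) console.error(error);
        done();
    }
});

c.queue('http://example.com/');
```
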
1.0.0
 * Upgrade jsdom to 9.6.x
 * Remove Node 0.10 and 0.12 support [#170](https://github.com/bda-research/node-crawler/issues/170)
 * Control dependency versions using ^ and ~ [#169](https://github.com/bda-research/node-crawler/issues/169)
 * Remove node-pool
 * Hold off notifying bottleneck until a task is completed
 * Replace bottleneck with bottleneckp, which supports priorities
 * Change the default log function
 * Use event listeners on `request` and `drain` instead of global functions [#144](https://github.com/bda-research/node-crawler/issues/144)
 * Set `forceUTF8` to true by default
 * Detect `ESOCKETTIMEDOUT` instead of `ETIMEDOUT` on timeout in tests
 * Add a `done` function to the callback to avoid an async trap (see the sketch below)
 * Do not convert the response body to a string if `encoding` is null [#118](https://github.com/bda-research/node-crawler/issues/118)
 * Add a result document [#68](https://github.com/bda-research/node-crawler/issues/68) [#116](https://github.com/bda-research/node-crawler/issues/116)
 * Add a `schedule` event, emitted when a task is added to the scheduler
 * In the callback, move `$` into `res` to clean up the awkward API
 * Rename `rateLimits` to `rateLimit`

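Several of these changes meet in the 1.x surface: the three-argument callback with `done`, `$` living on `res`, the `drain` and `schedule` listeners, and the renamed `rateLimit` option. A brief sketch:

```js
const Crawler = require('crawler');

const c = new Crawler({
    rateLimit: 1000, // minimum gap between requests, in ms (was `rateLimits`)
    forceUTF8: true, // now the default
    callback: (error, res, done) => {
        if (error) {
            console.error(error);
        } else {
            const $ = res.$; // $ now lives on res
            console.log($('title').text());
        }
        done(); // always call done() to release the task
    }
});

c.on('schedule', (options) => console.log('queued', options.uri));
c.on('drain', () => console.log('queue is empty'));

c.queue('http://example.com/');
```
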
0.7.5
 * Delete entity properties in options before copying and assign them back afterwards; `jar` is a typical such property, an entity with functions [#177](https://github.com/bda-research/node-crawler/issues/177)
 * Upgrade `request` to version 2.74.0

0.7.4
 * Move the `debug` option to the instance level instead of `options`
 * Update README.md to detail error handling
 * Call `onDrain` with `this` as its scope
 * Upgrade `seenreq` to version 0.1.7

0.7.0
 * Remove recursion in `queue`
 * Upgrade `request` to v2.67.0

0.6.9
 * Use `bottleneckConcurrent` instead of `maxConnections`, defaulting to `10000`
 * Add debug info

0.6.5
 * Fix a deep, serious bug in Pool initialization that could lead to sequential execution [#2](https://github.com/bda-research/node-webcrawler/issues/2)
 * Print a log of the Pool status

0.6.3
 * `result.options` is available from the callback even when errors occurred [#127](https://github.com/bda-research/node-crawler/issues/127) [#86](https://github.com/bda-research/node-crawler/issues/86)
 * Add a test for `bottleneck`

0.6.0
 * Add `bottleneck` to implement rate limiting; a limit can be set for each connection at the same time (see the sketch below)

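In today's API the same idea is exposed through `rateLimit` plus the `limiter` option, which groups tasks into independently throttled buckets; a sketch (the limiter names are arbitrary):

```js
const Crawler = require('crawler');

const c = new Crawler({
    rateLimit: 2000, // each limiter bucket fires at most one request per 2s
    callback: (error, res, done) => {
        if (error) console.error(error);
        done();
    }
});

// Tasks with different `limiter` keys are rate-limited independently
c.queue({ uri: 'http://example.com/a', limiter: 'siteA' });
c.queue({ uri: 'http://example.org/b', limiter: 'siteB' });
```
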
0.5.2
 * All resources in the pool can be terminated manually once `onDrain` is called, before their timeouts are reached
 * Add a read-only `queueSize` property to the crawler [#148](https://github.com/bda-research/node-crawler/issues/148) [#76](https://github.com/bda-research/node-crawler/issues/76) [#107](https://github.com/bda-research/node-crawler/issues/107); see the sketch below

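A sketch of reading `queueSize` while the crawler works through its queue:

```js
const Crawler = require('crawler');

const c = new Crawler({
    callback: (error, res, done) => {
        if (error) console.error(error);
        console.log('remaining tasks:', c.queueSize); // read-only property
        done();
    }
});

c.queue(['http://example.com/a', 'http://example.com/b']);
```
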
0.5.1
 * Remove the cache feature, as it was useless
 * Add `localAddress`, `time`, `tunnel`, `proxyHeaderWhiteList`, and `proxyHeaderExclusiveList` properties to pass through to `request` [#155](https://github.com/bda-research/node-crawler/issues/155)

0.5.0
 * Parse the charset from the `Content-Type` HTTP header or the HTML meta tag, then convert
 * The big5 charset is available, since `iconv-lite` already supports it
 * Enable gzip in the request headers by default
 * Remove the unzip code in crawler, since `request` handles it
 * Return the body as a Buffer if `encoding` is null, an option passed through to `request`
 * Remove the cache, and skip duplicate requests for `GET`, `POST` (only for type `urlencode`), and `HEAD`
 * Add a log feature: set `logger: winston` to use `winston`, otherwise crawler logs to the console (see the sketch below)
 * Rotate the user-agent in case some sites ban your requests

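A sketch of the logger hook as described in this entry, assuming a `logger` option accepted at construction time (this reflects the 0.5.0-era API; later versions may differ):

```js
const winston = require('winston');
const Crawler = require('crawler');

const c = new Crawler({
    logger: winston, // per this release; console output is the fallback
    callback: (error, res, done) => {
        if (error) console.error(error);
        done();
    }
});

c.queue('http://example.com/');
```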