#readabilitySAX
a fast and platform-independent readability port

##About
One day I wanted to use [Readability](http://code.google.com/p/arc90labs-readability/), an algorithm that extracts the relevant pieces of information from websites, in a node.js project. There are some ports of Readability to node (using jsdom, e.g. [that one](https://github.com/arrix/node-readability)), but they are pretty slow: I didn't want to wait more than a second (literally) for my node instance to be ready to continue. So I started this project and ported the code to a SAX parser.

In my tests, most pages, even large ones, were finished within 15 ms (on node, see below for more information). It works with Rhino, so it runs on [YQL](http://developer.yahoo.com/yql "Yahoo! Query Language"), which may have interesting uses. And it works within a browser.

The Readability extraction algorithm was ported completely, but some adjustments were made:

* `<article>` tags are recognized and gain a higher value

* If a heading is part of the page's `<title>`, it is removed (Readability removed any single `<h2>` and ignored other tags)

* `henry` and `instapaper-body` are class names that show algorithms like this one where the content is. readabilitySAX recognizes them and marks those elements as the article

* Every bit of code taken from the original algorithm was optimized; e.g. the RegExps should now perform faster (they were optimized & use `RegExp#test` instead of `String#match`, which doesn't force the interpreter to build an array)

##HowTo
###Installing readabilitySAX (node)
This module is available on `npm` as `readabilitySAX`. Just run

    npm install readabilitySAX

###Usage
#####Node
Just run `require("readabilitySAX")`. You'll get an object containing three methods:

* `get(link, callback)`: Fetches a webpage and processes it.

* `process(data)`: Takes a string, runs readabilitySAX and returns the page.

* `Readability(settings)`: The Readability object. It works as a handler for `htmlparser2`.

#####Browsers

I started to implement simplified SAX-"parsers" for Rhino/YQL (using E4X) and for browsers (using the DOM) to increase the overall performance on those platforms. The DOM version is inside the `/browsers` dir.

A demo of how to use readabilitySAX inside a browser may be found at [jsFiddle](http://jsfiddle.net/pXqYR/embedded/). Some basic example files are inside the `/browsers` directory.

#####YQL

A table using E4X-based events is available as the community table `redabilitySAX`, as well as [here](https://github.com/FB55/yql-tables/tree/master/readabilitySAX).

##Parsers (on node)
Most SAX parsers (such as sax.js) fail when a document is malformed XML, even if it is valid HTML. readabilitySAX should therefore be used with [htmlparser2](https://github.com/FB55/node-htmlparser), my fork of the `htmlparser` module (used by e.g. `jsdom`), which corrects most faults. It's listed as a dependency, so npm will install it alongside readabilitySAX.

##Performance
Using a package of 680 pages from [CleanEval](http://cleaneval.sigwac.org.uk) (their website seems to be down, try to google it), readabilitySAX processed all of them in 6667 ms, an average of 9.8 ms per page.

The benchmark was done using `tests/benchmark.js` on a MacBook (late 2010) and is probably far from perfect.

Performance is the main goal of this project. The current speed should be good enough to run readabilitySAX on a single-threaded web server with an average number of requests. That's an accomplishment!

##Settings
These are the options that may be passed to the Readability object:

* `stripUnlikelyCandidates`: Removes elements that probably don't belong to the article. Default: true

* `weightClasses`: Indicates whether classes should be scored. This may lead to shorter articles. Default: true

* `cleanConditionally`: Removes elements that don't match specific criteria (defined by the original Readability). Default: true

* `cleanAttributes`: Only allows a small set of attributes and drops everything else. Default: true

* `searchFurtherPages`: Indicates whether links should be checked to see whether they point to the next page of the article. Default: true

* `linksToSkip`: A map of pages that should be ignored when searching for links to further pages. Default: {}

* `pageURL`: The URL of the current page. It is used to resolve all other links and is itself ignored when searching for further pages. Default: ""

* `resolvePaths`: Indicates whether ".." and "." segments inside paths should be eliminated. Default: false

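For reference, here is a settings object with every default from the list above spelled out explicitly; passing it to the `Readability` constructor should be equivalent to passing no settings at all:

```javascript
// All defaults from the list above, written out explicitly.
// Pass this to the constructor, e.g. new readabilitySAX.Readability(settings);
// the example URL in the comment below is hypothetical.
var settings = {
    stripUnlikelyCandidates: true,
    weightClasses: true,
    cleanConditionally: true,
    cleanAttributes: true,
    searchFurtherPages: true,
    linksToSkip: {},    // e.g. { "http://example.com/page2": true }
    pageURL: "",
    resolvePaths: false
};
```

Overriding a single key (say, `cleanAttributes: false` to keep all attributes) while leaving the rest at their defaults is the usual pattern.
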
##Todo

- Add documentation & examples
- Improve the performance (always)
\No newline at end of file