#readabilitySAX

a fast and platform independent readability port

##About

One day, I wanted to use [Readability](http://code.google.com/p/arc90labs-readability/), an algorithm to extract relevant pieces of information out of websites, for a node.js project. There are some ports of Readability to node (using jsdom, e.g. [that one](https://github.com/arrix/node-readability)), but they are pretty slow: I didn't want to wait (literally) more than a second until my node instance was ready to continue. So I started this project, porting the code to a SAX parser.

In my tests, most pages, even large ones, were finished within 15 ms (on node; see below for more information). It works with Rhino, so it runs on [YQL](http://developer.yahoo.com/yql "Yahoo! Query Language"), which may have interesting uses. And it works within a browser.

The Readability extraction algorithm was completely ported, but some adjustments were made:

* `<article>` tags are recognized and gain a higher value

* If a heading is part of the page's `<title>`, it is removed (Readability removed any single `<h2>` and ignored other tags)

* `henry` and `instapaper-body` are classes meant to show algorithms like this one where the content is; readabilitySAX recognizes them and marks those elements as the article

* Every bit of code that was taken from the original algorithm was optimized, e.g. the RegExps should now perform faster (they were optimized & use `RegExp#test` instead of `String#match`, which doesn't force the interpreter to build an array)

##HowTo
###Installing readabilitySAX (node)
This module is available on `npm` as `readabilitySAX`. Just run

    npm install readabilitySAX

###Usage
#####Node
Just run `require("readabilitySAX")`. You'll get an object containing three methods:

* `get(link, callback)`: Gets a webpage and processes it.

* `process(data)`: Takes a string, runs readabilitySAX and returns the page.

* `Readability(settings)`: The Readability object. It works as a handler for `htmlparser2`.
|
34 |
|
#####Browsers

I started to implement simplified SAX "parsers" for Rhino/YQL (using E4X) and the browser (using the DOM) to increase the overall performance on those platforms. The DOM version is inside the `/browsers` dir.

A demo of how to use readabilitySAX inside a browser can be found at [jsFiddle](http://jsfiddle.net/pXqYR/embedded/). Some basic example files are inside the `/browsers` directory.

#####YQL

A table using E4X-based events is available as the community table `redabilitySAX` (note the misspelling), as well as [here](https://github.com/FB55/yql-tables/tree/master/readabilitySAX).

##Parsers (on node)
Most SAX parsers (such as sax.js) fail when a document is malformed XML, even if it's valid HTML. readabilitySAX should be used with [htmlparser2](https://github.com/FB55/node-htmlparser), my fork of the `htmlparser` module (used by e.g. `jsdom`), which corrects most faults. It's listed as a dependency, so npm should install it along with readabilitySAX.

##Performance
Using a package of 680 pages from [CleanEval](http://cleaneval.sigwac.org.uk) (their website seems to be down, try to google it), readabilitySAX processed all of them in 6667 ms, an average of 9.8 ms per page.

The benchmark was done using `tests/benchmark.js` on a MacBook (late 2010) and is probably far from perfect.

Performance is the main goal of this project. The current speed should be good enough to run readabilitySAX on a single-threaded web server with an average number of requests. That's an accomplishment!

##Settings
These are the options that one may pass to the Readability object:

* `stripUnlikelyCandidates`: Removes elements that probably don't belong to the article. Default: `true`

* `weightClasses`: Indicates whether classes should be scored. This may lead to shorter articles. Default: `true`

* `cleanConditionally`: Removes elements that don't match specific criteria (defined by the original Readability). Default: `true`

* `cleanAttributes`: Only allows some attributes, ignoring all the crap nobody needs. Default: `true`

* `searchFurtherPages`: Indicates whether links should be checked to see if they point to the next page of an article. Default: `true`

* `linksToSkip`: A map of pages that should be ignored when searching for links to further pages. Default: `{}`

* `pageURL`: The URL of the current page. It is used to resolve all other links and is skipped when searching for links to further pages. Default: `""`

* `resolvePaths`: Indicates whether ".." and "." inside paths should be eliminated. Default: `false`

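Put together, a settings object might look like the sketch below. The values simply restate the documented defaults; the URL is a placeholder:

```javascript
// All values restate the documented defaults, except pageURL (a placeholder).
var settings = {
    stripUnlikelyCandidates: true,
    weightClasses: true,
    cleanConditionally: true,
    cleanAttributes: true,
    searchFurtherPages: true,
    linksToSkip: {},                        // map of URLs to ignore
    pageURL: "http://example.com/article",  // used to resolve relative links
    resolvePaths: false
};

// Passed to the constructor: new Readability(settings)
```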
##Todo

- Add documentation & examples
- Improve the performance (always)