# Scraping Headlines Using Stew

*This is a complete (but simple) example of using [Stew](https://github.com/rodw/stew) to extract content from the web. It is written as a "litcoffee" file, which is an executable/compilable file containing Markdown content with embedded CoffeeScript. ([Follow this link to go back to the README file.](../README.html))*

In this example, we'll extract headlines from the venerable social-tech-news site [Slashdot](http://slashdot.org/).

    URL = 'http://slashdot.org/'

If you examine the HTML of the Slashdot homepage carefully, you'll find that each headline is contained in an `h2` tag with the class `story`, and that within this heading there is an anchor (`a`) tag that contains the link.  As a CSS selector, that looks like:

    SELECTOR = 'h2.story a'

We'll use that selector to extract the headlines and links from the HTML, and print them to the console with the following function:

    print_headline = (node)->
      headline = domutil.to_text(node)
      link = "http:#{node.attribs.href}"
      console.log "#{headline} <#{link}>"

(`domutil` is an instance of Stew's `DOMUtil` type, which is imported below.)
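
Note that `print_headline` prefixes `http:` to each `href`, which appears to assume that Slashdot's story links are protocol-relative (for example, `//news.slashdot.org/story/...`). If a page mixes relative, protocol-relative, and absolute links, one alternative is to resolve each `href` against the page URL. The following is a minimal sketch of that idea using Node's legacy `url.resolve` function; `absolute_link` is a hypothetical helper and is not part of Stew. (The block is fenced rather than indented, so it is not executed as part of this file.)

```coffeescript
url = require 'url'

# Hypothetical helper (not part of Stew): resolve an href against the page URL.
absolute_link = (base, href)->
  url.resolve base, href

console.log absolute_link('http://slashdot.org/', '//news.slashdot.org/story/1')
# prints http://news.slashdot.org/story/1
console.log absolute_link('http://slashdot.org/', '/faq')
# prints http://slashdot.org/faq
```

With a helper like this, the `link` assignment in `print_headline` could become `absolute_link(URL, node.attribs.href)`.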

Now, given an `html` string, selecting and printing the headlines is as simple as this:

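    select_and_print_headlines = (html)->
      stew.select html, SELECTOR, (err,nodeset)->
        # Report selection errors instead of iterating over a missing nodeset.
        return console.error("Error:", err) if err?
        for node in nodeset
          print_headline node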

That's really all there is to it. All of the Stew-specific code is found above.

The rest of this file jumps through the hoops needed to download the HTML document from the web.

## Importing the Library

When using Stew, you'll typically import the library using something like this:

    # This is what you'll typically do:
    # stew = new (require('stew-select')).Stew()
    # and/or
    # domutil = new (require('stew-select')).DOMUtil()

but since this file is found *within* the Stew repository itself, we'll do things a little differently.  Most readers can safely ignore these next few lines and use the simple `require` statement above instead.

    # You WON'T do the following. We're only doing it here because we
    # want to use the "local" implementation of Stew.
    fs          = require 'fs'
    path        = require 'path'
    HOMEDIR     = path.join(__dirname,'..')
    LIB_COV_DIR = path.join(HOMEDIR,'lib-cov')
    LIB_DIR     = if fs.existsSync(LIB_COV_DIR) then LIB_COV_DIR else path.join(HOMEDIR,'lib')
    stew        = new (require(path.join(LIB_DIR,'stew'))).Stew()
    domutil     = new (require(path.join(LIB_DIR,'stew'))).DOMUtil()

## Setting up the HTTP "Fetcher"

Let's define a function that will fetch a web page and pass the resulting content to a callback function.  We'll use the Node.js `http` library for this.

    http = require 'http'

Our function will accept the `url` of the document to download and a `callback` function to invoke once the document has been downloaded.

Following Node.js convention, we'll use the signature `callback(err,body)` for the callback function.
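
For readers new to this convention, here is a small illustrative sketch: the first callback argument carries an error (or `null` on success) and the second carries the result. The `parse_port` function is hypothetical and exists only to demonstrate the pattern; it is fenced rather than indented, so it is not executed as part of this file.

```coffeescript
# Hypothetical function, used only to illustrate the callback(err,result) convention.
parse_port = (text,callback)->
  port = parseInt(text, 10)
  if isNaN(port)
    callback "Not a number: #{text}"
  else
    callback null, port

parse_port '8080', (err,port)->
  if err?
    console.error "Error:", err
  else
    console.log "Port:", port
```

Our `fetch` function reports failures and results to its caller in the same way.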

    fetch = (url,callback)->

We'll create a callback function for `http` that buffers the HTTP response:

      http_callback = (response)->
        unless 200 <= response.statusCode <= 299
          callback "Unexpected status code #{response.statusCode}"
        else
          buffer = ""
          response.setEncoding 'utf8'
          response.on 'data', (chunk)->buffer += chunk

and, when the full response body has been received, pass it to the callback:

          response.on 'end', ()-> callback(null,buffer)

Finally, we can trigger the actual request:

      http.get(url, http_callback).on('error', callback)

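Now our `fetch` function will download content from the URL and pass it to a callback function. Any response with a status code outside the 200-299 range, including a redirect, is reported as an error.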
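
As written, `fetch` only handles `http:` URLs, because Node's `http.get` does not accept `https:` URLs. Node's `https` module exposes a `get` function with the same shape, so one way to support both schemes is to pick the client module from the URL. The following is a minimal sketch of that idea; `client_for` is a hypothetical helper that does not appear elsewhere in this file, and the block is fenced rather than indented, so it is not executed.

```coffeescript
http  = require 'http'
https = require 'https'

# Hypothetical helper: choose the client module that matches the URL scheme.
client_for = (url)->
  if /^https:/i.test(url) then https else http
```

Under that assumption, the last line of `fetch` could become `client_for(url).get(url, http_callback).on('error', callback)`.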

## Actual processing

Now we can fetch the document and print the result using our `select_and_print_headlines` function:

    fetch URL, (err,body)->
      if err?
        console.error "Error:", err
      else
        console.log '-----------------------------------------'
        console.log "CURRENT HEADLINES AT #{URL}"
        console.log '-----------------------------------------'
        select_and_print_headlines body
        console.log '-----------------------------------------'

## Running this script

Now we can run this script by typing:

```console
coffee docs/example.litcoffee
```

and see output like the following:

```console
-----------------------------------------
CURRENT HEADLINES AT http://slashdot.org/
-----------------------------------------
DRM: How Book Publishers Failed To Learn From the Music Industry <http://news.slashdot.org/story/13/05/31/2045211/drm-how-book-publishers-failed-to-learn-from-the-music-industry>
Small Black Holes: Cloudy With a Chance of Better Visibility <http://science.slashdot.org/story/13/05/31/214224/small-black-holes-cloudy-with-a-chance-of-better-visibility>
No, the Tesla Model S Doesn't Pollute More Than an SUV <http://tech.slashdot.org/story/13/05/31/1955214/no-the-tesla-model-s-doesnt-pollute-more-than-an-suv>
The Case For a Government Bug Bounty Program <http://it.slashdot.org/story/13/05/31/1933231/the-case-for-a-government-bug-bounty-program>
When Smart Developers Generate Crappy Code <http://developers.slashdot.org/story/13/05/31/1854203/when-smart-developers-generate-crappy-code>
New York City Wants To Revive Old Voting Machines <http://tech.slashdot.org/story/13/05/31/1748201/new-york-city-wants-to-revive-old-voting-machines>
Big Asteroid (With Its Own Moon) To Have Closest Approach With Earth Today <http://science.slashdot.org/story/13/05/31/1727256/big-asteroid-with-its-own-moon-to-have-closest-approach-with-earth-today>
Google Maps Used To Find Tax Cheats <http://tech.slashdot.org/story/13/05/31/1721232/google-maps-used-to-find-tax-cheats>
Judge Orders Google To Comply With FBI's Warrantless NSL Requests <http://yro.slashdot.org/story/13/05/31/1633209/judge-orders-google-to-comply-with-fbis-warrantless-nsl-requests>
Ask Slashdot: How Important Is Advanced Math In a CS Degree? <http://ask.slashdot.org/story/13/05/31/1546253/ask-slashdot-how-important-is-advanced-math-in-a-cs-degree>
Badgers Block British Broadband Buildout <http://news.slashdot.org/story/13/05/31/1530227/badgers-block-british-broadband-buildout>
Confirmed: Water Once Flowed On Mars <http://science.slashdot.org/story/13/05/31/1523245/confirmed-water-once-flowed-on-mars>
Motorola Developing Pill and Tattoo Authentication Methods <http://it.slashdot.org/story/13/05/31/1414210/motorola-developing-pill-and-tattoo-authentication-methods>
Seeing Atomic Bonds Before and After Reactions <http://science.slashdot.org/story/13/05/31/1353241/seeing-atomic-bonds-before-and-after-reactions>
U.S. Authorizes Sales of American Communication Tech To Iran <http://news.slashdot.org/story/13/05/31/145229/us-authorizes-sales-of-american-communication-tech-to-iran>
-----------------------------------------
```

*([Follow this link to go back to the README file.](../README.html))*
