UNPKG

scrape-it

Version:
187 lines (155 loc) 5.67 kB
[![scrape-it](https://i.imgur.com/j3Z0rbN.png)](#) # scrape-it [![PayPal](https://img.shields.io/badge/%24-paypal-f39c12.svg)][paypal-donations] [![Version](https://img.shields.io/npm/v/scrape-it.svg)](https://www.npmjs.com/package/scrape-it) [![Downloads](https://img.shields.io/npm/dt/scrape-it.svg)](https://www.npmjs.com/package/scrape-it) [![Get help on Codementor](https://cdn.codementor.io/badges/get_help_github.svg)](https://www.codementor.io/johnnyb?utm_source=github&utm_medium=button&utm_term=johnnyb&utm_campaign=github) > A Node.js scraper for humans. ## :cloud: Installation ```sh $ npm i --save scrape-it ``` ## :clipboard: Example ```js const scrapeIt = require("scrape-it"); scrapeIt("http://ionicabizau.net", [ // Fetch the articles on the page (list) { listItem: ".article" , name: "articles" , data: { createdAt: { selector: ".date" , convert: x => new Date(x) } , title: "a.article-title" , tags: { selector: ".tags" , convert: x => x.split("|").map(c => c.trim()).slice(1) } , content: { selector: ".article-content" , how: "html" } } } , { listItem: "li.page" , name: "pages" , data: { title: "a" , url: { selector: "a" , attr: "href" } } } // Fetch some additional data , { title: ".header h1" , desc: ".header h2" , avatar: { selector: ".header img" , attr: "src" } } ], (err, page) => { console.log(err || page); }); // { articles: // [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET), // title: 'Pi Day, Raspberry Pi and Command Line', // tags: [Object], // content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' }, // { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET), // title: 'How I ported Memory Blocks to modern web', // tags: [Object], // content: '<p>Playing computer games is a lot of fun. ...' }, // { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET), // title: 'How to convert JSON to Markdown using json2md', // tags: [Object], // content: '<p>I love and ...' } ], // pages: // [ { title: 'Blog', url: '/' }, // { title: 'About', url: '/about' }, // { title: 'FAQ', url: '/faq' }, // { title: 'Training', url: '/training' }, // { title: 'Contact', url: '/contact' } ], // title: 'Ionică Bizău', // desc: 'Web Developer, Linux geek and Musician', // avatar: '/images/logo.png' } ``` ## :memo: Documentation ### `scrapeIt(url, opts, cb)` A scraping module for humans. #### Params - **String|Object** `url`: The page url or request options. - **Object|Array** `opts`: The options passed to `scrapeCheerio` method. - **Function** `cb`: The callback function. #### Return - **Tinyreq** The request object. ### `scrapeIt.scrapeCheerio($input, opts, $)` Scrapes the data in the provided element. #### Params - **Cheerio** `$input`: The input element. - **Object** `opts`: An array or object containing the scraping information. If you want to scrape a list, you have to use the `listItem` selector: - `listItem` (String): The list item selector. - `name` (String): The list name (e.g. `articles`). - `data` (Object): The fields to include in the list objects: - `<fieldName>` (Object|String): The selector or an object containing: - `selector` (String): The selector. - `convert` (Function): An optional function to change the value. - `how` (Function|String): A function or function name to access the value. - `attr` (String): If provided, the value will be taken based on the attribute name. - `trim` (Boolean): If `false`, the value will *not* be trimmed (default: `true`). - `eq` (Number): If provided, it will select the *nth* element. - `listItem` (Object): An object, keeping the recursive schema of the `listItem` object. This can be used to create nested lists. **Example**: ```js { listItem: ".article" , name: "articles" , data: { createdAt: { selector: ".date" , convert: x => new Date(x) } , title: "a.article-title" , tags: { selector: ".tags" , convert: x => x.split("|").map(c => c.trim()).slice(1) } , content: { selector: ".article-content" , how: "html" } } } ``` If you want to collect specific data from the page, just use the same schema used for the `data` field. **Example**: ```js { title: ".header h1" , desc: ".header h2" , avatar: { selector: ".header img" , attr: "src" } } ``` - **Function** `$`: The Cheerio function. #### Return - **Object** The scrapped data. ## :yum: How to contribute Have an idea? Found a bug? See [how to contribute][contributing]. ## :scroll: License [MIT][license] © [Ionică Bizău][website] [paypal-donations]: https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=RVXDDLKKLQRJW [donate-now]: http://i.imgur.com/6cMbHOC.png [license]: http://showalicense.com/?fullname=Ionic%C4%83%20Biz%C4%83u%20%3Cbizauionica%40gmail.com%3E%20(http%3A%2F%2Fionicabizau.net)&year=2016#license-mit [website]: http://ionicabizau.net [contributing]: /CONTRIBUTING.md [docs]: /DOCUMENTATION.md