site-mapper
===========
Site Map Generation in node.js

## Installation ##

This module is intended to be used as a dependency in a website-specific
site map building project. Add the module to the "dependencies" section
of a package.json file:

```json
{
  "dependencies": {
    "site-mapper": ">= 2.0.1"
  }
}
```

There is nothing stopping you from installing it locally and editing the embedded
configuration files, but that is not the intent.

    npm install --save site-mapper

## Running site-mapper ##

Create a directory to hold your site map generation configuration. This
directory will hold all the files needed to tell site-mapper what to
create.

### Dependencies ###
Create a package.json file similar to the following:

```json
{
  "author": {
    "name": "YOUR NAME HERE",
    "email": "YOUR EMAIL HERE"
  },
  "name": "my-website-site-maps",
  "description": "sitemap generation for mysite.com",
  "version": "0.0.1",
  "homepage": "",
  "keywords": [
    "sitemap"
  ],
  "dependencies": {
    "site-mapper": ">= 2.0.1"
  },
  "engines": {
    "node": "*"
  }
}
```

### Configuration ###

Create a directory called ./config. For each environment for which you will generate
sitemaps (possibly only one, production), create a JavaScript, CoffeeScript or ES6
file in the config directory named for the environment:

    ./config/production.js, ./config/production.coffee or ./config/production.es6

#### ES6 Configuration Files ####
If you wish to use ES6 for your configuration, save the configuration using the .es6
file extension:

    ./config/production.es6

And add the following to your package.json as dependencies:

```json
{
  "dependencies": {
    "babel-register": "^6.9.0",
    "babel-preset-es2015": "^6.6.0",
    "babel-preset-stage-0": "^6.5.0"
  }
}
```

You will also have to create a .babelrc file in the root of your project with the
following contents:

```
{
  "presets": ["es2015", "stage-0"]
}
```

At this time, any ES6 configuration file should export the configuration object
using module.exports instead of ES6 export statements.

#### Coffeescript Configuration Files ####

If you want to use CoffeeScript for configuration, you will have to add
the coffee-script module as a dependency:

```json
{
  "dependencies": {
    "coffee-script": "^1.10.0"
  }
}
```

#### Configuration Format ####

The configuration file can contain any of the following keys. The
values below are defaults that will be used unless overridden in your
configuration file.

```coffee
config = {}
config.sources = {}
config.sitemaps = {}
config.logConfig =
  name: "sitemapper",
  level: "debug"
config.defaultSitemapConfig =
  targetDirectory: "#{process.cwd()}/tmp/sitemaps/#{config.env}"
  sitemapIndex: "sitemap.xml"
  sitemapRootUrl: "http://www.mysite.com"
  sitemapFileDirectory: "/sitemaps"
  maxUrlsPerFile: 50000
  urlBase: "http://www.mysite.com"
config.sitemapIndexHeader = '<?xml version="1.0" encoding="UTF-8"?><sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
config.sitemapHeader = '<?xml version="1.0" encoding="UTF-8"?><urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" xmlns:geo="http://www.google.com/geo/schemas/sitemap/1.0" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9/">'
config.defaultUrlFormatter = (options) ->
  (href) ->
    if '/' == href
      options.urlBase
    else if href && href.length && href[0] == '/'
      "#{options.urlBase}#{href}"
    else if href && href.length && href.match(/^https?:\/\//)
      href
    else
      if href.length
        "#{options.urlBase}/#{href}"
      else
        options.urlBase
```
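
To make the four branches of the default url formatter easier to follow, here is a JavaScript rendering of the same logic. It is a sketch for illustration, not the library source; the only deliberate difference is that it treats a missing href like an empty string instead of throwing.

```javascript
// JavaScript sketch of the default url formatter shown in the defaults above.
function defaultUrlFormatter(options) {
  return function (href) {
    if (href === '/') return options.urlBase;                    // site root
    if (href && href[0] === '/') return options.urlBase + href;  // root-relative path
    if (href && /^https?:\/\//.test(href)) return href;          // already absolute
    // Bare relative path, or nothing at all (guarded here, unlike the original).
    return href && href.length ? options.urlBase + '/' + href : options.urlBase;
  };
}

const format = defaultUrlFormatter({ urlBase: 'http://www.mysite.com' });
console.log(format('/about')); // → http://www.mysite.com/about
console.log(format('faq'));    // → http://www.mysite.com/faq
```

Absolute http and https urls pass through unchanged, so a source can mix relative paths and fully qualified urls.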

The sitemaps object contains named keys pointing at objects that define a
particular sitemap. The sitemap definition can contain (and override) any
of the keys in the config.defaultSitemapConfig object.
The produced sitemap consists of a sitemap index xml file referencing one or
more gzipped sitemap xml files, created from urls produced by the config.sources objects.
The configuration allows you to define one or more sitemaps. For example,
you might configure one sitemap for the http version of a site and another
for the https version, or one sitemap for the www subdomain and another for
the foobar subdomain. By default, all sources defined in config.sources are
used to generate urls for all sitemaps. To use different sources for different
sitemaps, provide a sources key in each sitemap configuration object, like one
of the following:

```coffee
# Specify which sources to include. All others are ignored
sources:
  includes: ['source1', 'source2', ...]
```
or
```coffee
# Specify which sources to exclude. All others are included
sources:
  excludes: ['source1', 'source2', ...]
```
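
The include/exclude rule can be sketched as a small selection function. The helper name `selectSources` is hypothetical and is not part of site-mapper; it only illustrates the behaviour described above (everything by default, otherwise an allow list or a deny list).

```javascript
// Hypothetical helper: which source names apply to one sitemap definition?
function selectSources(sources, sitemapSources) {
  const names = Object.keys(sources);
  const { includes, excludes } = sitemapSources || {};
  if (Array.isArray(includes)) return names.filter((name) => includes.includes(name));
  if (Array.isArray(excludes)) return names.filter((name) => !excludes.includes(name));
  return names; // no sources key: use every configured source
}

const sources = { staticUrls: () => ({}), serviceUrls: () => ({}), blogUrls: () => ({}) };
console.log(selectSources(sources, { excludes: ['blogUrls'] })); // → [ 'staticUrls', 'serviceUrls' ]
```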

The sources object contains arbitrarily named keys pointing at functions that take
a single sitemapConfig object and return an object with the
following keys: type, options. The options key points at a source configuration object
that can contain the following keys: input, options, siteMap, cached, ignoreErrors.

The input parameter, sitemapConfig, is an object
formed by merging the config.defaultSitemapConfig object with the specific sitemap
configuration (more about this later).
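
As a rough illustration of that merge, the sketch below combines a set of defaults with one sitemap definition. It assumes a shallow merge in which the sitemap-specific keys win; whether site-mapper merges shallowly or deeply is not stated here, and the `mainSitemap` object is a made-up example.

```javascript
// Defaults, abbreviated from the configuration format section.
const defaultSitemapConfig = {
  sitemapIndex: 'sitemap.xml',
  sitemapRootUrl: 'http://www.mysite.com',
  sitemapFileDirectory: '/sitemaps',
  maxUrlsPerFile: 50000,
  urlBase: 'http://www.mysite.com'
};

// Hypothetical sitemap definition that overrides two of the defaults.
const mainSitemap = {
  urlBase: 'http://staging.mysite.com',
  sitemapIndex: 'sitemap_index.xml'
};

// Assumed shallow merge: later arguments override earlier ones.
const sitemapConfig = Object.assign({}, defaultSitemapConfig, mainSitemap);
console.log(sitemapConfig.urlBase);        // → http://staging.mysite.com
console.log(sitemapConfig.maxUrlsPerFile); // → 50000
```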

In the returned object, the type is either one of the built-in source
type classes (see below) or a site-specific class derived from the SitemapTransformer base class.

A minimal config might be:

```coffee
{StaticSetSource, JsonSource} = require 'site-mapper'
appConfig =
  sitemaps:
    main:
      sitemapRootUrl: "http://staging.mysite.com"
      urlBase: "http://staging.mysite.com"
      sitemapIndex: "sitemap_index.xml"
      targetDirectory: "#{process.cwd()}/tmp/sitemaps/#{process.env.NODE_ENV}/http"
  sources:
    staticUrls: (sitemapConfig) ->
      type: StaticSetSource
      options:
        siteMap:
          channel: 'static'
          changefreq: 'weekly'
          priority: 1
        options:
          urls: [
            '/',
            '/about',
            '/faq',
            '/jobs'
          ]
    serviceUrls: (sitemapConfig) ->
      type: JsonSource
      options:
        siteMap:
          changefreq: 'weekly'
          priority: 0.8
          channel: (url) -> url.category
          # urlBase already includes the scheme, so it is not repeated here
          urlAugmenter: (url) ->
            url.url = "#{sitemapConfig.urlBase}/widgets/#{url.category}/#{url.url}"
        input:
          url: "http://api.mysite.com/widgets"
        options:
          filter: /urls\./

module.exports = appConfig
```

### Logging ###
site-mapper logs using bunyan format. To get pretty-printed logs while running, pipe
the site-mapper output through the bunyan command line tool.

### Running the Code ###

Finally, putting it all together, you can generate the sitemaps as follows:

 1. Install all the dependencies:
    ```
    rm -rf node_modules;
    npm install
    ```
 1. Run the generator:
    ```
    NODE_ENV=staging ./node_modules/.bin/site-mapper | ./node_modules/site-mapper/node_modules/.bin/bunyan
    ```

Below is a makefile that encapsulates the above recipe. Run it with:

    make setup generate

```make
usage :
	@echo ''
	@echo 'Core tasks           : Description'
	@echo '-------------------- : -----------'
	@echo 'make setup           : Install dependencies'
	@echo 'make generate        : Generate the sitemaps'
	@echo ''

SITEMAPPER=./node_modules/.bin/site-mapper
NPM_ARGS=
NODE_ENV=staging

setup :
	@rm -rf node_modules
	@echo npm $(NPM_ARGS) install
	@npm $(NPM_ARGS) install

generate :
	@rm -rf tmp
	@NODE_ENV=$(NODE_ENV) $(SITEMAPPER)
```

## Sitemap Generation ##

The site-mapper module views the sitemap generation process as follows:

SiteMapper creates one or more Source objects, pipes each one to a Sitemap, which
then pipes to one or more SitemapFile objects, depending on the number of urls the
source produces and the configured maximum number of urls per SitemapFile (50,000
by default).

    +------------+          +------------+        +--------------+       +-------------+
    | SiteMapper |          |   Source   |        |   Sitemap    |       | SitemapFile |
    |------------| creates  |------------|creates |--------------| adds  |-------------|
    |            +--------->|            |------->|              |------>|             |
    |            |          |            |  urls  |              | urls  |             |
    +------------+          +------------+        +--------------+       +-------------+
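
The relationship between the url count and the number of SitemapFile objects is simple arithmetic. The function below is a sketch with a hypothetical name; it assumes files are filled to the limit before a new one is started.

```javascript
// Number of sitemap files needed for a given number of urls.
function sitemapFileCount(urlCount, maxUrlsPerFile = 50000) {
  return Math.ceil(urlCount / maxUrlsPerFile);
}

console.log(sitemapFileCount(120000));      // → 3
console.log(sitemapFileCount(50000));       // → 1
console.log(sitemapFileCount(250, 100));    // → 3
```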

### Sources ###

Sources are Node.js stream API Transform implementations that operate in object mode
and produce url objects from data of a specific format. There are four included Source implementations:

 1. StaticSetSource - This source is configured with a static list of url strings
    in a configuration file.
 1. CsvSource - Produces urls from CSV data.
 1. JsonSource - Produces urls from JSON data.
 1. XmlSource - Produces urls from XML data.

Sources are configured with an input that produces raw text data of the right format. Source inputs can
be either files, urls or an instantiated Readable stream object. The source reads the input
and converts it into Url objects. The Url objects may then be filtered based on the url
object properties and are then provided to downstream consumers through the Transform implementation.
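
For readers unfamiliar with object-mode Transform streams, here is a minimal sketch of the idea using only Node's built-in `stream` module. The `PathSource` class is hypothetical and does not derive from site-mapper's own base class; it only shows how raw input can be turned into url objects and how unwanted entries can be filtered out.

```javascript
const { Transform } = require('stream');

// Hypothetical source: turns raw path strings into url objects.
class PathSource extends Transform {
  constructor(urlBase) {
    super({ objectMode: true });
    this.urlBase = urlBase;
  }

  _transform(path, encoding, callback) {
    // Filter step: entries that are not root-relative paths are dropped.
    if (typeof path !== 'string' || !path.startsWith('/')) return callback();
    // Convert step: emit one url object downstream.
    callback(null, { url: this.urlBase + path });
  }
}

const source = new PathSource('http://www.mysite.com');
source.on('data', (url) => console.log(url.url));
source.write('/about');
source.write('not-a-path');
source.end('/faq');
```

Only `/about` and `/faq` reach the consumer; the second entry is filtered out inside the stream.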

#### Source configuration details ####

The configuration object for sources has the following form:

```coffee
config =
  ignoreErrors: true|false
  input: {},
  options: {},
  siteMap: {}
  cached: {}
```

 * ignoreErrors
   Set to true if you want to log and ignore errors. If set to false (default) an error in
   the input stream aborts the entire process.
 * input object
   This object can have one of the following keys:
   1. fileName - full path of the file containing the url data
   2. url - URL that will produce the url data
   3. stream - An instantiated streaming API Readable object that, when read, produces url data
 * options object
   Defines source-specific options (see below)
 * siteMap object
   Defines sitemap information specific to the source, like the priority of urls it produces.
 * cached object
   If present, turns on caching of the data produced by the input so that subsequent runs or
   even other sources in the configuration can use it. Contains the cacheFile attribute pointing
   at the path for the cached data and the maxAge attribute, a time in milliseconds for which
   the cached data should be considered fresh.
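
One plausible reading of the cached settings is a freshness check on the cache file's modification time. The sketch below assumes exactly that; the function name `isCacheFresh` is hypothetical and the way site-mapper actually decides freshness may differ.

```javascript
const fs = require('fs');

// Hypothetical check: the cache is fresh if cacheFile exists and is younger than maxAge (ms).
function isCacheFresh(cached, now = Date.now()) {
  if (!cached || !cached.cacheFile) return false;
  try {
    const { mtimeMs } = fs.statSync(cached.cacheFile);
    return now - mtimeMs < cached.maxAge;
  } catch (err) {
    return false; // no cache file yet, so the input must be read
  }
}

// Example: consider cached widget data fresh for one hour.
const cached = { cacheFile: '/tmp/widgets-cache.json', maxAge: 60 * 60 * 1000 };
console.log(isCacheFresh(cached));
```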

#### URL Channel ####

Each url is associated with a channel. The channel can be a static string or
derived from the url at runtime. In either case, the channel is specified by setting
the channel attribute on the source configuration object's siteMap object. It can
be either a string (static channel) or a function taking a Url object and returning
a string.

```coffee
sources =
  staticChannel: (sitemapConfig) ->
    siteMap:
      channel: 'foo'
  dynamicChannel: (sitemapConfig) ->
    siteMap:
      channel: (url) -> url.category
```

The channel is used to name individual sitemap files where urls produced by the sources will
end up. Files of the form ${CHANNEL}${SEQUENCE}.xml.gz will be created in the target directory.
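
The two channel styles and the resulting file name can be sketched as follows. Both helper names are hypothetical, and the starting value of the sequence number is an assumption (it is passed in explicitly here).

```javascript
// Resolve a channel that is either a static string or a function of the url object.
function resolveChannel(siteMap, url) {
  return typeof siteMap.channel === 'function' ? siteMap.channel(url) : siteMap.channel;
}

// Build a file name of the form ${CHANNEL}${SEQUENCE}.xml.gz.
function sitemapFileName(channel, sequence) {
  return `${channel}${sequence}.xml.gz`;
}

const staticSiteMap = { channel: 'foo' };
const dynamicSiteMap = { channel: (url) => url.category };
const url = { url: '/widgets/gadgets/blue', category: 'gadgets' };

console.log(sitemapFileName(resolveChannel(staticSiteMap, url), 1));  // → foo1.xml.gz
console.log(sitemapFileName(resolveChannel(dynamicSiteMap, url), 1)); // → gadgets1.xml.gz
```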

### Sitemaps ###

A Sitemap is created for each URL channel. As urls are added to their corresponding
Sitemap, it creates sequentially numbered sitemap files, each containing a
configurable number of urls, 50,000 by default. The name of the sitemap files is of
the form

    ${CHANNEL}${SEQUENCE}.xml.gz

The channel is produced as described above.

IMPORTANT: If two sources are configured with the same channel, or they use a dynamic channel function
that produces the same channel, the second source will overwrite sitemaps created by the first
source. It is up to the user to ensure that different sources produce different channels.
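
Because a channel collision silently overwrites files, a small pre-flight check in your own project can help. The sketch below is not part of site-mapper; `duplicateStaticChannels` is a hypothetical helper that only detects clashes between static string channels, since dynamic channel functions cannot be checked without the data.

```javascript
// Hypothetical pre-flight check: report sources that share a static channel name.
function duplicateStaticChannels(sources, sitemapConfig) {
  const seen = new Map(); // channel -> first source name that used it
  const duplicates = [];
  for (const [name, build] of Object.entries(sources)) {
    const { options = {} } = build(sitemapConfig);
    const channel = options.siteMap && options.siteMap.channel;
    if (typeof channel !== 'string') continue; // dynamic or missing channel: skip
    if (seen.has(channel)) duplicates.push({ channel, sources: [seen.get(channel), name] });
    else seen.set(channel, name);
  }
  return duplicates;
}

const exampleSources = {
  pages: () => ({ options: { siteMap: { channel: 'static' } } }),
  legal: () => ({ options: { siteMap: { channel: 'static' } } }),
  widgets: () => ({ options: { siteMap: { channel: (u) => u.category } } })
};
console.log(duplicateStaticChannels(exampleSources, {}));
```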

### Sitemap Index and Files ###

site-mapper will create a sitemap index and as many sitemap files as are required. The
sitemap files are gzipped. There is no way to turn gzipping off.

### Publishing Sitemaps to Search Engines ###

It is up to you how you expose the site maps generated by site-mapper to Google and other
search engines. Each company does this differently, so there are no default publishing
mechanisms in site-mapper.

## Tests ##

    npm run test

This should install all dependencies and run the test suite.

## Sources ##

### CsvSource ###

```coffeescript
options.options =
  columns: ['url', 'imageUrl', 'lastModified']
  relax_column_count: true|false
```

- columns: Specify the names of the columns.
- relax\_column\_count: true if you don't want an error raised if the number of columns in the csv data is different from what is configured.
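
The effect of these two options can be illustrated with a deliberately naive row mapper. This is not how CsvSource parses data (a real CSV parser also handles quoting and escaping); `rowToUrl` is a hypothetical function that only shows how column names attach to fields and what relaxing the column count means.

```javascript
// Hypothetical, naive row mapper: no quoting or escaping support.
function rowToUrl(line, columns, relaxColumnCount = false) {
  const fields = line.split(',');
  if (fields.length !== columns.length && !relaxColumnCount) {
    throw new Error(`expected ${columns.length} columns, got ${fields.length}`);
  }
  const url = {};
  columns.forEach((name, i) => {
    if (i < fields.length) url[name] = fields[i]; // missing trailing fields are left unset
  });
  return url;
}

const columns = ['url', 'imageUrl', 'lastModified'];
console.log(rowToUrl('/about,/img/about.png,2016-05-01', columns));
console.log(rowToUrl('/faq', columns, true)); // short row tolerated when relaxed
```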

### XmlSource ###

```coffeescript
options.options =
  urlTag: 'url',
  urlAttributes:
    lastmod: 'lastModified',
    changefreq: 'changeFrequency',
    priority: 'priority',
    loc: 'url'
```

- urlTag: Name of the tags containing url data in the xml document
- urlAttributes: Map of tag names within the urlTag markup to url attribute names.

Out of the box, the XmlSource is configured to read sitemap.xml files. This is useful
if you have existing sitemaps or sitemaps generated by some other method that you
wish to include with sitemaps generated from other data sources.
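
The urlAttributes mapping can be pictured as a rename step applied to the child tags of each url element. The sketch below assumes the tag values have already been extracted into a plain object; `toUrlObject` is a hypothetical helper, not the XmlSource implementation.

```javascript
// Default mapping from the options above: xml tag name -> url attribute name.
const urlAttributes = {
  lastmod: 'lastModified',
  changefreq: 'changeFrequency',
  priority: 'priority',
  loc: 'url'
};

// Hypothetical helper: rename already-extracted tag values into url attributes.
function toUrlObject(tagValues, mapping = urlAttributes) {
  const url = {};
  for (const [tag, attribute] of Object.entries(mapping)) {
    if (tagValues[tag] !== undefined) url[attribute] = tagValues[tag];
  }
  return url;
}

console.log(toUrlObject({ loc: 'http://www.mysite.com/about', lastmod: '2016-05-01', changefreq: 'weekly' }));
```

Tags that are not listed in the mapping are ignored, and mapped tags that are absent from the element are simply left off the url object.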

### JsonSource ###

The JSON source expects an array to exist in the json data that contains objects or
strings representing urls.

```coffeescript
options.options =
  filter: /regex/
  transformer: (obj) ->
  stringArray: true|false
```

- filter: A regex that is used to specify where the array(s) of urls are in the json document. See https://www.npmjs.com/package/stream-json#filter for how this works
- transformer: A function taking the raw url object or string from the json document and turning it into an object suitable to construct Url objects
- stringArray: Set to true if the array contains string urls rather than objects.
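
As an example of what a transformer might look like, the sketch below maps a raw item from a JSON document into an object with url-style properties. The raw item shape (`slug`, `category`, `updated_at`) and the `/widgets/...` path are invented for illustration; the property names on the result follow the url attributes used elsewhere in this document.

```javascript
// Hypothetical raw item shape from the JSON document: { slug, category, updated_at }.
function transformer(obj) {
  return {
    url: `/widgets/${obj.category}/${obj.slug}`,
    category: obj.category,
    lastModified: obj.updated_at
  };
}

// Source options as they might appear in a JavaScript configuration file.
const options = {
  filter: /urls\./,   // where the array lives in the document (see stream-json)
  transformer: transformer,
  stringArray: false  // the array holds objects, not plain strings
};

console.log(options.transformer({ slug: 'blue', category: 'gadgets', updated_at: '2016-05-01' }));
```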

## Source Input Streams ##
Sources can read data from local files (input: fileName: ''), from urls (input: url: '') or
from Streaming API Readable object instances (input: stream: new Foo()).

site-mapper ships with the following custom input stream classes:

```coffeescript
MultipleHttpInput({urls: ['url1', 'url2', ...]});
```

Each url is requested and its data parsed by the underlying Source. This currently only works reliably with CSV data.

```coffeescript
PaginatedHttpInput({url, pagination, format, stop})
  url: base url for http requests
  pagination:
    page: 'page'    # the name of the paging parameter
    per: 'per'      # the name of the limit or per page parameter
    perPage: 100    # the value of the limit or per page parameter
    increment: 1    # the amount to increase the page parameter value for each request
    start: 1        # the value of the page parameter for the first request
  format: 'json'    # if 'json', the paginated responses are wrapped in a json array [ {page1response}, {page2response}, ...]
  stop:
    string: 'foo'   # the string to look for in a response that indicates the pagination is over, i.e. there is no more data.
```
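
To show how the pagination options fit together, the sketch below builds the request urls that such settings would imply and checks a response body for the stop string. Both function names are hypothetical and this is not the PaginatedHttpInput implementation; it assumes the paging parameters are sent as query string parameters.

```javascript
// Hypothetical helper: the urls implied by the pagination settings for the first N pages.
function pageUrls(url, pagination, pageCount) {
  const { page, per, perPage, increment, start } = pagination;
  const urls = [];
  for (let i = 0; i < pageCount; i++) {
    const pageUrl = new URL(url);
    pageUrl.searchParams.set(page, String(start + i * increment));
    pageUrl.searchParams.set(per, String(perPage));
    urls.push(pageUrl.toString());
  }
  return urls;
}

// Hypothetical helper: pagination is over once the stop string appears in a response.
function isLastPage(responseBody, stop) {
  return responseBody.includes(stop.string);
}

const pagination = { page: 'page', per: 'per', perPage: 100, increment: 1, start: 1 };
console.log(pageUrls('http://api.mysite.com/widgets', pagination, 2));
console.log(isLastPage('{"urls": [], "status": "foo"}', { string: 'foo' })); // → true
```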