1 | site-mapper
|
2 | ===========
|
3 | Site Map Generation in Node.js
|
4 |
|
5 | ## Installation ##
|
6 |
|
7 | This module is intended to be used as a dependency in a website-specific
|
8 | site map building project. Add the module to the "dependencies" section
|
9 | of a package.json file:
|
10 |
|
```json
{
  "dependencies": {
    "site-mapper": ">= 2.0.1"
  }
}
```
|
18 |
|
19 | There is nothing stopping you from installing it locally and editing the embedded
|
20 | configuration files, but that is not the intent.
|
21 |
|
22 | npm install --save site-mapper
|
23 |
|
24 | ## Running site-mapper ##
|
25 |
|
26 | Create a directory to hold your site map generation configuration. This
|
27 | directory will hold all the files needed to tell site-mapper what to
|
28 | create.
|
29 |
|
30 | ### Dependencies ###
|
31 | Create a package.json file similar to the following:
|
32 |
|
```json
{
  "author": {
    "name": "YOUR NAME HERE",
    "email": "YOUR EMAIL HERE"
  },
  "name": "my-website-site-maps",
  "description": "sitemap generation for mysite.com",
  "version": "0.0.1",
  "homepage": "",
  "keywords": [
    "sitemap"
  ],
  "dependencies": {
    "site-mapper": ">= 2.0.1"
  },
  "engines": {
    "node": "*"
  }
}
```
|
54 |
|
55 | ### Configuration ###
|
56 |
|
57 | Create a directory called ./config. For each environment you will generate sitemaps
|
58 | for (possibly you only have one, production), create a javascript or
|
59 | coffeescript file in the config directory named for the environment:
|
60 |
|
61 | ./config/production.js, ./config/production.coffee or ./config/production.es6
|
62 |
|
63 | #### ES6 Configuration Files ####
|
64 | If you wish to use es6 for your configuration, save the configuration using the .es6
|
65 | file extension:
|
66 |
|
67 | ./config/production.es6
|
68 |
|
69 | And add the following to your package.json as dependencies:
|
70 |
|
```json
{
  "dependencies": {
    "babel-register": "^6.9.0",
    "babel-preset-es2015": "^6.6.0",
    "babel-preset-stage-0": "^6.5.0"
  }
}
```
|
80 |
|
81 | You will also have to create a .babelrc file in the root of your project with the
|
82 | following contents:
|
83 |
|
```json
{
  "presets": ["es2015", "stage-0"]
}
```
|
89 |
|
90 | At this time, any es6 configuration file should export the configuration object
|
91 | using module.exports instead of es6 export statements.
|
92 |
|
93 | #### Coffeescript Configuration Files ####
|
94 |
|
95 | If you want to use coffeescript for configuration, you will have to add
|
96 | the coffee-script module as a dependency:
|
97 |
|
```json
{
  "dependencies": {
    "coffee-script": "^1.10.0"
  }
}
```
|
105 |
|
106 | #### Configuration Format ####
|
107 |
|
108 | The configuration file can contain any of the following keys. The
|
109 | values below are defaults that will be used unless overridden in your
|
110 | configuration file.
|
111 |
|
```coffee
config = {}
config.sources = {}
config.sitemaps = {}
config.logConfig =
  name: "sitemapper",
  level: "debug"
config.defaultSitemapConfig =
  targetDirectory: "#{process.cwd()}/tmp/sitemaps/#{config.env}"
  sitemapIndex: "sitemap.xml"
  sitemapRootUrl: "http://www.mysite.com"
  sitemapFileDirectory: "/sitemaps"
  maxUrlsPerFile: 50000
  urlBase: "http://www.mysite.com"
config.sitemapIndexHeader = '<?xml version="1.0" encoding="UTF-8"?><sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
config.sitemapHeader = '<?xml version="1.0" encoding="UTF-8"?><urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" xmlns:geo="http://www.google.com/geo/schemas/sitemap/1.0" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9/">'
config.defaultUrlFormatter = (options) ->
  (href) ->
    if '/' == href
      options.urlBase
    else if href && href.length && href[0] == '/'
      "#{options.urlBase}#{href}"
    else if href && href.length && href.match(/^https?:\/\//)
      href
    else
      if href.length
        "#{options.urlBase}/#{href}"
      else
        options.urlBase
```
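
For readers writing their configuration in plain JavaScript, the default url formatter above can be rendered roughly as follows. This is a hand translation for illustration, not code taken from the module; the sample hrefs are made up.

```javascript
// JavaScript rendering of the CoffeeScript default formatter shown above.
const defaultUrlFormatter = (options) => (href) => {
  if (href === "/") {
    return options.urlBase;                      // site root
  } else if (href && href.length && href[0] === "/") {
    return `${options.urlBase}${href}`;          // absolute path
  } else if (href && href.length && /^https?:\/\//.test(href)) {
    return href;                                 // already a full url
  } else if (href.length) {
    return `${options.urlBase}/${href}`;         // relative path
  }
  return options.urlBase;                        // empty string
};

const format = defaultUrlFormatter({ urlBase: "http://www.mysite.com" });
console.log(format("/"));                        // http://www.mysite.com
console.log(format("/about"));                   // http://www.mysite.com/about
console.log(format("faq"));                      // http://www.mysite.com/faq
console.log(format("https://cdn.mysite.com/a")); // https://cdn.mysite.com/a
```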
|
142 |
|
143 | The sitemaps object contains named keys pointing at objects that define a
|
144 | particular sitemap. The sitemap definition can contain (and override) any
|
145 | of the keys in the config.defaultSitemapConfig object.
|
146 | The produced sitemap consists of a sitemap index xml file referencing one or
|
147 | more gzipped sitemap xml files, created from urls produced by the config.sources objects.
|
148 | The configuration allows for defining one or more sitemaps to create. For example,
|
149 | you might configure one sitemap for the http version of a site and another sitemap
|
150 | for the https version of the site. Or you might define one sitemap for the
|
151 | www subdomain and another for the foobar subdomain. By default, all sources
|
152 | defined in config.sources are used to generate urls for all sitemaps. To use
|
153 | different sources for different sitemaps, provide in each sitemap configuration object
|
154 | a sources key like one of the following:
|
155 |
|
```coffee
# Specify which sources to include. All others are ignored
sources:
  includes: ['source1', 'source2', ...]
```

or

```coffee
# Specify which sources to exclude. All others are included
sources:
  excludes: ['source1', 'source2', ...]
```
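
The selection rule can be pictured with a small sketch. The helper name `selectSourceNames` is hypothetical and only mirrors the behavior described above (all sources by default, narrowed by includes or excludes); it is not the module's internal code.

```javascript
// Hypothetical helper: which source names does a given sitemap use?
function selectSourceNames(allSources, sitemapConfig) {
  const names = Object.keys(allSources);
  const rule = sitemapConfig.sources;
  if (!rule) return names; // default: every configured source
  if (rule.includes) return names.filter((n) => rule.includes.includes(n));
  if (rule.excludes) return names.filter((n) => !rule.excludes.includes(n));
  return names;
}

// Placeholder sources; real entries would be functions returning {type, options}.
const sources = { staticUrls: () => ({}), serviceUrls: () => ({}), blogUrls: () => ({}) };

console.log(selectSourceNames(sources, {}));
console.log(selectSourceNames(sources, { sources: { includes: ["staticUrls"] } }));
console.log(selectSourceNames(sources, { sources: { excludes: ["staticUrls"] } }));
```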
|
167 |
|
168 | The sources object contains arbitrarily named keys pointing at functions that take
|
169 | a single sitemapConfig object and return an object with the
|
170 | following keys: type, options. The options key points at a source configuration object
|
171 | that can contain the following keys: input, options, siteMap, cached, ignoreErrors.
|
172 |
|
173 | The input parameter, sitemapConfig, is an object
|
174 | formed by merging the config.defaultSitemapConfig object with the specific sitemap
|
175 | configuration (more about this later).
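
To illustrate the merge, the sketch below combines a subset of the defaults with one sitemap definition. It assumes a shallow merge in which the specific sitemap's keys win; the module's actual merge strategy may differ.

```javascript
// A subset of the defaults listed earlier.
const defaultSitemapConfig = {
  sitemapIndex: "sitemap.xml",
  sitemapRootUrl: "http://www.mysite.com",
  sitemapFileDirectory: "/sitemaps",
  maxUrlsPerFile: 50000,
  urlBase: "http://www.mysite.com"
};

// One entry from config.sitemaps.
const mainSitemap = {
  urlBase: "http://staging.mysite.com",
  sitemapIndex: "sitemap_index.xml"
};

// Assumption: shallow merge, later arguments override earlier ones.
const sitemapConfig = Object.assign({}, defaultSitemapConfig, mainSitemap);

console.log(sitemapConfig.urlBase);        // http://staging.mysite.com
console.log(sitemapConfig.maxUrlsPerFile); // 50000
```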
|
176 |
|
177 | In the returned object, the type is either one of the built in source
|
178 | type classes (see below) or a site specific class derived from the SitemapTransformer base class.
|
179 |
|
180 | A minimal config might be:
|
181 |
|
```coffee
{StaticSetSource, JsonSource} = require 'site-mapper'
appConfig =
  sitemaps:
    main:
      sitemapRootUrl: "http://staging.mysite.com"
      urlBase: "http://staging.mysite.com"
      sitemapIndex: "sitemap_index.xml"
      targetDirectory: "#{process.cwd()}/tmp/sitemaps/#{config.env}/http"
  sources:
    staticUrls: (sitemapConfig) ->
      type: StaticSetSource
      options:
        siteMap:
          channel: 'static'
          changefreq: 'weekly'
          priority: 1
        options:
          urls: [
            '/',
            '/about',
            '/faq',
            '/jobs'
          ]
    serviceUrls: (sitemapConfig) ->
      type: JsonSource
      options:
        siteMap:
          changefreq: 'weekly'
          priority: 0.8
          channel: (url) -> url.category
          urlAugmenter: (url) ->
            # urlBase already includes the scheme, so do not prepend "http://"
            url.url = "#{sitemapConfig.urlBase}/widgets/#{url.category}/#{url.url}"
        input:
          url: "http://api.mysite.com/widgets"
        options:
          filter: /urls\./

module.exports = appConfig
```
|
222 |
|
223 | ### Logging ###
|
224 | site-mapper logs using bunyan format. To get pretty printed logs while running, just pipe
|
225 | the site-mapper output through the bunyan command line tool.
|
226 |
|
227 | ### Running the Code ###
|
228 |
|
229 | Finally, putting it all together, you can generate the sitemaps as follows:
|
230 |
|
1. Install all the dependencies:

    ```
    rm -rf node_modules;
    npm install
    ```
1. Run the generator:

    ```
    NODE_ENV=staging ./node_modules/.bin/site-mapper | ./node_modules/site-mapper/node_modules/.bin/bunyan
    ```
|
240 |
|
241 | Below is a Makefile that encapsulates the above recipe. It can be run
|
242 | by running:
|
243 |
|
244 | make setup generate
|
245 |
|
```make
usage :
	@echo ''
	@echo 'Core tasks           : Description'
	@echo '-------------------- : -----------'
	@echo 'make setup           : Install dependencies'
	@echo 'make generate        : Generate the sitemaps'
	@echo ''

SITEMAPPER=./node_modules/.bin/site-mapper
NPM_ARGS=
NODE_ENV=staging

setup :
	@rm -rf node_modules
	@echo npm $(NPM_ARGS) install
	@npm $(NPM_ARGS) install

generate :
	@rm -rf tmp
	@NODE_ENV=$(NODE_ENV) $(SITEMAPPER)
```
|
268 |
|
269 | ## Sitemap Generation ##
|
270 |
|
271 | The site-mapper module views the sitemap generation process as follows:
|
272 |
|
273 | SiteMapper creates one or more Source objects, pipes each one to a Sitemap, which
|
274 | then pipes to one or more SitemapFile objects, depending on the number of urls the
|
275 | source produces and the configured maximum number of urls per SitemapFile (50,000
|
276 | by default).
|
277 |
|
278 |
|
    +------------+         +------------+         +--------------+        +-------------+
    | SiteMapper |         |   Source   |         |   Sitemap    |        | SitemapFile |
    |------------| creates |------------| creates |--------------|  adds  |-------------|
    |            +-------->|            |-------->|              |------->|             |
    |            |         |            |  urls   |              |  urls  |             |
    +------------+         +------------+         +--------------+        +-------------+
|
285 |
|
286 | ### Sources ###
|
287 |
|
288 | Sources are Javascript streaming API Transform stream implementations that operate in object mode
|
289 | and produce url objects from data of a specific format. There are four included Source implementations:
|
290 |
|
291 | 1. StaticSetSource - This source is configured with a static list of url strings
|
292 | in a configuration file.
|
293 | 1. CsvSource - Produces urls from CSV data.
|
294 | 1. JsonSource - Produces urls from JSON data.
|
295 | 1. XmlSource - Produces urls from XML data.
|
296 |
|
297 | Sources are configured with an input that produces raw text data of the right format. Source inputs can
|
298 | be either files, urls or an instantiated Readable stream object. The source reads the input
|
299 | and converts it into Url objects. The Url objects may then be filtered based on the url
|
300 | object properties and are then provided to downstream consumers in the Transform implementation.
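
The general shape of an object-mode Transform stream can be sketched with Node's built-in stream module. `LineSource` and `toUrlObject` are hypothetical names used only for illustration; they are not classes shipped with site-mapper, and real sources derive from the module's own base class.

```javascript
const { Readable, Transform } = require("stream");

// Pure mapping step: one line of text in, one url object (or null) out.
function toUrlObject(line) {
  const href = String(line).trim();
  return href.length > 0 ? { url: href } : null;
}

// Hypothetical source: converts raw lines into url objects.
class LineSource extends Transform {
  constructor() {
    super({ objectMode: true }); // chunks are objects, not Buffers
  }

  _transform(line, _encoding, callback) {
    const url = toUrlObject(line);
    if (url) this.push(url); // blank lines are filtered out
    callback();
  }
}

// Usage: pipe raw input through the source and consume url objects.
Readable.from(["/about", "  ", "/faq"])
  .pipe(new LineSource())
  .on("data", (url) => console.log(url));
```

The filtering step lives inside `_transform`, so consumers further down the pipe only ever see url objects that passed the filter.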
|
301 |
|
302 | #### Source configuration details ####
|
303 |
|
304 | The configuration object for sources has the following form:
|
305 |
|
```coffee
config =
  ignoreErrors: true|false
  input: {}
  options: {}
  siteMap: {}
  cached: {}
```
|
314 |
|
* ignoreErrors
  Set to true if you want to log and ignore errors. If set to false (default) an error in
  the input stream aborts the entire process.
* input object
  This object can have one of the following keys:
    1. fileName - full path of the file containing the url data
    2. url - URL that will produce the url data
    3. stream - An instantiated streaming API Readable object that when read, produces url data
* options object
  Defines source specific options (see below)
* siteMap object
  Defines sitemap information specific to the source, like the priority of urls it produces.
* cached object
  If present, turns on caching of the data produced by the input so that subsequent runs or
  even other sources in the configuration can use it. Contains the cacheFile attribute pointing
  at the path for the cached data and maxAge attribute, a time in milliseconds the cached data
  should be considered fresh.
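
One way to read the cached object's maxAge is as a freshness test against the cache file's modification time. The sketch below is an assumption about how such a check could work, with a hypothetical helper name and file path; it is not the module's own caching code.

```javascript
const fs = require("fs");

// Hypothetical check: is the cache file younger than maxAge milliseconds?
function isCacheFresh(cached, now = Date.now()) {
  if (!cached || !cached.cacheFile) return false;
  try {
    const { mtimeMs } = fs.statSync(cached.cacheFile);
    return now - mtimeMs < cached.maxAge;
  } catch (err) {
    return false; // a missing or unreadable cache file is never fresh
  }
}

// Example cached object: keep the data for one hour (path is a placeholder).
const cached = { cacheFile: "/tmp/widgets-cache.json", maxAge: 60 * 60 * 1000 };
console.log(isCacheFresh(cached));
```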
|
332 |
|
333 | #### URL Channel ####
|
334 |
|
335 | Each url is associated with a channel. The channel can be a static string or
|
336 | derived from the url at runtime. In either case, the channel is specified by setting
|
337 | the channel attribute on the source configuration object's siteMap object. It can
|
338 | be either a string (static channel) or a function taking a Url object and returning
|
339 | a string.
|
340 |
|
```coffee
sources =
  staticChannel: (sitemapConfig) ->
    siteMap:
      channel: 'foo'
  dynamicChannel: (sitemapConfig) ->
    siteMap:
      channel: (url) -> url.category
```
|
350 |
|
351 | The channel is used to name individual sitemap files where urls produced by the sources will
|
352 | end up. Files of the form ${CHANNEL}${SEQUENCE}.xml.gz will be created in the target directory.
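
Resolving a channel that may be either a string or a function comes down to a type check. The helper `resolveChannel` below is a hypothetical illustration of that rule, not a function exported by site-mapper.

```javascript
// Hypothetical helper: return the channel for a url, static or derived.
function resolveChannel(siteMap, url) {
  const { channel } = siteMap;
  return typeof channel === "function" ? channel(url) : channel;
}

const url = { url: "/widgets/blue/42", category: "blue" };

console.log(resolveChannel({ channel: "foo" }, url));             // foo
console.log(resolveChannel({ channel: (u) => u.category }, url)); // blue
```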
|
353 |
|
354 | ### Sitemaps ###
|
355 |
|
356 | A Sitemap is created for each URL channel. As urls are added to their corresponding
|
357 | Sitemap, it creates sequentially numbered sitemap files, each containing a
|
358 | configurable number of urls, 50,000 by default. The name of the sitemap files is of
|
359 | the form
|
360 |
|
361 | ${CHANNEL}${SEQUENCE}.xml.gz
|
362 |
|
363 | The channel is produced as described above.
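
As a rough illustration of how the url count and the per-file limit determine the file names, consider the sketch below. The function name is hypothetical, and the assumption that the sequence starts at 1 is not confirmed by this document.

```javascript
// Hypothetical helper: list the file names a channel would need.
// Assumption: the sequence number starts at 1.
function sitemapFileNames(channel, urlCount, maxUrlsPerFile = 50000) {
  const fileCount = Math.ceil(urlCount / maxUrlsPerFile);
  const names = [];
  for (let sequence = 1; sequence <= fileCount; sequence++) {
    names.push(`${channel}${sequence}.xml.gz`);
  }
  return names;
}

// 120,000 urls at 50,000 per file need three files.
console.log(sitemapFileNames("static", 120000));
```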
|
364 |
|
365 | IMPORTANT: If two sources are configured with the same channel, or they use a dynamic channel function
|
366 | that produces the same channel, the second source will overwrite sitemaps created by the first
|
367 | source. It is up to the user to ensure that different sources produce different channels.
|
368 |
|
369 | ### Sitemap Index and Files ###
|
370 |
|
371 | site-mapper will create a sitemap index and as many sitemap files as are required. The
|
372 | sitemap files are gzipped. There is no way to turn gzipping off.
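
Because the files are always gzipped, inspecting one from a script means decompressing it first. The sketch below uses Node's built-in zlib module on an in-memory stand-in for a generated file; with a real file you would pass the result of `fs.readFileSync` on its path to `gunzipSync` instead.

```javascript
const zlib = require("zlib");

// Stand-in for the bytes of a generated sitemap file.
const xml = '<?xml version="1.0" encoding="UTF-8"?><urlset></urlset>';
const gzipped = zlib.gzipSync(Buffer.from(xml, "utf8"));

// Decompress back to readable XML.
const restored = zlib.gunzipSync(gzipped).toString("utf8");
console.log(restored === xml); // true
```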
|
373 |
|
374 | ### Publishing Sitemaps to Search Engines ###
|
375 |
|
376 | It is up to you how you expose the site maps generated by site-mapper to Google and other
|
377 | search engines. Each company does this differently, so there are no default publishing
|
378 | mechanisms in site-mapper.
|
379 |
|
380 | ## Tests ##
|
381 |
|
382 | npm run test
|
383 |
|
384 | This should install all dependencies and run the test suite.
|
385 |
|
386 |
|
387 | ## Sources ##
|
388 |
|
389 | ### CsvSource ###
|
390 |
|
```coffeescript
options.options =
  columns: ['url', 'imageUrl', 'lastModified']
  relax_column_count: true|false
```
|
396 |
|
397 | - columns: Specify the names of the columns.
|
398 | - relax\_column\_count: true if you don't want an error raised if the number of columns in the csv data is different from what is configured.
|
399 |
|
400 | ### XmlSource ###
|
401 |
|
```coffeescript
options.options =
  urlTag: 'url',
  urlAttributes:
    lastmod: 'lastModified',
    changefreq: 'changeFrequency',
    priority: 'priority',
    loc: 'url'
```
|
411 |
|
412 | - urlTag: Name of tags containing url data in the xml document
|
413 | - urlAttributes: map of tag name to url attribute names within the urlTag markup.
|
414 |
|
415 | Out of the box, the XmlSource is configured to read sitemap.xml files. This is useful
|
416 | if you have existing sitemaps or sitemaps generated by some other method that you
|
417 | wish to include with sitemaps generated from other data sources.
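
The urlAttributes mapping can be read as a rename table from tag names to url attribute names. The sketch below applies the default mapping to an already-parsed url tag; the helper name and the sample values are hypothetical, and the module's real XML handling is stream based.

```javascript
// Default mapping from the XmlSource options above.
const urlAttributes = {
  lastmod: "lastModified",
  changefreq: "changeFrequency",
  priority: "priority",
  loc: "url"
};

// Hypothetical helper: rename parsed tag values to url attribute names.
function mapTagToUrl(tagValues, mapping = urlAttributes) {
  const url = {};
  for (const [tagName, attribute] of Object.entries(mapping)) {
    if (tagValues[tagName] !== undefined) url[attribute] = tagValues[tagName];
  }
  return url;
}

// Values as they might appear inside one <url> element of a sitemap.xml.
const parsed = { loc: "http://www.mysite.com/about", lastmod: "2016-05-01", changefreq: "weekly" };
console.log(mapTagToUrl(parsed));
```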
|
418 |
|
419 | ### JsonSource ###
|
420 |
|
421 | The JSON source expects an array to exist in the json data that contains objects or
|
422 | strings representing urls.
|
423 |
|
```coffeescript
options.options =
  filter: /regex/
  transformer: (obj) ->
  stringArray: true|false
```
|
430 |
|
431 | - filter: A regex that is used to specify where the array(s) of urls are in the json document. See https://www.npmjs.com/package/stream-json#filter for how this works
|
432 | - transformer: A function taking the raw url object or string from the json document and turning it into an object suitable to construct Url objects
|
433 | - stringArray: Set to true if the array contains string urls rather than objects.
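
A transformer is easiest to understand with sample data. In the sketch below the JSON field names (path, updated) are invented for illustration, and the output attribute names (url, lastModified) are borrowed from the other source examples in this README; check them against the Url object your version expects.

```javascript
// Hypothetical service response containing an array of url records.
const payload = {
  urls: [
    { path: "/widgets/1", updated: "2016-05-01" },
    { path: "/widgets/2", updated: "2016-05-03" }
  ]
};

// Options in the shape described above.
const options = {
  filter: /urls\./, // selects the entries of the "urls" array
  transformer: (obj) => ({ url: obj.path, lastModified: obj.updated })
};

// Applying the transformer by hand to show its effect on each record.
const urls = payload.urls.map(options.transformer);
console.log(urls);
```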
|
434 |
|
435 | ## Source Input Streams ##
|
436 | Sources can read data from local files (input: fileName: ''), from urls: (input: url: '') or
|
437 | from Streaming API Readable object instances (input: stream: new Foo()).
|
438 |
|
439 | site-mapper ships with the following custom input stream classes:
|
440 |
|
```coffeescript
MultipleHttpInput({urls: ['url1', 'url2', ...]});
```
|
444 |
|
445 | Each url is requested and its data parsed by the underlying Source. Really only works with CSV data right now.
|
446 |
|
```coffeescript
PaginatedHttpInput({url, pagination, format, stop})
  url: base url for http requests
  pagination:
    page: 'page' # the name of the paging parameter
    per: 'per' # the name of the limit or per page parameter
    perPage: 100 # the value of the limit or per page parameter
    increment: 1 # the amount to increase the page parameter value for each request
    start: 1 # the value of the page parameter for the first request
  format: 'json' # if 'json', the paginated responses are wrapped in a json array [ {page1response}, {page2response}, ...]
  stop:
    string: 'foo' # the string to look for in a response that indicates the pagination is over, ie there is no more data.
```
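
To make the pagination parameters concrete, the sketch below shows how a sequence of request urls and the stop condition could be derived from them. Both helpers are hypothetical, and placing the parameters in the query string is an assumption; the class may build its requests differently.

```javascript
// Hypothetical helper: url for the n-th request (pageIndex starts at 0).
function pageUrl(baseUrl, pagination, pageIndex) {
  const url = new URL(baseUrl);
  const page = pagination.start + pageIndex * pagination.increment;
  url.searchParams.set(pagination.page, String(page));
  url.searchParams.set(pagination.per, String(pagination.perPage));
  return url.toString();
}

// Hypothetical helper: does this response signal the end of the data?
function isLastPage(body, stop) {
  return body.includes(stop.string);
}

const pagination = { page: "page", per: "per", perPage: 100, increment: 1, start: 1 };

console.log(pageUrl("http://api.mysite.com/widgets", pagination, 0));
console.log(pageUrl("http://api.mysite.com/widgets", pagination, 2));
console.log(isLastPage('{"urls": [], "status": "foo"}', { string: "foo" }));
```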
|