# sanitize-html

[![CircleCI](https://circleci.com/gh/apostrophecms/sanitize-html/tree/master.svg?style=svg)](https://circleci.com/gh/apostrophecms/sanitize-html/tree/master)

<a href="https://apostrophecms.com/"><img src="https://raw.github.com/apostrophecms/sanitize-html/master/logos/logo-box-madefor.png" align="right" /></a>

`sanitize-html` provides a simple HTML sanitizer with a clear API.

`sanitize-html` is tolerant. It is well suited for cleaning up HTML fragments such as those created by CKEditor and other rich text editors. It is especially handy for removing unwanted CSS when copying and pasting from Word.

`sanitize-html` allows you to specify the tags you want to permit, and the permitted attributes for each of those tags.

If a tag is not permitted, the contents of the tag are not discarded. There are
some exceptions to this, discussed below in the "Discarding the entire contents
of a disallowed tag" section.

The syntax of poorly closed `p` and `img` elements is cleaned up.

`href` attributes are validated to ensure they only contain `http`, `https`, `ftp` and `mailto` URLs. Relative URLs are also allowed. Ditto for `src` attributes.

Allowing particular URLs as the `src` of an `iframe` tag by filtering hostnames is also supported.

HTML comments are not preserved.

## Requirements

`sanitize-html` is intended for use with Node. That's pretty much it. All of its npm dependencies are pure JavaScript. `sanitize-html` is built on the excellent `htmlparser2` module.

## How to use

### Browser

*Think first: why do you want to use it in the browser?* Remember, *servers must never trust browsers.* You can't sanitize HTML for saving on the server anywhere else but on the server.

But, perhaps you'd like to display sanitized HTML immediately in the browser for preview. Or ask the browser to do the sanitization work on every page load. You can if you want to!

* Clone repository
* Run npm install and build / minify:

```bash
npm install
npm run minify
```

You'll find the minified and unminified versions of sanitize-html (with all its dependencies included) in the dist/ directory.

Use it in the browser:

```html
<html>
  <body>
    <script type="text/javascript" src="dist/sanitize-html.js"></script>
    <script type="text/javascript" src="demo.js"></script>
  </body>
</html>
```

```javascript
var html = "<strong>hello world</strong>";
console.log(sanitizeHtml(html));
console.log(sanitizeHtml("<img src=x onerror=alert('img') />"));
console.log(sanitizeHtml("console.log('hello world')"));
console.log(sanitizeHtml("<script>alert('hello world')</script>"));
```

### Node (Recommended)

Install module from console:

```bash
npm install sanitize-html
```

Use it in your node app:

```js
var sanitizeHtml = require('sanitize-html');

var dirty = 'some really tacky HTML';
var clean = sanitizeHtml(dirty);
```

That will allow our default list of allowed tags and attributes through. It's a nice set, but probably not quite what you want. So:

```js
// Allow only a super restricted set of tags and attributes
clean = sanitizeHtml(dirty, {
  allowedTags: [ 'b', 'i', 'em', 'strong', 'a' ],
  allowedAttributes: {
    'a': [ 'href' ]
  },
  allowedIframeHostnames: ['www.youtube.com']
});
```

Boom!

#### "I like your set but I want to add one more tag. Is there a convenient way?" Sure:

```js
clean = sanitizeHtml(dirty, {
  allowedTags: sanitizeHtml.defaults.allowedTags.concat([ 'img' ])
});
```

If you do not specify `allowedTags` or `allowedAttributes` our default list is applied. So if you really want an empty list, specify one.
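
If you need to extend the allowed attributes as well as the tags, one option is a small helper that copies the defaults instead of mutating them. This is only a sketch: `extendDefaults` is a hypothetical name, not part of `sanitize-html`, and the `defaults` object below is a shortened stand-in for `sanitizeHtml.defaults` so that the snippet runs on its own.

```javascript
// Hypothetical helper (not part of sanitize-html): returns a new options
// object that adds tags and per-tag attributes to a set of defaults.
function extendDefaults(defaults, extraTags, extraAttributes) {
  const allowedAttributes = Object.assign({}, defaults.allowedAttributes);
  Object.keys(extraAttributes).forEach(function (tag) {
    // Append to any attributes already allowed for this tag.
    allowedAttributes[tag] = (allowedAttributes[tag] || []).concat(extraAttributes[tag]);
  });
  return {
    allowedTags: defaults.allowedTags.concat(extraTags),
    allowedAttributes: allowedAttributes
  };
}

// Shortened stand-in for sanitizeHtml.defaults, for illustration only.
const defaults = {
  allowedTags: ['p', 'a'],
  allowedAttributes: { a: ['href', 'name', 'target'] }
};

const options = extendDefaults(defaults, ['img'], { img: ['src', 'alt'], a: ['rel'] });
console.log(options.allowedTags);         // [ 'p', 'a', 'img' ]
console.log(options.allowedAttributes.a); // [ 'href', 'name', 'target', 'rel' ]
```

In real code you would pass `sanitizeHtml.defaults` as the first argument and hand the result to `sanitizeHtml(dirty, options)`.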

#### "What are the default options?"

```js
allowedTags: [ 'h3', 'h4', 'h5', 'h6', 'blockquote', 'p', 'a', 'ul', 'ol',
  'nl', 'li', 'b', 'i', 'strong', 'em', 'strike', 'abbr', 'code', 'hr', 'br', 'div',
  'table', 'thead', 'caption', 'tbody', 'tr', 'th', 'td', 'pre', 'iframe' ],
disallowedTagsMode: 'discard',
allowedAttributes: {
  a: [ 'href', 'name', 'target' ],
  // We don't currently allow img itself by default, but this
  // would make sense if we did. You could add srcset here,
  // and if you do the URL is checked for safety
  img: [ 'src' ]
},
// Lots of these won't come up by default because we don't allow them
selfClosing: [ 'img', 'br', 'hr', 'area', 'base', 'basefont', 'input', 'link', 'meta' ],
// URL schemes we permit
allowedSchemes: [ 'http', 'https', 'ftp', 'mailto' ],
allowedSchemesByTag: {},
allowedSchemesAppliedToAttributes: [ 'href', 'src', 'cite' ],
allowProtocolRelative: true,
enforceHtmlBoundary: false
```

#### "What if I want to allow all tags or all attributes?"

Simple! Instead of leaving `allowedTags` or `allowedAttributes` out of the options, set either
one or both to `false`:

```js
allowedTags: false,
allowedAttributes: false
```

#### "What if I don't want to allow *any* tags?"

Also simple! Set `allowedTags` to `[]` and `allowedAttributes` to `{}`.

```js
allowedTags: [],
allowedAttributes: {}
```

### "What if I want disallowed tags to be escaped rather than discarded?"

If you set `disallowedTagsMode` to `discard` (the default), disallowed tags are discarded. Any text content is still included, and subtags are included or not depending on whether each individual subtag is allowed.

If you set `disallowedTagsMode` to `escape`, the disallowed tags are escaped rather than discarded. Any text or subtags are handled normally.

If you set `disallowedTagsMode` to `recursiveEscape`, the disallowed tags are escaped rather than discarded, and the same treatment is applied to all subtags, whether otherwise allowed or not.

### "What if I want to allow only specific values on some attributes?"

When configuring the attribute in `allowedAttributes`, use an object with the attribute `name` and an array of allowed `values`. In the following example `sandbox="allow-forms allow-modals allow-orientation-lock allow-pointer-lock allow-popups allow-popups-to-escape-sandbox allow-scripts"` would become `sandbox="allow-popups allow-scripts"`:

```js
allowedAttributes: {
  iframe: [
    {
      name: 'sandbox',
      multiple: true,
      values: ['allow-popups', 'allow-same-origin', 'allow-scripts']
    }
  ]
}
```

With `multiple: true`, several allowed values may appear in the same attribute, separated by spaces. Otherwise the attribute must exactly match one and only one of the allowed values.
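
The rule above can be modelled with a few lines of plain JavaScript. This is an illustration of the documented behaviour, not the library's source: `filterAttributeValue` is a hypothetical name, and returning `null` when nothing matches is an assumption made for the sketch.

```javascript
// Illustrative only: mimics the documented behaviour of an attribute rule
// of the form { name, multiple, values }. Returns the filtered value, or
// null if nothing is allowed (an assumption made for this sketch).
function filterAttributeValue(rule, value) {
  if (rule.multiple) {
    // Keep each space-separated token that appears in the allowed list.
    const kept = value.split(/\s+/).filter(function (token) {
      return rule.values.indexOf(token) !== -1;
    });
    return kept.length > 0 ? kept.join(' ') : null;
  }
  // Without `multiple`, the whole value must equal one allowed value.
  return rule.values.indexOf(value) !== -1 ? value : null;
}

const sandboxRule = {
  name: 'sandbox',
  multiple: true,
  values: ['allow-popups', 'allow-same-origin', 'allow-scripts']
};

console.log(filterAttributeValue(
  sandboxRule,
  'allow-forms allow-modals allow-orientation-lock allow-pointer-lock allow-popups allow-popups-to-escape-sandbox allow-scripts'
)); // allow-popups allow-scripts
```

Note that `allow-popups-to-escape-sandbox` is dropped even though it starts with an allowed value, because tokens are compared whole.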

### Wildcards for attributes

You can use the `*` wildcard to allow all attributes with a certain prefix:

```javascript
allowedAttributes: {
  a: [ 'href', 'data-*' ]
}
```

You can also use `*` as the name of a tag, to allow the listed attributes on any tag:

```javascript
allowedAttributes: {
  '*': [ 'href', 'align', 'alt', 'center', 'bgcolor' ]
}
```
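
To make the two kinds of wildcard concrete, here is a rough model of the lookup in plain JavaScript. It is a sketch of the behaviour described above, not the library's implementation; `allowedNamesFor` and `isAttributeAllowed` are hypothetical names.

```javascript
// Illustrative only: collect the attribute names allowed for a tag,
// combining the tag's own list with the `*` (any tag) list.
function allowedNamesFor(allowedAttributes, tag) {
  return (allowedAttributes[tag] || []).concat(allowedAttributes['*'] || []);
}

// Illustrative only: check an attribute name against a list of allowed
// names in which `*` stands for any run of characters.
function isAttributeAllowed(allowedNames, attributeName) {
  return allowedNames.some(function (pattern) {
    if (pattern.indexOf('*') === -1) {
      return pattern === attributeName;
    }
    // Escape regex metacharacters, then turn each `*` into `.*`.
    const source = pattern
      .split('*')
      .map(function (part) { return part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'); })
      .join('.*');
    return new RegExp('^' + source + '$').test(attributeName);
  });
}

const allowedAttributes = {
  a: ['href', 'data-*'],
  '*': ['align']
};

const namesForA = allowedNamesFor(allowedAttributes, 'a');
console.log(isAttributeAllowed(namesForA, 'data-id')); // true
console.log(isAttributeAllowed(namesForA, 'align'));   // true
console.log(isAttributeAllowed(namesForA, 'onclick')); // false
```
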

### Discarding text outside of `<html></html>` tags

Some text editing applications generate HTML to allow copying over to a web application. This HTML can sometimes include undesirable control characters after the closing `html` tag. By default sanitize-html will not discard these characters, instead returning them in the sanitized string. This behaviour can be modified using the `enforceHtmlBoundary` option.

Setting this option to `true` instructs sanitize-html to discard all characters outside of the `html` tag boundaries, that is, before the `<html>` tag and after the `</html>` tag.

```javascript
enforceHtmlBoundary: true
```

### htmlparser2 Options

`sanitize-html` is built on `htmlparser2`. By default the only option passed down is `decodeEntities: true`. You can set the options to pass by using the `parser` option.

```javascript
clean = sanitizeHtml(dirty, {
  allowedTags: ['a'],
  parser: {
    lowerCaseTags: true
  }
});
```

See the [htmlparser2 wiki](https://github.com/fb55/htmlparser2/wiki/Parser-options) for the full list of possible options.

### Transformations

What if you want to add or change an attribute? What if you want to transform one tag to another? No problem, it's simple!

The easiest way (will change all `ol` tags to `ul` tags):

```js
clean = sanitizeHtml(dirty, {
  transformTags: {
    'ol': 'ul',
  }
});
```

The most advanced usage:

```js
clean = sanitizeHtml(dirty, {
  transformTags: {
    'ol': function(tagName, attribs) {
      // My own custom magic goes here

      return {
        tagName: 'ul',
        attribs: {
          class: 'foo'
        }
      };
    }
  }
});
```

You can specify the `*` wildcard instead of a tag name to transform all tags.

There is also a helper method which should be enough for simple cases in which you want to change the tag and/or add some attributes:

```js
clean = sanitizeHtml(dirty, {
  transformTags: {
    'ol': sanitizeHtml.simpleTransform('ul', {class: 'foo'}),
  }
});
```

The `simpleTransform` helper method has 3 parameters:

```js
simpleTransform(newTag, newAttributes, shouldMerge)
```

The last parameter (`shouldMerge`) is set to `true` by default. When `true`, `simpleTransform` will merge the current attributes with the new ones (`newAttributes`). When `false`, all existing attributes are discarded.
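
To see what `shouldMerge` changes, here is a hand-written equivalent of the behaviour just described. It is a sketch, not the library's code: `simpleTransformSketch` is a hypothetical name, and it assumes that new attribute values win when a name exists on both sides.

```javascript
// Illustrative re-implementation of the documented behaviour; the real
// helper is sanitizeHtml.simpleTransform.
function simpleTransformSketch(newTag, newAttributes, shouldMerge) {
  // shouldMerge defaults to true when omitted.
  const merge = shouldMerge === undefined ? true : shouldMerge;
  return function (tagName, attribs) {
    return {
      tagName: newTag,
      // Merging keeps existing attributes; new values are assumed to win.
      attribs: merge
        ? Object.assign({}, attribs, newAttributes)
        : Object.assign({}, newAttributes)
    };
  };
}

const existing = { id: 'list', class: 'bar' };
const merged = simpleTransformSketch('ul', { class: 'foo' })('ol', existing);
const replaced = simpleTransformSketch('ul', { class: 'foo' }, false)('ol', existing);

console.log(merged);   // { tagName: 'ul', attribs: { id: 'list', class: 'foo' } }
console.log(replaced); // { tagName: 'ul', attribs: { class: 'foo' } }
```

A function of this shape can be used anywhere `transformTags` accepts a callback.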

You can also add or modify the text contents of a tag:

```js
clean = sanitizeHtml(dirty, {
  transformTags: {
    'a': function(tagName, attribs) {
      return {
        tagName: 'a',
        text: 'Some text'
      };
    }
  }
});
```

For example, you could transform a link element with missing anchor text:

```html
<a href="http://somelink.com"></a>
```

To a link with anchor text:

```html
<a href="http://somelink.com">Some text</a>
```

### Filters

You can provide a filter function to remove unwanted tags. Let's suppose we need to remove empty `a` tags like:

```html
<a href="page.html"></a>
```

We can do that with the following filter:

```javascript
sanitizeHtml(
  '<p>This is <a href="http://www.linux.org"></a><br/>Linux</p>',
  {
    exclusiveFilter: function(frame) {
      return frame.tag === 'a' && !frame.text.trim();
    }
  }
);
```

The `frame` object supplied to the callback provides the following attributes:

- `tag`: The tag name, e.g. `'img'`.
- `attribs`: The tag's attributes, e.g. `{ src: "/path/to/tux.png" }`.
- `text`: The text content of the tag.
- `mediaChildren`: Immediate child tags that are likely to represent self-contained media (e.g., `img`, `video`, `picture`, `iframe`). See the `mediaTags` variable in `src/index.js` for the full list.
- `tagPosition`: The index of the tag's position in the result string.
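
Because the filter is an ordinary function of `frame`, it can be written and tested on its own. The sketch below removes empty `p` and `a` tags but keeps those that wrap media. It assumes `mediaChildren` is an array of tag names; `removeEmptyContainers` is a hypothetical name, and the frames are built by hand for illustration.

```javascript
// Returns true (meaning "remove this tag") for paragraphs and links that
// contain neither text nor media such as an image.
// Assumption: frame.mediaChildren is an array of tag names.
function removeEmptyContainers(frame) {
  const isContainer = frame.tag === 'p' || frame.tag === 'a';
  const hasText = frame.text.trim().length > 0;
  const hasMedia = frame.mediaChildren.length > 0;
  return isContainer && !hasText && !hasMedia;
}

// Hand-built frames for illustration; sanitize-html supplies the real ones.
const emptyLink = { tag: 'a', attribs: { href: 'page.html' }, text: '  ', mediaChildren: [] };
const imageLink = { tag: 'a', attribs: { href: 'page.html' }, text: '', mediaChildren: ['img'] };

console.log(removeEmptyContainers(emptyLink)); // true
console.log(removeEmptyContainers(imageLink)); // false
```

To use it, pass the function as the `exclusiveFilter` option.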

You can also process all text content with a provided filter function. Let's say we want an ellipsis instead of three dots.

```html
<p>some text...</p>
```

We can do that with the following filter:

```javascript
sanitizeHtml(
  '<p>some text...</p>',
  {
    textFilter: function(text, tagName) {
      // Skip anchor tags: return their text unchanged
      if (['a'].indexOf(tagName) > -1) return text;

      return text.replace(/\.\.\./, '&hellip;');
    }
  }
);
```

Note that the text passed to the `textFilter` method is already escaped for safe display as HTML. You may add markup and use entity escape sequences in your `textFilter`.

### Iframe Filters

If you would like to allow `iframe` tags but want to control which domains are allowed through, you can provide an array of hostnames to allow as iframe sources. This array is a property of the options object passed as an argument to the `sanitizeHtml` function.

Each iframe `src` in the HTML passed to the function is checked against this array, and only `src` URLs whose hostname is in the array are kept. The URL must be correctly formatted (with a valid hostname); otherwise the module will strip the `src` from the iframe.

Make sure to pass the full hostname, not just the domain you wish to allow, e.g.:

```javascript
allowedIframeHostnames: ['www.youtube.com', 'player.vimeo.com']
```

You may also specify whether or not to allow relative URLs as iframe sources.

```javascript
allowIframeRelativeUrls: true
```

Note that if unspecified, relative URLs will be allowed by default if no hostname filter is provided but removed by default if a hostname filter is provided.
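
The interaction between the two options is easier to follow as a decision function. The sketch below is a simplified model of the rule just described, not the library's source: `isIframeSrcAllowed` is a hypothetical name, the placeholder base URL exists only so that protocol-relative URLs can be parsed, and URL scheme checks are a separate step that is not modelled here.

```javascript
// Illustrative only: a simplified model of the documented rule.
function isIframeSrcAllowed(src, options) {
  const hostnames = options.allowedIframeHostnames;
  const hasScheme = /^[a-z][a-z0-9+.-]*:/i.test(src);
  const isRelative = !hasScheme && src.indexOf('//') !== 0;

  if (isRelative) {
    // An explicit setting wins; otherwise relative URLs pass only when
    // no hostname filter is configured.
    return options.allowIframeRelativeUrls !== undefined
      ? options.allowIframeRelativeUrls
      : !hostnames;
  }
  if (!hostnames) {
    return true;
  }
  try {
    // The base is only used to resolve protocol-relative URLs.
    const hostname = new URL(src, 'https://placeholder.invalid').hostname;
    return hostnames.indexOf(hostname) !== -1;
  } catch (e) {
    return false; // unparseable URL
  }
}

const filtered = { allowedIframeHostnames: ['www.youtube.com', 'player.vimeo.com'] };

console.log(isIframeSrcAllowed('https://www.youtube.com/embed/nykIhs12345', filtered)); // true
console.log(isIframeSrcAllowed('https://www.youtube.net/embed/nykIhs12345', filtered)); // false
console.log(isIframeSrcAllowed('/embed/local', filtered));                              // false
console.log(isIframeSrcAllowed('/embed/local', {}));                                    // true
```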

**Remember that the `iframe` tag must be allowed as well as the `src` attribute.**

For example:

```javascript
clean = sanitizeHtml('<p><iframe src="https://www.youtube.com/embed/nykIhs12345"></iframe></p>', {
  allowedTags: [ 'p', 'em', 'strong', 'iframe' ],
  allowedClasses: {
    'p': [ 'fancy', 'simple' ],
  },
  allowedAttributes: {
    'iframe': ['src']
  },
  allowedIframeHostnames: ['www.youtube.com', 'player.vimeo.com']
});
```

will pass through as safe whereas:

```javascript
clean = sanitizeHtml('<p><iframe src="https://www.youtube.net/embed/nykIhs12345"></iframe></p>', {
  allowedTags: [ 'p', 'em', 'strong', 'iframe' ],
  allowedClasses: {
    'p': [ 'fancy', 'simple' ],
  },
  allowedAttributes: {
    'iframe': ['src']
  },
  allowedIframeHostnames: ['www.youtube.com', 'player.vimeo.com']
});
```

or

```javascript
clean = sanitizeHtml('<p><iframe src="https://www.vimeo/video/12345"></iframe></p>', {
  allowedTags: [ 'p', 'em', 'strong', 'iframe' ],
  allowedClasses: {
    'p': [ 'fancy', 'simple' ],
  },
  allowedAttributes: {
    'iframe': ['src']
  },
  allowedIframeHostnames: ['www.youtube.com', 'player.vimeo.com']
});
```

will return an empty iframe tag.

### Allowed CSS Classes

If you wish to allow specific CSS classes on a particular element, you can do so with the `allowedClasses` option. Any other CSS classes are discarded.

This implies that the `class` attribute is allowed on that element.

```javascript
// Allow only a restricted set of CSS classes and only on the p tag
clean = sanitizeHtml(dirty, {
  allowedTags: [ 'p', 'em', 'strong' ],
  allowedClasses: {
    'p': [ 'fancy', 'simple' ]
  }
});
```

### Allowed CSS Styles

If you wish to allow specific CSS _styles_ on a particular element, you can do that with the `allowedStyles` option. Declare each permitted CSS property with an array of regular expressions that its value may match. Specific elements inherit the whitelisted properties of the global (`*`) entry. Any other CSS styles are discarded.

**You must also use `allowedAttributes`** to activate the `style` attribute for the relevant elements. Otherwise this feature will never come into play.

**When constructing regular expressions, don't forget `^` and `$`.** It's not enough to say "the string should contain this." It must also say "and only this."

**URLs in inline styles are NOT filtered by any mechanism other than your regular expression.**

```javascript
clean = sanitizeHtml(dirty, {
  allowedTags: ['p'],
  allowedAttributes: {
    'p': ["style"],
  },
  allowedStyles: {
    '*': {
      // Match HEX and RGB
      'color': [/^#(0x)?[0-9a-f]+$/i, /^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$/],
      'text-align': [/^left$/, /^right$/, /^center$/],
      // Match any number with px, em, or %
      'font-size': [/^\d+(?:px|em|%)$/]
    },
    'p': {
      'font-size': [/^\d+rem$/]
    }
  }
});
```
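
The point about `^` and `$` is easy to check with plain regular expressions. In this small demonstration the `hostile` string is a made-up value for a `color` declaration; nothing here calls `sanitize-html`.

```javascript
const unanchored = /#[0-9a-f]+/i; // "contains a hex colour somewhere"
const anchored = /^#[0-9a-f]+$/i; // "is a hex colour and nothing else"

// Hypothetical hostile value for a `color` declaration.
const hostile = 'expression(alert(1)) #fff';

console.log(unanchored.test(hostile)); // true: the value would be accepted
console.log(anchored.test(hostile));   // false: the value is rejected
console.log(anchored.test('#ff0000')); // true
```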

### Allowed URL schemes

By default we allow the following URL schemes in cases where `href`, `src`, etc. are allowed:

```js
[ 'http', 'https', 'ftp', 'mailto' ]
```

You can override this if you want to:

```javascript
sanitizeHtml(
  // teeny-tiny valid transparent GIF in a data URL
  '<img src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" />',
  {
    allowedTags: [ 'img', 'p' ],
    allowedSchemes: [ 'data', 'http' ]
  }
);
```

You can also allow a scheme for a particular tag only:

```javascript
allowedSchemes: [ 'http', 'https' ],
allowedSchemesByTag: {
  img: [ 'data' ]
}
```

And you can forbid the use of protocol-relative URLs (starting with `//`) to access another site using the current protocol, which is allowed by default:

```javascript
allowProtocolRelative: false
```
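
As a rough mental model of how `allowedSchemes` and `allowProtocolRelative` combine, consider the sketch below. It is a simplification written for this README, not the library's implementation: `isUrlSchemeAllowed` is a hypothetical name, and per-tag schemes (`allowedSchemesByTag`) are deliberately left out.

```javascript
// Illustrative only: a simplified model of the scheme rules above.
function isUrlSchemeAllowed(url, options) {
  const trimmed = url.trim();
  const schemeMatch = /^([a-z][a-z0-9+.-]*):/i.exec(trimmed);
  if (schemeMatch) {
    // Schemes are compared case-insensitively.
    return options.allowedSchemes.indexOf(schemeMatch[1].toLowerCase()) !== -1;
  }
  if (trimmed.indexOf('//') === 0) {
    // Protocol-relative URL such as //example.com/logo.png
    return options.allowProtocolRelative;
  }
  return true; // plain relative URL
}

const schemeDefaults = {
  allowedSchemes: ['http', 'https', 'ftp', 'mailto'],
  allowProtocolRelative: true
};

console.log(isUrlSchemeAllowed('https://example.com/', schemeDefaults));       // true
console.log(isUrlSchemeAllowed('javascript:alert(1)', schemeDefaults));        // false
console.log(isUrlSchemeAllowed('//example.com/logo.png', schemeDefaults));     // true
console.log(isUrlSchemeAllowed('/images/logo.png', schemeDefaults));           // true
```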

### Discarding the entire contents of a disallowed tag

Normally, with a few exceptions, if a tag is not allowed, all of the text within it is preserved, and so are any allowed tags within it.

The exceptions are:

`style`, `script`, `textarea`, `option`

If you wish to replace this list, for instance to discard whatever is found
inside a `noscript` tag, use the `nonTextTags` option:

```javascript
nonTextTags: [ 'style', 'script', 'textarea', 'option', 'noscript' ]
```

Note that if you use this option you are responsible for stating the entire list. This gives you the power to retain the content of `textarea`, if you want to.

The content still gets escaped properly, with the exception of the `script` and
`style` tags. *Allowing either `script` or `style` leaves you open to XSS
attacks. Don't do that* unless you have good reason to trust their origin.
sanitize-html will log a warning if these tags are allowed, which can be
disabled with the `allowVulnerableTags: true` option.

### Choose what to do with disallowed tags

Instead of discarding a disallowed tag and keeping only its text, you may enable escaping of the entire content:

```javascript
disallowedTagsMode: 'escape'
```

This will transform `<disallowed>content</disallowed>` to `&lt;disallowed&gt;content&lt;/disallowed&gt;`.

Valid values are: `'discard'` (default), `'escape'` (escape the tag) and `'recursiveEscape'` (to escape the tag and all its content).

## About P'unk Avenue and Apostrophe

`sanitize-html` was created at [P'unk Avenue](http://punkave.com) for use in ApostropheCMS, an open-source content management system built on node.js. If you like `sanitize-html` you should definitely [check out apostrophecms.org](http://apostrophecms.org).

## Changelog

[The changelog is now in a separate file for readability.](https://github.com/apostrophecms/sanitize-html/blob/master/CHANGELOG.md)

## Support

Feel free to open issues on [github](http://github.com/apostrophecms/sanitize-html).

<a href="http://apostrophecms.com/"><img src="https://raw.github.com/apostrophecms/sanitize-html/master/logos/logo-box-builtby.png" /></a>