# sax js

A sax-style parser for XML and HTML.

Designed with [node](http://nodejs.org/) in mind, but should work fine in
the browser or other CommonJS implementations.

## What This Is

* A very simple tool to parse through an XML string.
* A stepping stone to a streaming HTML parser.
* A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML
  docs.

## What This Is (probably) Not

* An HTML Parser - That's a fine goal, but this isn't it.  It's just
  XML.
* A DOM Builder - You can use it to build an object model out of XML,
  but it doesn't do that out of the box.
* XSLT - No DOM = no querying.
* 100% Compliant with (some other SAX implementation) - Most SAX
  implementations are in Java and do a lot more than this does.
* An XML Validator - It does a little validation when in strict mode, but
  not much.
* A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic
  masochism.
* A DTD-aware Thing - Fetching DTDs is a much bigger job.

## Regarding `<!DOCTYPE`s and `<!ENTITY`s

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything
with that. If you want to listen to the `ondoctype` event, and then fetch
the doctypes, and read the entities and add them to `parser.ENTITIES`, then
be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass
through unmolested.

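If you do want to go down that road, here is a minimal sketch of the "read the entities" step. The helper name `addDoctypeEntities` and its regular expression are made up for this example and are not part of sax-js. It only understands internal declarations with a quoted literal value; parameter entities and `SYSTEM`/`PUBLIC` entities are skipped, and nothing is fetched.

```javascript
// Hypothetical helper, not part of sax-js: copy internal
// <!ENTITY name "value"> declarations from a doctype string onto a map.
function addDoctypeEntities(doctype, entities) {
  var decl = /<!ENTITY\s+([^\s%"']+)\s+(?:"([^"]*)"|'([^']*)')\s*>/g;
  var match;
  while ((match = decl.exec(doctype)) !== null) {
    // match[2] holds a double-quoted value, match[3] a single-quoted one.
    entities[match[1]] = match[2] !== undefined ? match[2] : match[3];
  }
  return entities;
}

var found = addDoctypeEntities(
  " doc [ <!ENTITY copy \"(c)\"> <!ENTITY company 'Example Corp'> ]",
  {}
);
console.log(found);
```

With a real parser, the same helper could presumably be wired up as `parser.ondoctype = function (doctype) { addDoctypeEntities(doctype, parser.ENTITIES) }`, so that later references such as `&copy;` are known by the time they show up.
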
## Usage

    var sax = require("./lib/sax"),
      strict = true, // set to false for html-mode
      parser = sax.parser(strict);

    parser.onerror = function (e) {
      // an error happened.
    };
    parser.ontext = function (t) {
      // got some text.  t is the string of text.
    };
    parser.onopentag = function (node) {
      // opened a tag.  node has "name" and "attributes"
    };
    parser.onattribute = function (attr) {
      // an attribute.  attr has "name" and "value"
    };
    parser.onend = function () {
      // parser stream is done, and ready to have more stuff written to it.
    };

    parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

    // stream usage
    // takes the same options as the parser
    var fs = require("fs"),
      options = {}, // settings bag, see "Arguments"
      saxStream = require("sax").createStream(strict, options)
    saxStream.on("error", function (e) {
      // unhandled errors will throw, since this is a proper node
      // event emitter.
      console.error("error!", e)
      // clear the error
      this._parser.error = null
      this._parser.resume()
    })
    saxStream.on("opentag", function (node) {
      // same object as above
    })
    // pipe is supported, and it's readable/writable
    // same chunks coming in also go out.
    fs.createReadStream("file.xml")
      .pipe(saxStream)
      .pipe(fs.createWriteStream("file-copy.xml"))

## Arguments

Pass the following arguments to the parser function.  All are optional.

`strict` - Boolean. Whether or not to be a jerk. Default: `false`.

`opt` - Object bag of settings regarding string formatting.  All default to `false`.

Settings supported:

* `trim` - Boolean. Whether or not to trim text and comment nodes.
* `normalize` - Boolean. If true, then turn any whitespace into a single
  space.
* `lowercase` - Boolean. If true, then lowercase tag names and attribute names
  in loose mode, rather than uppercasing them.
* `xmlns` - Boolean. If true, then namespaces are supported.
* `position` - Boolean. If false, then don't track line/col/position.

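As a rough mental model of `trim` and `normalize`, the sketch below applies the same two transformations to a plain string. It is an approximation written for this README, not the parser's own code, and the name `applyTextOptions` is invented.

```javascript
// Illustrative only: approximates the effect of the `trim` and `normalize`
// settings on a text node using ordinary string operations.
function applyTextOptions(opt, text) {
  if (opt.trim) text = text.trim();
  if (opt.normalize) text = text.replace(/\s+/g, " ");
  return text;
}

var raw = "  Hello,\n\t world  ";
console.log(JSON.stringify(applyTextOptions({}, raw)));
console.log(JSON.stringify(applyTextOptions({ trim: true }, raw)));
// Both settings together: prints "Hello, world"
console.log(JSON.stringify(applyTextOptions({ trim: true, normalize: true }, raw)));
```

To get this behavior from the parser itself, pass the settings as the second argument, for example `sax.parser(false, { trim: true, normalize: true })`.
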
## Methods

`write` - Write bytes onto the stream. You don't have to do this all at
once. You can keep writing as much as you want.

`close` - Close the stream. Once closed, no more data may be written until
it is done processing the buffer, which is signaled by the `end` event.

`resume` - To gracefully handle errors, assign a listener to the `error`
event. Then, when the error is taken care of, you can call `resume` to
continue parsing. Otherwise, the parser will not continue while in an error
state.

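The three methods tend to show up together. Below is a small sketch of feeding a document in pieces and recovering from errors along the way. `feedChunks` and `collectErrors` are hypothetical helpers written for this example; they assume only the `write`, `close` and `resume` methods and the `error` member described in this README.

```javascript
// Hypothetical helpers; `parser` is any object exposing the methods above.

// Write each chunk in order, then close the parser.
function feedChunks(parser, chunks) {
  chunks.forEach(function (chunk) {
    parser.write(chunk);
  });
  parser.close();
}

// Record errors instead of stopping: clear the error, then resume.
function collectErrors(parser) {
  var errors = [];
  parser.onerror = function (e) {
    errors.push(e);
    parser.error = null;
    parser.resume();
  };
  return errors;
}
```

A typical call would look like `var errors = collectErrors(parser); feedChunks(parser, ["<xml>Hel", "lo</xml>"])`, after which `errors` holds whatever the parser complained about.
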
## Members

At all times, the parser object will have the following members:

`line`, `column`, `position` - Indications of the position in the XML
document where the parser currently is looking.

`startTagPosition` - Indicates the position where the current tag starts.

`closed` - Boolean indicating whether or not the parser can be written to.
If it's `true`, then wait for the `ready` event to write again.

`strict` - Boolean indicating whether or not the parser is a jerk.

`opt` - Any options passed into the constructor.

`tag` - The current tag being dealt with.

And a bunch of other stuff that you probably shouldn't touch.

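The position members are mostly useful for error reporting. A tiny sketch, with a made-up helper name, that turns them into a label:

```javascript
// Hypothetical helper: format the parser's position members for a log line.
// The values are printed as-is, with no assumption about their base.
function describePosition(parser) {
  return "line " + parser.line +
    ", column " + parser.column +
    " (offset " + parser.position + ")";
}
```

Since handlers run in the this-context of the parser, something like `parser.onerror = function (e) { console.error(e.message, describePosition(this)) }` should work inside an error handler.
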
## Events

All events emit with a single argument. To listen to an event, assign a
function to `on<eventname>`. Functions get executed in the this-context of
the parser object. The list of supported events is also in the exported
`EVENTS` array.

When using the stream interface, assign handlers using the EventEmitter
`on` function in the normal fashion.

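Because handlers are just `on<eventname>` properties, they can be assigned in a loop. The sketch below records every event it is told about. `recordEvents` is an invented name, and the parser can be any object.

```javascript
// Hypothetical helper: assign an on<eventname> handler for each name and
// collect what gets emitted, in order.
function recordEvents(parser, names) {
  var seen = [];
  names.forEach(function (name) {
    parser["on" + name] = function (arg) {
      seen.push({ event: name, arg: arg });
    };
  });
  return seen;
}
```

Passing the exported array, as in `recordEvents(parser, sax.EVENTS)`, would log everything the parser emits. Note that this also replaces any `onerror` handler, so errors get recorded but not cleared.
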
`error` - Indication that something bad happened. The error will be hanging
out on `parser.error`, and must be deleted before parsing can continue. By
listening to this event, you can keep an eye on that kind of stuff. Note:
this happens *much* more in strict mode. Argument: instance of `Error`.

`text` - Text node. Argument: string of text.

`doctype` - The `<!DOCTYPE` declaration. Argument: doctype string.

`processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument:
object with `name` and `body` members. Attributes are not parsed, as
processing instructions have implementation dependent semantics.

`sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>`
would trigger this kind of event. This is a weird thing to support, so it
might go away at some point. SAX isn't intended to be used to parse SGML,
after all.

`opentag` - An opening tag. Argument: object with `name` and `attributes`.
In non-strict mode, tag names are uppercased, unless the `lowercase`
option is set.  If the `xmlns` option is set, then it will contain
namespace binding information on the `ns` member, and will have
`local`, `prefix`, and `uri` members.

`closetag` - A closing tag. In loose mode, tags are auto-closed if their
parent closes. In strict mode, well-formedness is enforced. Note that
self-closing tags will have `closetag` emitted immediately after `opentag`.
Argument: tag name.

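Since every `opentag` is eventually matched by a `closetag`, a stack is enough to turn the event stream into a tree. This is a sketch of one way to do it, not something sax-js ships. `createTreeBuilder` and the node shape (`name`, `attributes`, `children`) are assumptions made for the example.

```javascript
// Hypothetical tree builder driven by opentag, text and closetag events.
function createTreeBuilder() {
  var root = { name: null, attributes: {}, children: [] };
  var stack = [root];
  return {
    root: root,
    onopentag: function (node) {
      var element = { name: node.name, attributes: node.attributes, children: [] };
      // Attach to the innermost open element, then descend into the new one.
      stack[stack.length - 1].children.push(element);
      stack.push(element);
    },
    ontext: function (text) {
      stack[stack.length - 1].children.push(text);
    },
    onclosetag: function () {
      // Never pop the synthetic root, even if an extra closetag arrives.
      if (stack.length > 1) stack.pop();
    }
  };
}
```

The handlers don't rely on `this`, so they can be copied straight onto a parser (`parser.onopentag = builder.onopentag`, and so on) and `builder.root.children` read once `end` fires.
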
`attribute` - An attribute node.  Argument: object with `name` and `value`.
In non-strict mode, attribute names are uppercased, unless the `lowercase`
option is set.  If the `xmlns` option is set, it will also contain namespace
information.

`comment` - A comment node.  Argument: the string of the comment.

`opencdata` - The opening tag of a `<![CDATA[` block.

`cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get
quite large, this event may fire multiple times for a single block, if it
is broken up into multiple `write()`s. Argument: the string of random
character data.

`closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block.

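If you want each CDATA block as one string, the pieces have to be glued back together between `opencdata` and `closecdata`. A minimal sketch, with an invented `createCdataCollector` helper and callback:

```javascript
// Hypothetical helper: buffer cdata chunks and hand over the whole block
// when the closing ]]> is seen.
function createCdataCollector(onBlock) {
  var buffer = null; // null means "not inside a CDATA block"
  return {
    onopencdata: function () {
      buffer = "";
    },
    oncdata: function (chunk) {
      if (buffer !== null) buffer += chunk;
    },
    onclosecdata: function () {
      if (buffer !== null) onBlock(buffer);
      buffer = null;
    }
  };
}
```

Assign the three handlers to the parser and `onBlock` gets called once per block, however many `write()`s it was spread across. Keep in mind that this gives up the streaming benefit for very large blocks.
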
`opennamespace` - If the `xmlns` option is set, then this event will
signal the start of a new namespace binding.

`closenamespace` - If the `xmlns` option is set, then this event will
signal the end of a namespace binding.

`end` - Indication that the closed stream has ended.

`ready` - Indication that the stream has reset, and is ready to be written
to.

`script` - In non-strict mode, `<script>` tags trigger a `"script"`
event, and their contents are not checked for special XML characters.
If you pass the `noscript: true` option, then this behavior is suppressed.

## Reporting Problems

It's best to write a failing test if you find an issue.  I will always
accept pull requests with failing tests if they demonstrate intended
behavior, but it is very hard to figure out what issue you're describing
without a test.  Writing a test is also the best way for you yourself
to figure out if you really understand the issue you think you have with
sax-js.