# Front matter

The CoffeeNode `CHR` module (short for 'character') is a library for handling characters within NodeJS in a
Unicode-compliant, Astral-Plane-aware fashion. It includes functions to split texts into characters,
iterate over characters, and convert between a number of different character representations.

## Installation

Install as

    npm install coffeenode-chr

Require as, e.g.

    CHR = require 'coffeenode-chr'

**Table of Contents**

- [API](#api)
  - [Overview](#overview)
  - [Public Members](#public-members)
    - [analyze = ( cid_hint, options ) ->](#analyze---cid_hint-options---)
    - [as_chr = ( cid_hint, options ) ->](#as_chr---cid_hint-options---)
    - [as_cid = ( cid_hint, options ) ->](#as_cid---cid_hint-options---)
    - [as_csg = ( cid_hint, options ) ->](#as_csg---cid_hint-options---)
    - [as_fncr = ( cid_hint, options ) ->](#as_fncr---cid_hint-options---)
    - [as_ncr = ( cid_hint, options ) ->](#as_ncr---cid_hint-options---)
    - [as_range_name = ( cid_hint, options ) ->](#as_range_name---cid_hint-options---)
    - [as_rsg = ( cid_hint, options ) ->](#as_rsg---cid_hint-options---)
    - [as_sfncr = ( cid_hint, options ) ->](#as_sfncr---cid_hint-options---)
    - [as_xncr = ( cid_hint, options ) ->](#as_xncr---cid_hint-options---)
    - [chrs_from_text = ( text, options ) ->](#chrs_from_text---text-options---)
    - [chunks_from_text = ( text, options ) ->](#chunks_from_text---text-options---)
    - [html_from_text = ( text, options ) ->](#html_from_text---text-options---)
    - [cid_from_chr = ( chr, options ) ->](#cid_from_chr---chr-options---)
    - [csg_cid_from_chr = ( chr, options ) ->](#csg_cid_from_chr---chr-options---)
    - [validate_is_cid = ( x ) ->](#validate_is_cid---x---)
    - [validate_is_csg = ( x ) ->](#validate_is_csg---x---)
  - [Private Members](#private-members)
- [Background](#background)
  - [JavaScript & Unicode](#javascript--unicode)
- [Glossary](#glossary)
  - [Character Representations](#character-representations)
    - [Numeric Character Representation (NCR)](#numeric-character-representation-ncr)
    - [Unicode Character Representation (UCR)](#unicode-character-representation-ucr)
    - [Friendly NCRs](#friendly-ncrs)
    - [Extended NCRs](#extended-ncrs)
  - [Other Terms](#other-terms)
    - [Character Identifier (CID)](#character-identifier-cid)
    - [Codepoint (Codeposition)](#codepoint-codeposition)
    - [Codeunit](#codeunit)
    - [Character Set Sigil (CSG)](#character-set-sigil-csg)

> *generated with [DocToc](http://doctoc.herokuapp.com/)*


# API

## Overview

Most methods of the present library accept up to two arguments, `cid_hint` and `options`. CID is short for
'character identifier', the integer that each codepoint in a coded character set is associated with. In
JavaScript, the `String::charCodeAt` method returns a given character's code, its CID (for codepoints above
`0xffff` it returns a single code unit only; see [JavaScript & Unicode](#javascript--unicode)).
Besides numbers, methods with this signature also accept non-empty strings.

* Methods may be called with one or two arguments; the first is known as the 'CID hint', the second as
  'options'.

* The CID hint may be a number or a text; if it is a number, it is understood as a CID; if it
  is a text, its interpretation is subject to the `options[ 'input' ]` setting: the text will
  be scrutinized for its first character (according to `input`); the rest of the text will be ignored.

* `options` must be a Plain Old Dictionary (a JS object) with the optional members `input`, `output`, and
  `csg`.

* `options[ 'input' ]` is *only* observed if the CID hint is a text; it governs which kinds of character
  references are recognized in the text. `input` may be one of `plain`, `ncr`, or `xncr`; it defaults to
  `plain` (no character references will be recognized).

* `options[ 'csg' ]` sets the character set sigil. If `csg` is set in the options, it will override
  whatever CSG `CHR.csg_cid_from_chr` would otherwise return. In other words, if you call
  `CHR.as_sfncr '&jzr#xe100;', input: 'xncr', csg: 'u'`, you will get `u-e100`, with the numerically
  equivalent codepoint from the `u` (Unicode) character set.

* Before CSG and CID are returned, they will be validated for plausibility.
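
How the `input` setting changes what counts as the 'first character' of a text hint can be illustrated
with a small JavaScript sketch. The patterns and the `firstChr` helper below are assumptions made for
illustration only; they are not the library's own `_first_chr_matcher_*` expressions, and the shape of the
CSG part (lowercase letters and digits) is a guess.

```javascript
// Illustrative only: hypothetical patterns, not taken from coffeenode-chr.
// Each pattern tries a character reference first (where the mode allows it),
// then a surrogate pair, then any single code unit.
const FIRST_CHR = {
  plain: /^(?:[\ud800-\udbff][\udc00-\udfff]|[^])/,
  ncr: /^(?:&#(?:x[0-9a-fA-F]+|[0-9]+);|[\ud800-\udbff][\udc00-\udfff]|[^])/,
  xncr: /^(?:&[a-z0-9]*#(?:x[0-9a-fA-F]+|[0-9]+);|[\ud800-\udbff][\udc00-\udfff]|[^])/,
};

// Return the first 'character' of `text` under the given input mode,
// or null for an empty text.
function firstChr(text, input = 'plain') {
  const match = FIRST_CHR[input].exec(text);
  return match === null ? null : match[0];
}

console.log(firstChr('&#x24563;'));            // &
console.log(firstChr('&#x24563;', 'ncr'));     // &#x24563;
console.log(firstChr('&jzr#xe100;', 'ncr'));   // &
console.log(firstChr('&jzr#xe100;', 'xncr'));  // &jzr#xe100;
```

In this sketch, a reference that the current mode does not recognize simply falls through to the last
alternative, so its leading `&` is treated as an ordinary character.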

## Public Members

### `analyze = ( cid_hint, options ) ->`

The many-tricks-pony of `coffeenode-chr`. It will return an
object describing multiple aspects of the codepoint in question. Examples:

```coffeescript
# CHR.analyze 'helo world'

{ chr:    'h',                # The first character of the text.
  csg:    'u',                # CSG 'u', i.e. a Unicode character.
  cid:    104,                # Codepoint: 104 = 0x68.
  fncr:   'u-latn-68',        # The 'friendly NCR'; see 'rsg', below.
  sfncr:  'u-68',             # The 'short friendly NCR'.
  ncr:    '&#x68;',           # HTML-compatible Numeric Character Reference (NCR).
  xncr:   '&#x68;',           # For codepoints outside of Unicode, this will differ from the NCR.
  rsg:    'u-latn' }          # The 'range sigil', i.e. Unicode block identifier.
```

When using Numerical Character References (NCRs), it is important to choose the right input mode (namely,
`ncr` or `xncr`):

```coffeescript
# CHR.analyze '&#x24563;' # same as `CHR.analyze '&#x24563;', input: 'plain'`

{ chr:    '&',
  csg:    'u',
  cid:    38,
  fncr:   'u-latn-26',
  sfncr:  'u-26',
  ncr:    '&#x26;',
  xncr:   '&#x26;',
  rsg:    'u-latn' }

# CHR.analyze '&#x24563;', input: 'ncr'

{ chr:    '𤕣',
  csg:    'u',
  cid:    148835,
  fncr:   'u-cjk-xb-24563',
  sfncr:  'u-24563',
  ncr:    '&#x24563;',
  xncr:   '&#x24563;',
  rsg:    'u-cjk-xb' }

# CHR.analyze '&#x24563;', input: 'xncr'

{ chr:    '𤕣',
  csg:    'u',
  cid:    148835,
  fncr:   'u-cjk-xb-24563',
  sfncr:  'u-24563',
  ncr:    '&#x24563;',
  xncr:   '&#x24563;',
  rsg:    'u-cjk-xb' }
```

In the above examples, the NCR `&#x24563;` was successfully decoded with `input` set to `ncr` or `xncr`,
while in `plain` mode, `&` counts as the first character of the text.

Two more examples show how to reference characters outside of Unicode. First, let's analyze an 'extended
NCR' using mode `ncr`. Since `&jzr#xe100;` violates the rules for NCRs, it is not recognized and is
treated as ordinary text (the same way browsers treat it). The first character of that text
is an `&` ampersand, so all we get is a description of that character:

```coffeescript
# CHR.analyze '&jzr#xe100;', input: 'ncr'

{ chr:    '&',
  csg:    'u',
  cid:    38,
  fncr:   'u-latn-26',
  sfncr:  'u-26',
  ncr:    '&#x26;',
  xncr:   '&#x26;',
  rsg:    'u-latn' }
```

When we switch to mode `xncr`, the extended NCR is properly detected:

```coffeescript
# CHR.analyze '&jzr#xe100;', input: 'xncr'

{ chr: '&jzr#xe100;',       # the XNCR has been recognized.
  csg: 'jzr',               # The CSG identifies the Jizura Character Set (JZRCS).
  cid: 57600,               # CID is 57600 = 0xe100
  fncr: 'jzr-fig-e100',     # The FNCR tells us that the codepoint is in the 'fig' block of the JZRCS.
  sfncr: 'jzr-e100',
  ncr: '&#xe100;',          # When rendering to a web page, we must use standard NCRs.
  xncr: '&jzr#xe100;',      # In plain texts, databases and so on, we may wish to use this notation.
  rsg: 'jzr-fig' }
```

This information may be used to properly render the codepoint in question, say, as
`<span class='jzr-fig'>&#xe100;</span>`,
along with suitable CSS rules that tell the browser which font to use.
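
As a minimal sketch of that last step, the following JavaScript builds such a `<span>` from an object
shaped like the analysis result shown above. The `spanFromAnalysis` helper is hypothetical and not part of
the library; only the `rsg` and `ncr` field names are taken from the example output.

```javascript
// Hypothetical helper: wrap the standard NCR in a span whose class is the
// range sigil, so that CSS can select a suitable font per range.
function spanFromAnalysis(analysis) {
  return "<span class='" + analysis.rsg + "'>" + analysis.ncr + "</span>";
}

// An object shaped like the result of the `xncr` example above.
const analysis = { csg: 'jzr', cid: 0xe100, ncr: '&#xe100;', rsg: 'jzr-fig' };

console.log(spanFromAnalysis(analysis)); // <span class='jzr-fig'>&#xe100;</span>
```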


### `as_chr = ( cid_hint, options ) ->`

### `as_cid = ( cid_hint, options ) ->`

### `as_csg = ( cid_hint, options ) ->`

### `as_fncr = ( cid_hint, options ) ->`

### `as_ncr = ( cid_hint, options ) ->`

### `as_range_name = ( cid_hint, options ) ->`

### `as_rsg = ( cid_hint, options ) ->`

### `as_sfncr = ( cid_hint, options ) ->`

### `as_xncr = ( cid_hint, options ) ->`

### `chrs_from_text = ( text, options ) ->`

Given a `text` and `options`, return a list of characters. The only relevant setting of `options` is `mode`,
which governs (as in most methods) whether NCRs/XNCRs are recognized. When applying a CHR method such as
`CHR.analyze` to each element of the resulting list (with the same `mode` setting), a complete analysis of
a text can be performed.

### `chunks_from_text = ( text, options ) ->`

Given a `text` and `options` (of which only `mode` is relevant here), return a list of `CHR/chunk`
objects (as returned by `CHR._new_chunk`) that describe stretches of characters with codepoints in the
same 'range' (Unicode block).
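
The idea of chunking by range can be sketched in plain JavaScript. Everything below is an assumption made
for illustration: the two-entry `TOY_RANGES` table stands in for the real block data, and the
`{ rsg, text }` shape is invented, since the actual layout of `CHR/chunk` objects is not described here.

```javascript
// Hypothetical stand-in for the library's block table (two ranges only).
const TOY_RANGES = [
  { rsg: 'u-latn', first: 0x0000, last: 0x007f },
  { rsg: 'u-cjk', first: 0x4e00, last: 0x9fff },
];

// Look up the range sigil for a codepoint; null if no toy range matches.
function rsgOf(cid) {
  const range = TOY_RANGES.find((r) => cid >= r.first && cid <= r.last);
  return range ? range.rsg : null;
}

// Group consecutive characters that share a range into one chunk.
function toyChunksFromText(text) {
  const chunks = [];
  for (const chr of text) { // iterates by codepoint, not by code unit
    const rsg = rsgOf(chr.codePointAt(0));
    const last = chunks[chunks.length - 1];
    if (last !== undefined && last.rsg === rsg) last.text += chr;
    else chunks.push({ rsg, text: chr });
  }
  return chunks;
}

console.log(toyChunksFromText('ab古文c'));
```

For the sample text this yields three chunks: `ab` in `u-latn`, `古文` in `u-cjk`, and `c` in `u-latn`.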

### `html_from_text = ( text, options ) ->`

### `cid_from_chr = ( chr, options ) ->`

### `csg_cid_from_chr = ( chr, options ) ->`

### `validate_is_cid = ( x ) ->`

### `validate_is_csg = ( x ) ->`

<!-- **cid\_from\_fncr = ( ) ->**

**cid\_from\_ncr = ( ) ->**

**cid\_from\_xncr = ( ) ->**
 -->

## Private Members

```coffeescript
_analyze = ( csg, cid ) ->
_as_fncr = ( csg, cid ) ->
_as_range_name = ( csg, cid ) ->
_as_rsg = ( csg, cid ) ->
_as_sfncr = ( csg, cid ) ->
_as_xncr = ( csg, cid ) ->
_chr_csg_cid_from_chr = ( chr, mode ) ->
_csg_cid_from_hint = ( cid_hint, options ) ->
_names_and_ranges_by_csg = unicode_blocks_data[ 'names-and-ranges-by-csg' ]
_unicode_chr_from_cid = ( cid ) ->

_csg_matcher
_first_chr_matcher_ncr
_first_chr_matcher_plain
_first_chr_matcher_xncr
_ncr_csg_cid_matcher
_ncr_matcher
_ncr_splitter
_nonsurrogate_matcher
_plain_splitter
_surrogate_matcher
_xncr_csg_cid_matcher
_xncr_matcher
_xncr_splitter
```

# Background

<!-- ## Unicode characters & codepoints -->

## JavaScript & Unicode

**When JavaScript was conceived** in 1995 (it was standardized as ECMAScript in 1997), the Unicode standard
was still in its infancy.
Early design plans for a Universal Character Set had argued that 2<sup>16</sup> or even a mere
2<sup>14</sup> codepoints should be more than sufficient to represent all characters in current use. It was
only in 1996 that the Unicode Consortium acknowledged the need for a far bigger number of codepoints, and
hence pushed the highest valid codepoint position from `0xffff` to `0x10ffff`; there are, consequently,
1,114,112 codepoints available in Unicode from version 2.0 onwards. In the vernacular, Unicode codepoints
beyond `0xffff` (the frontier of the Basic Multilingual Plane, BMP) are known as '32bit characters', and
the wilderness out there as 'The Astral Planes'.

**In order to remain backwards compatible** with those applications which had been implemented under the
premise that each Unicode codepoint would be representable within 16 bits, a **'Surrogate Character'**
mechanism was introduced along with the extension of the character set. To understand Surrogate
Characters, imagine the same had happened to ASCII, a 7bit character set that, when stored in 8bit bytes,
leaves all codepoints using the 8<sup>th</sup> bit undefined. We could go and make those 128 codepoints
surrogate characters, stipulating that the 64 codepoints in the range `0x80`...`0xbf` shall serve as 'High
(or Leading) Surrogates', those in the range `0xc0`...`0xff` as 'Low (or Trailing) Surrogates', and that
High and Low Surrogates shall always appear as pairs. In essence, every surrogate pair `HL` then represents
a number written out in a positional system with the base 64 (with the offsets `0x80` and `0xc0`). That is,
we can project it onto codepoints beyond `0xff` (255) by saying that a surrogate pair `HL` is a way to access
codepoint `( H - 0x80 ) * 64 + ( L - 0xc0 ) + 0x100`, allowing us to express 4096 additional codepoints
in the range `0x100`...`0x10ff` from within the confines of an 8bit character set!
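
The thought experiment can be made concrete with a few lines of JavaScript. The scheme is purely
hypothetical (nothing like it exists in ASCII or in this library), and the sketch uses an offset of `0x100`
so that the first pair lands on the first codepoint past `0xff`.

```javascript
// Decode a pair of toy 8-bit 'surrogates' into a codepoint beyond 0xff.
// High surrogates: 0x80...0xbf, low surrogates: 0xc0...0xff (64 values each).
function decodeToySurrogatePair(high, low) {
  if (high < 0x80 || high > 0xbf) throw new RangeError('not a high surrogate: ' + high);
  if (low < 0xc0 || low > 0xff) throw new RangeError('not a low surrogate: ' + low);
  return (high - 0x80) * 64 + (low - 0xc0) + 0x100;
}

console.log(decodeToySurrogatePair(0x80, 0xc0).toString(16)); // 100
console.log(decodeToySurrogatePair(0xbf, 0xff).toString(16)); // 10ff
```

The two calls show the lowest and the highest codepoint reachable this way; 64 × 64 = 4096 pairs exist in
total.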

Of course, such a move would have **broken a lot of assumptions**, such as the Olde Saw that 'a character is
a codepoint, a codepoint is a byte', a truism in all early encoding schemes. In Unicode, a 'character' (a
unit of writing) is distinct from a 'glyph' (the graphical appearance of a character), and while each
'character' is mapped to a 'codepoint', many characters with virtually indistinguishable glyphs are mapped
to disparate codepoints (for a number of reasons: some good, some bad, some historically justified, some
mistaken).

When working with Unicode, it is important to be aware of the fact that **at most 256 Unicode codepoints
are representable within a single byte, a scant 0.256% of the ≈100,000 codepoints in use** (as of
Unicode&nbsp;6; when using UTF-8 as an encoding, a mere 128 codepoints occupy
one byte only). **Understanding the codepoint / character / byte schism is essential for any programmer
wanting to process text**.
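
The gap between characters and bytes is easy to observe. The sketch below assumes a runtime that provides
the standard `TextEncoder` global (recent NodeJS versions and browsers do); the `utf8Length` helper is
introduced here for illustration only.

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// Number of bytes a text occupies when encoded as UTF-8.
function utf8Length(text) {
  return encoder.encode(text).length;
}

console.log(utf8Length('A'));  // 1
console.log(utf8Length('€'));  // 3
console.log(utf8Length('強')); // 3
console.log(utf8Length('𤕣')); // 4
```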

Now **JavaScript *was* fairly advanced for its time** in that its text data type, the `String`, was defined in
terms of Unicode characters at a time when the programming community at large was still happily hacking
bytes, and web designers spat out HTML pages using ISO&nbsp;8859-1 (by comparison, Python, which first
appeared in 1991, got a Unicode data type only in 2000, and abolished its 8bit string type in 2008; until
then, a good many Python programmers operated on bytes rather than codepoints when manipulating text in that
language).

Since JavaScript's `String` data type is character-oriented and Array-like, it is very convenient to, say,
iterate over characters in a text. It is as simple as fetching the length of the string and iterating over
its indexes:

    var log = console.log; // `log` is used as a shorthand in all snippets
    var chr_count = text.length;
    for ( var idx = 0; idx < chr_count; idx += 1 ) {
      log( text[ idx ] ); }

Where JavaScript is something of **a failure** is in its treatment of those elusive 32bit / astral
codepoints. Sadly, for all its simplicity and correctness in 95% of day-to-day programming situations, the
above code snippet is **not correct**, as it fails to account
for the fact that **in JavaScript, 32bit characters occupy two `String` positions**. We can easily test this:

    var show_codepoints = function( text ) {
        var chr;
        var chr_count = text.length;
        for ( var idx = 0; idx < chr_count; idx += 1 ) {
          chr = text[ idx ];
          log( chr, idx, '0x' + chr.charCodeAt( 0 ).toString( 16 ) ); } };

    var text = '自強不息';
    show_codepoints( text );

gives

    自 0 0x81ea
    強 1 0x5f37
    不 2 0x4e0d
    息 3 0x606f

which is correct, since all codepoints are below `0xffff`. However, should a text inadvertently include
astral entities, the algorithm breaks down in nasty ways:

    var text = '𤕣古文龜';
    show_codepoints( text );

gives

    � 0 0xd851
    � 1 0xdd63
    古 2 0x53e4
    文 3 0x6587
    龜 4 0x9f9c

Like in an X-ray, we see that the string now holds *five* positions, in spite of there being only *four*
characters, as in the previous one. The first two units are rendered with the Unicode Replacement
Character (which itself, confusingly, has a codepoint, `0xfffd`; in essence, all codepoints deemed illegal are
mapped to `0xfffd` at some point on their way through the output pipeline), as they are *not legal
codepoints* when occurring without a suitable partner.
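
For comparison, here is a sketch of a codepoint-aware variant. It assumes an ES2015-or-later runtime,
where iterating a string with `for...of` yields whole codepoints and `String::codePointAt` reads a full
surrogate pair; the `showCodepoints` name is introduced here and is not part of this library.

```javascript
// Collect [ character, index, hex codepoint ] rows, one per codepoint.
function showCodepoints(text) {
  const rows = [];
  let idx = 0;
  for (const chr of text) { // the string iterator never splits a surrogate pair
    rows.push([chr, idx, '0x' + chr.codePointAt(0).toString(16)]);
    idx += 1;
  }
  return rows;
}

for (const row of showCodepoints('𤕣古文龜')) console.log(row.join(' '));
// 𤕣 0 0x24563
// 古 1 0x53e4
// 文 2 0x6587
// 龜 3 0x9f9c
```

Note that the index printed here counts characters; it is no longer a valid position into the string
itself, since `'𤕣古文龜'.length` is still 5.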

> The astute reader will notice a conundrum here: we've just proven that the character `𤕣` is stored as two
> units in JavaScript, rather than one. Still, we managed to get it from a UTF-8 encoded source file
> through the runtime and back out onto the console correctly. Shouldn't JavaScript's treatment have
> destroyed the character by splitting it up? The answer is: Yes, that should have happened, and it is
> indeed what did happen in NodeJS versions prior to 7.7 (i believe), and also in earlier versions of
> Chrome. Fortunately, this has been fixed by making input and output routines perform
> 'ensurrogating' and 'desurrogating' on the character streams, with the effect that Joe the Programmer has
> one problem less now.

However, the numbers shown (`0xd851` and `0xdd63`) indicate that JavaScript *can* deal with those positions,
it just cannot print out a suitable glyph for them. Indeed, when we apply the formula for Unicode Surrogate
Pair conversion to the first two code units in the critical text:

    H = 0xd851
    L = 0xdd63
    codepoint = ( H - 0xD800 ) * 0x400 + ( L - 0xDC00 ) + 0x10000

we find that the result `0x24563` does point to 𤕣 (an archaic variant of the modern Chinese character 龜). So
it becomes clear that JavaScript *can* deal with 'astral texts' (rather than rendering them as what has
become known as *mojibake* 文字化け or *krüsel-krüsel*), provided that programmers do respect surrogates.
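
The conversion works in both directions. The two helpers below are a sketch with names chosen for this
example; they implement the standard UTF-16 arithmetic and are not taken from the library's source.

```javascript
// Combine a high and a low surrogate into one astral codepoint.
function codepointFromSurrogates(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

// Split an astral codepoint (0x10000...0x10ffff) into [ high, low ].
function surrogatesFromCodepoint(cid) {
  const offset = cid - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

console.log(codepointFromSurrogates(0xd851, 0xdd63).toString(16)); // 24563
console.log(surrogatesFromCodepoint(0x24563).map((n) => n.toString(16))); // [ 'd851', 'dd63' ]
```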

This leaves code wranglers with a truly confusing and sometimes frustrating situation: First, we had to
swallow that **a byte is not a character** (anyone who spent their time between 2000 and 2010 trying to
'make everything work' in Python versions that had *both* an 8bit `str` type *and* a 16bit `unicode` type
will know what i'm talking about). Next, we have to swallow that (in Java, in JavaScript, and in some
versions of Python, depending on compilation flags) even **a single character may or may not be a single
codepoint**; instead, **characters with codepoints above `0xffff` are represented as two 'Code Units'**, one
more term to learn here.

<!--
## UTF-8 & CESU-8

UTF-8 (an abbreviation for 'UCS Transformation Format, 8-bit', UCS being the Universal Coded Character
Set) was invented in 1992 to
solve the problem of transmitting and storing Unicode texts using the available octet-(byte-)based file and
wire formats of the time. It has since become the undisputed standard encoding for web pages, URLs, and
text files.

Since codepoints are represented as positive integers, and octets (bytes) can be used to represent numbers
between zero and 255, the problem that UTF-8 attempts to solve boils down to representing 'big' integers
(those greater than 255) as series of octets.

I won't go into detail here; there's a good [Wikipedia](http://en.wikipedia.org/wiki/UTF-8) article
on how UTF-8 mangles bits. The one important thing to understand when it comes to JavaScript and UTF-8
is that


 -->

# Glossary

## Character Representations

It is often necessary to convert between different character representations. For example, the character `強`
may be represented as `&#x5f37;` in HTML; this can help when the source text must be edited in a
Unicode-unfriendly environment. Likewise, the Unicode Consortium identifies codepoints using the `U+`
notation, e.g. `U+5F37`.

### Numeric Character Representation (NCR)

The Numeric(al) Character Representation (NCR) format was invented to represent 'difficult' characters in
SGML, a markup format with roots in the 1960s (standardized in 1986) which became the ancestor of both XML
and HTML. An NCR consists of
an ampersand&nbsp;`&` followed by a hash&nbsp;`#`, followed by the numerical codepoint identifier (CID,
otherwise known simply as 'a codepoint') of the character in question, and closed with a semicolon&nbsp;`;`.
The CID may be written out in decimal or hexadecimal, upper- or lowercase, and with optional leading zeros.
If written in hexadecimal, an `x` must be placed between the hash&nbsp;`#` and the CID: thus, `ਵ`, named
GURMUKHI LETTER VA, may be represented as `&#x0A35;`, `&#xA35;`, `&#x0a35;`, `&#2613;`, `&#02613;`. (The
flexibility of these rules and the plethora of possible variants is somewhat of a hallmark of earlier
computing standards; other examples for this phenomenon are email- and IP-addresses.)
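
To make the rules concrete, here is a minimal JavaScript sketch that reads the CID out of an NCR at the
start of a text. `NCR_PATTERN` and `cidFromNcr` are names made up for this example (they are not the
library's `_ncr_matcher`), and the pattern follows the description above in accepting a lowercase `x` only.

```javascript
// '&#' + ( 'x' + hex digits | decimal digits ) + ';' at the start of the text.
const NCR_PATTERN = /^&#(?:x([0-9a-fA-F]+)|([0-9]+));/;

// Return the CID encoded by a leading NCR, or null if there is none.
function cidFromNcr(text) {
  const match = NCR_PATTERN.exec(text);
  if (match === null) return null;
  return match[1] !== undefined ? parseInt(match[1], 16) : parseInt(match[2], 10);
}

// All five spellings of GURMUKHI LETTER VA yield the same CID.
for (const ncr of ['&#x0A35;', '&#xA35;', '&#x0a35;', '&#2613;', '&#02613;']) {
  console.log(ncr, cidFromNcr(ncr)); // each line ends in 2613
}
console.log(cidFromNcr('&jzr#xe100;')); // null
```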

### Unicode Character Representation (UCR)

The Unicode Consortium's Character Representation (UCR) format is used by the Unicode Consortium in its
publications. It consists of an uppercase&nbsp;`U`, followed by a plus sign&nbsp;`+`, followed by the CID of
the character in question. The CID is invariably written out in uppercase hexadecimal; it is padded with
zeros when shorter than four digits; otherwise, it consists of five or six digits as needed. For example,
`ਵ` is represented as `U+0A35`.
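
These formatting rules fit into a single line of JavaScript. The `ucrFromCid` helper is a sketch named
for this example and assumes a runtime with `String::padStart` (ES2017 or later).

```javascript
// Uppercase hexadecimal, zero-padded to at least four digits, prefixed with 'U+'.
function ucrFromCid(cid) {
  return 'U+' + cid.toString(16).toUpperCase().padStart(4, '0');
}

console.log(ucrFromCid(0xa35));    // U+0A35
console.log(ucrFromCid(0x5f37));   // U+5F37
console.log(ucrFromCid(0x24563));  // U+24563
```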

### Friendly NCRs

When having to reference and identify characters in explanatory texts, i personally like to write out the
codepoint in a fashion that is both somewhat less 'formal' than NCRs and somewhat more readable, flexible
and informative than UCRs; i call this format FNCR for Friendly Numeric Character Representation. It is
mainly intended for use in publications such as character references, where a note should tell the
reader how to decode the constituent parts of the notation.

The general format of an FNCR is as follows: first, the character set is indicated by a short string of
letters, `u` being reserved for Unicode; this part is called the Character Set siGil (CSG). Then, the
relevant subset or block of the position of the codepoint in the character set is identified by a so-called
Range SiGil (RSG). Last, the CID is written out in lowercase hexadecimal. The parts of the FNCR are joined
by hyphens `-`. Here are a few examples:

    ਵ      u-guru-a35      # guru:   (ISO code for) 'Gurmukhi'
    強     u-cjk-5f37      # cjk:    Unicode block 'CJK Unified Ideographs'
    𤕣     u-cjk-xb-24563  # cjk-xb: Unicode block 'CJK Unified Ideographs Extension B'
    €      u-cur-20ac      # cur:    Unicode block 'Currency Symbols'

RSGs are important for big character sets such as Unicode, where tens of thousands of characters are
distributed over hundreds of blocks and it is easy to lose orientation. Since FNCRs include a character set
sigil, codepoints from multiple character sets may be identified; for example, here we use `l9` to stand for
Latin-9, otherwise known as ISO 8859-15, and `cp1252` for Windows Codepage 1252:

    €    = u-cur-20ac
         = l9-a4
         = cp1252-80
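
Composing such a notation is a matter of joining the parts with hyphens. The `fncrFrom` helper below is a
hypothetical sketch, not the library's `as_fncr`; it treats the range sigil as optional because the `l9`
and `cp1252` examples above are written without one.

```javascript
// Join character set sigil, optional range sigil, and lowercase hex CID.
function fncrFrom(csg, rsg, cid) {
  const parts = rsg ? [csg, rsg] : [csg];
  parts.push(cid.toString(16));
  return parts.join('-');
}

console.log(fncrFrom('u', 'cur', 0x20ac));   // u-cur-20ac
console.log(fncrFrom('l9', null, 0xa4));     // l9-a4
console.log(fncrFrom('cp1252', null, 0x80)); // cp1252-80
```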

### Extended NCRs

## Other Terms

### Character Identifier (CID)

Probably differs from the concept as used by [Adobe](http://en.wikipedia.org/wiki/PostScript_fonts#CID).

### Codepoint (Codeposition)

### Codeunit

### Character Set Sigil (CSG)