1 |
|
2 | # Front matter
|
3 |
|
4 | The CoffeeNode `CHR` module (short for 'character') is a library for handling characters within NodeJS in a
|
5 | Unicode-compliant, Astral-Plane-aware fashion. It includes functions to split texts into characters,
|
6 | iterate over characters, and convert between a number of different character representations.
|
7 |
|
8 | ## Installation
|
9 |
|
Install as

    npm install coffeenode-chr

Require as, e.g.

    CHR = require 'coffeenode-chr'
|
17 |
|
18 | **Table of Contents**
|
19 |
|
- [API](#api)
  - [Overview](#overview)
  - [Public Members](#public-members)
    - [analyze = ( cid_hint, options ) ->](#analyze---cid_hint-options---)
    - [as_chr = ( cid_hint, options ) ->](#as_chr---cid_hint-options---)
    - [as_cid = ( cid_hint, options ) ->](#as_cid---cid_hint-options---)
    - [as_csg = ( cid_hint, options ) ->](#as_csg---cid_hint-options---)
    - [as_fncr = ( cid_hint, options ) ->](#as_fncr---cid_hint-options---)
    - [as_ncr = ( cid_hint, options ) ->](#as_ncr---cid_hint-options---)
    - [as_range_name = ( cid_hint, options ) ->](#as_range_name---cid_hint-options---)
    - [as_rsg = ( cid_hint, options ) ->](#as_rsg---cid_hint-options---)
    - [as_sfncr = ( cid_hint, options ) ->](#as_sfncr---cid_hint-options---)
    - [as_xncr = ( cid_hint, options ) ->](#as_xncr---cid_hint-options---)
    - [chrs_from_text = ( text, options ) ->](#chrs_from_text---text-options---)
    - [chunks_from_text = ( text, options ) ->](#chunks_from_text---text-options---)
    - [html_from_text = ( text, options ) ->](#html_from_text---text-options---)
    - [cid_from_chr = ( chr, options ) ->](#cid_from_chr---chr-options---)
    - [csg_cid_from_chr = ( chr, options ) ->](#csg_cid_from_chr---chr-options---)
    - [validate_is_cid = ( x ) ->](#validate_is_cid---x---)
    - [validate_is_csg = ( x ) ->](#validate_is_csg---x---)
  - [Private Members](#private-members)
- [Background](#background)
  - [JavaScript & Unicode](#javascript--unicode)
- [Glossary](#glossary)
  - [Character Representations](#character-representations)
    - [Numeric Character Representation (NCR)](#numeric-character-representation-ncr)
    - [Unicode Character Representation (UCR)](#unicode-character-representation-ucr)
    - [Friendly NCRs](#friendly-ncrs)
    - [Extended NCRs](#extended-ncrs)
  - [Other Terms](#other-terms)
    - [Character Identifier (CID)](#character-identifier-cid)
    - [Codepoint (Codeposition)](#codepoint-codeposition)
    - [Codeunit](#codeunit)
    - [Character Set Sigil (CSG)](#character-set-sigil-csg)
|
54 |
|
55 | > *generated with [DocToc](http://doctoc.herokuapp.com/)*
|
56 |
|
57 |
|
58 | # API
|
59 |
|
60 | ## Overview
|
61 |
|
Most methods of the present library accept up to two arguments, `cid_hint` and `options`. CID is short for
'character identifier', the integer that each codepoint in a coded character set is associated with. In
JavaScript, `String::charCodeAt` returns this number for characters within the Basic Multilingual Plane; for
astral characters it returns a surrogate code unit instead (see [JavaScript & Unicode](#javascript--unicode)).
Besides numbers, methods with this signature also accept non-empty strings.
|
66 |
|
67 | * Methods may be called with one or two arguments; the first is known as the 'CID hint', the second as
|
68 | 'options'.
|
69 |
|
* The CID hint may be a number or a text. A number is understood as a CID. A text is scrutinized for its
  first character, as determined by the `options[ 'input' ]` setting; the rest of the text is ignored.
|
73 |
|
74 | * `options` must be a Plain Old Dictionary (a JS object) with the optional members `input`, `output`, and
|
75 | `csg`.
|
76 |
|
77 | * `options[ 'input' ]` is *only* observed if the CID hint is a text; it governs which kinds of character
|
78 | references are recognized in the text. `input` may be one of `plain`, `ncr`, or `xncr`; it defaults to
|
79 | `plain` (no character references will be recognized).
|
80 |
|
* `options[ 'csg' ]` sets the character set sigil. If `csg` is set in the options, it overrides whatever
  CSG `CHR.csg_cid_from_chr` would otherwise arrive at. In other words, if you call
  `CHR.as_sfncr '&jzr#xe100;', input: 'xncr', csg: 'u'`, you will get `u-e100`, the numerically
  equivalent codepoint from the `u` (Unicode) character set.
|
85 |
|
86 | * Before CSG and CID are returned, they will be validated for plausibility.
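
The rules for the CID hint can be illustrated with a small sketch. It is written in plain JavaScript, covers
only the `plain` input mode, and is *not* the library's implementation; the function name
`cidFromHintPlain` is made up for this example.

```javascript
// Hypothetical helper (not part of coffeenode-chr): resolve a CID hint in
// 'plain' mode, i.e. without recognizing any character references.
function cidFromHintPlain(cidHint) {
  if (typeof cidHint === 'number') {
    // A number is taken to be the CID itself, after a plausibility check.
    if (!Number.isInteger(cidHint) || cidHint < 0) {
      throw new TypeError('expected a non-negative integer, got ' + cidHint);
    }
    return cidHint;
  }
  if (typeof cidHint === 'string' && cidHint.length > 0) {
    // Only the first character counts; codePointAt respects surrogate pairs.
    return cidHint.codePointAt(0);
  }
  throw new TypeError('expected a number or a non-empty text');
}

console.log(cidFromHintPlain('helo world')); // 104
console.log(cidFromHintPlain('&#x24563;'));  // 38, the ampersand
console.log(cidFromHintPlain(0x24563));      // 148835
```

In `plain` mode the NCR is not decoded, which is why the second call yields the codepoint of `&`.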
|
87 |
|
88 | ## Public Members
|
89 |
|
90 | ### `analyze = ( cid_hint, options ) ->`
|
91 |
|
92 | The many-tricks-pony of `coffeenode-chr`. It will return an
|
93 | object describing multiple aspects of the codepoint in question. Examples:
|
94 |
|
```coffeescript
# CHR.analyze 'helo world'

{ chr:    'h',            # The first character of the text.
  csg:    'u',            # CSG 'u', i.e. a Unicode character.
  cid:    104,            # Codepoint: 104 = 0x68.
  fncr:   'u-latn-68',    # The 'friendly NCR'; see 'rsg', below.
  sfncr:  'u-68',         # The 'short friendly NCR'.
  ncr:    '&#x68;',       # HTML-compatible Numeric Character Reference (NCR).
  xncr:   '&#x68;',       # For codepoints outside of Unicode, this will differ from the NCR.
  rsg:    'u-latn' }      # The 'range sigil', i.e. Unicode block identifier.
```
|
107 |
|
108 | When using Numerical Character References (NCRs), it is important to choose the right input mode (namely,
|
109 | `ncr` or `xncr`):
|
110 |
|
```coffeescript
# CHR.analyze '&#x24563;' # same as `CHR.analyze '&#x24563;', input: 'plain'`

{ chr:    '&',
  csg:    'u',
  cid:    38,
  fncr:   'u-latn-26',
  sfncr:  'u-26',
  ncr:    '&#x26;',
  xncr:   '&#x26;',
  rsg:    'u-latn' }

# CHR.analyze '&#x24563;', input: 'ncr'

{ chr:    '𤕣',
  csg:    'u',
  cid:    148835,
  fncr:   'u-cjk-xb-24563',
  sfncr:  'u-24563',
  ncr:    '&#x24563;',
  xncr:   '&#x24563;',
  rsg:    'u-cjk-xb' }

# CHR.analyze '&#x24563;', input: 'xncr'

{ chr:    '𤕣',
  csg:    'u',
  cid:    148835,
  fncr:   'u-cjk-xb-24563',
  sfncr:  'u-24563',
  ncr:    '&#x24563;',
  xncr:   '&#x24563;',
  rsg:    'u-cjk-xb' }
```
|
145 |
|
In the above examples, the NCR `&#x24563;` was successfully decoded in input modes `ncr` and `xncr`, while
in `plain` mode, `&` counts as the first character of the text.

Two more examples show how to reference characters outside of Unicode. First, let's analyze an 'extended
NCR' using input mode `ncr`. Since `&jzr#xe100;` violates the rules for NCRs, it is not recognized and is
treated as ordinary text (the same way browsers do it). The first character of that text is an ampersand
`&`, so all we get is a description of that character:
|
153 |
|
154 | ```coffeescript
|
155 | # CHR.analyze '&jzr#xe100;', input: 'ncr'
|
156 |
|
157 | { chr: '&',
|
158 | csg: 'u',
|
159 | cid: 38,
|
160 | fncr: 'u-latn-26',
|
161 | sfncr: 'u-26',
|
162 | ncr: '&',
|
163 | xncr: '&',
|
164 | rsg: 'u-latn' }
|
165 | ````
|
166 |
|
167 | When we switch to mode `xncr`, the extended NCR is properly detected:
|
168 |
|
```coffeescript
# CHR.analyze '&jzr#xe100;', input: 'xncr'

{ chr:    '&jzr#xe100;',  # the XNCR has been recognized.
  csg:    'jzr',          # The CSG identifies the Jizura Character Set (JZRCS).
  cid:    57600,          # CID is 57600 = 0xe100
  fncr:   'jzr-fig-e100', # The FNCR tells us that the codepoint is in the 'fig' block of the JZRCS.
  sfncr:  'jzr-e100',
  ncr:    '&#xe100;',     # When rendering to a web page, we must use standard NCRs.
  xncr:   '&jzr#xe100;',  # In plain texts, databases and so on, we may wish to use this notation.
  rsg:    'jzr-fig' }
```
|
181 |
|
This information may be used to properly render the codepoint in question, say, as
`<span class='jzr-fig'>&#xe100;</span>`,
along with suitable CSS rules that tell the browser which font to use.
|
185 |
|
186 |
|
187 | ### `as_chr = ( cid_hint, options ) ->`
|
188 |
|
189 | ### `as_cid = ( cid_hint, options ) ->`
|
190 |
|
191 | ### `as_csg = ( cid_hint, options ) ->`
|
192 |
|
193 | ### `as_fncr = ( cid_hint, options ) ->`
|
194 |
|
195 | ### `as_ncr = ( cid_hint, options ) ->`
|
196 |
|
197 | ### `as_range_name = ( cid_hint, options ) ->`
|
198 |
|
199 | ### `as_rsg = ( cid_hint, options ) ->`
|
200 |
|
201 | ### `as_sfncr = ( cid_hint, options ) ->`
|
202 |
|
203 | ### `as_xncr = ( cid_hint, options ) ->`
|
204 |
|
205 | ### `chrs_from_text = ( text, options ) ->`
|
206 |
|
Given a `text` and `options`, return a list of characters. The only relevant setting of `options` is `input`,
which governs (as in most methods) whether NCRs/XNCRs are recognized. When applying a CHR method such as
`CHR.analyze` to each element of the resulting list (with the same `input` setting), a complete analysis of
a text can be performed.
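
To give an idea of what splitting a text into characters involves, here is a minimal sketch in plain
JavaScript. It handles only the `plain` case (no NCRs or XNCRs), it is *not* the library's implementation,
and the name `splitIntoCharacters` is made up for this example.

```javascript
// Hypothetical helper: split a text into characters, keeping surrogate
// pairs together so that astral characters are not torn apart.
function splitIntoCharacters(text) {
  // Try a high surrogate followed by a low surrogate first;
  // otherwise take any single code unit.
  const matcher = /[\ud800-\udbff][\udc00-\udfff]|[\s\S]/g;
  return text.match(matcher) || [];
}

console.log(splitIntoCharacters('自強不息').length); // 4
console.log(splitIntoCharacters('𤕣古文龜').length); // 4, although the string has 5 code units
```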
|
211 |
|
212 | ### `chunks_from_text = ( text, options ) ->`
|
213 |
|
Given a `text` and `options` (of which only `input` is relevant here), return a list of `CHR/chunk`
objects (as returned by `CHR._new_chunk`) that describe stretches of characters with codepoints in the
same 'range' (Unicode block).
|
217 |
|
218 | ### `html_from_text = ( text, options ) ->`
|
219 |
|
220 | ### `cid_from_chr = ( chr, options ) ->`
|
221 |
|
222 | ### `csg_cid_from_chr = ( chr, options ) ->`
|
223 |
|
224 | ### `validate_is_cid = ( x ) ->`
|
225 |
|
226 | ### `validate_is_csg = ( x ) ->`
|
227 | <!-- **cid\_from\_fncr = ( ) ->**
|
228 |
|
229 | **cid\_from\_ncr = ( ) ->**
|
230 |
|
231 | **cid\_from\_xncr = ( ) ->**
|
232 | -->
|
233 |
|
234 | ## Private Members
|
235 |
|
236 | ```coffeescript
|
237 | _analyze = ( csg, cid ) ->
|
238 | _as_fncr = ( csg, cid ) ->
|
239 | _as_range_name = ( csg, cid ) ->
|
240 | _as_rsg = ( csg, cid ) ->
|
241 | _as_sfncr = ( csg, cid ) ->
|
242 | _as_xncr = ( csg, cid ) ->
|
243 | _chr_csg_cid_from_chr = ( chr, mode ) ->
|
244 | _csg_cid_from_hint = ( cid_hint, options ) ->
|
245 | _names_and_ranges_by_csg = unicode_blocks_data[ 'names-and-ranges-by-csg' ]
|
246 | _unicode_chr_from_cid = ( cid ) ->
|
247 |
|
248 | _csg_matcher
|
249 | _first_chr_matcher_ncr
|
250 | _first_chr_matcher_plain
|
251 | _first_chr_matcher_xncr
|
252 | _ncr_csg_cid_matcher
|
253 | _ncr_matcher
|
254 | _ncr_splitter
|
255 | _nonsurrogate_matcher
|
256 | _plain_splitter
|
257 | _surrogate_matcher
|
258 | _xncr_csg_cid_matcher
|
259 | _xncr_matcher
|
260 | _xncr_splitter
|
261 | ````
|
262 |
|
263 | # Background
|
264 |
|
265 | <!-- ## Unicode characters & codepoints -->
|
266 |
|
267 | ## JavaScript & Unicode
|
268 |
|
**When JavaScript was conceived** in 1995 (standardization as ECMAScript followed in 1997), the Unicode
standard was still in its infancy. Early design plans for a Universal Character Set had argued that
2<sup>16</sup> or even a mere 2<sup>14</sup> codepoints should be more than sufficient to represent all
characters in current use. It was only in 1996 that the Unicode Consortium acknowledged the need for a far
bigger number of codepoints, and hence pushed the highest valid codepoint position from `0xffff` to
`0x10ffff`; there are, consequently, 1,114,112 codepoints available in Unicode from version 2.0 onwards. In
the vernacular, Unicode codepoints beyond `0xffff` (the frontier of the Basic Multilingual Plane, BMP) are
known as '32bit characters', and the wilderness out there as 'The Astral Planes'.
|
277 |
|
**In order to remain backwards compatible** with those applications which had been implemented under the
premise that each Unicode codepoint would be representable within 16 bits, a **'Surrogate Character'**
mechanism was introduced along with the extension of the character set. To understand Surrogate
Characters, imagine the same had happened to ASCII, which, when stored in 8bit bytes, leaves all codepoints
using the 8<sup>th</sup> bit undefined. We could go and make those 128 codepoints surrogate characters,
stipulating that the 64 codepoints in the range `0x80`...`0xbf` shall serve as 'High (or Leading)
Surrogates', those in the range `0xc0`...`0xff` as 'Low (or Trailing) Surrogates', and that High and Low
Surrogates shall always appear as pairs. In essence, every surrogate pair `HL` then represents a
number written out in a positional system with the base 64 (with the offsets `0x80` and `0xc0`). That is, we
can project it onto codepoints beyond `0xff` (255) by saying that a surrogate pair `HL` is a way to access
codepoint `( H - 0x80 ) * 64 + ( L - 0xc0 ) + 0x100`, allowing us to express 64 × 64 = 4096 additional
codepoints in the range `0x100`...`0x10ff` from within the confines of an 8bit character set!
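
The thought experiment can be sketched in code. This is a toy, not a real encoding; the function names are
made up, and the offset `0x100` is an assumption chosen so that the first pair lands on the first codepoint
beyond `0xff`.

```javascript
// Toy 8-bit surrogate scheme from the thought experiment (not a real encoding).
// High surrogates: 0x80...0xbf, low surrogates: 0xc0...0xff.
function decodeToySurrogatePair(high, low) {
  if (high < 0x80 || high > 0xbf) throw new RangeError('not a toy high surrogate');
  if (low < 0xc0 || low > 0xff) throw new RangeError('not a toy low surrogate');
  // Two base-64 digits, shifted past the 256 codepoints a byte can hold.
  return (high - 0x80) * 64 + (low - 0xc0) + 0x100;
}

function encodeToySurrogatePair(codepoint) {
  if (codepoint < 0x100 || codepoint > 0x10ff) throw new RangeError('out of toy range');
  const offset = codepoint - 0x100;
  return [0x80 + Math.floor(offset / 64), 0xc0 + (offset % 64)];
}

console.log(decodeToySurrogatePair(0x80, 0xc0).toString(16)); // 100
console.log(decodeToySurrogatePair(0xbf, 0xff).toString(16)); // 10ff
```

The real Unicode mechanism works the same way, only with 1024 high and 1024 low surrogates and an offset of
`0x10000`.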
|
290 |
|
Of course, such a move would have **broken a lot of assumptions**, such as the Olde Saw that 'a character is
a codepoint, a codepoint is a byte', a truism in all early encoding schemes. In Unicode, a 'character' (a
unit of writing) is distinct from a 'glyph' (the graphical appearance of a character), and while each
encoded character is mapped to a 'codepoint', many virtually indistinguishable glyphs may be mapped to
disparate codepoints (for a number of reasons: some good, some bad, some historically justified, some
mistaken).
|
296 |
|
When working with Unicode, it is important to be aware of the fact that **only up to 256 Unicode codepoints
are maximally representable within a 'byte', a scant 0.256% of the ≈100,000 codepoints in use** (as of
Unicode 6). When using UTF-8 as an encoding, there are actually a mere 128 codepoints left that occupy
one byte only. **Understanding the codepoint / character / byte schism is essential for any programmer
wanting to process text**.
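
A quick way to see the schism is to count the bytes a single character occupies in UTF-8. The sketch below
relies on NodeJS's `Buffer.byteLength`; the wrapper name `utf8ByteLength` is made up for this example.

```javascript
// Number of bytes a text occupies when encoded as UTF-8 (NodeJS only).
function utf8ByteLength(text) {
  return Buffer.byteLength(text, 'utf8');
}

console.log(utf8ByteLength('h'));  // 1 (codepoint below 0x80)
console.log(utf8ByteLength('€'));  // 3
console.log(utf8ByteLength('強')); // 3
console.log(utf8ByteLength('𤕣')); // 4 (astral codepoint)
```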
|
302 |
|
Now **JavaScript *was* fairly advanced for its time** in that its text data type, the `String`, was defined
in terms of Unicode characters at a time when the programming community at large was still happily hacking
bytes, and web designers spit out HTML pages using ISO 8859-1 (by comparison, Python, which was
first released in 1991, got a Unicode data type only in 2000, and abolished its 8bit string type in 2008;
until then, a good many Python programmers operated on bytes rather than codepoints when manipulating text
in that language).
|
309 |
|
Since JavaScript's `String` data type is character-oriented and Array-like, it is very convenient to, say,
iterate over characters in a text. It is as simple as fetching the length of the string and iterating over
its indexes:

    var log = console.log; // shorthand used in the snippets below
    var chr_count = text.length;
    for ( var idx = 0; idx < chr_count; idx += 1 ) {
      log( text[ idx ] ); }
|
317 |
|
Where JavaScript is **a bit of a failure** is in its treatment of those elusive 32bit / astral
codepoints. Sadly, for all its simplicity and its correctness in 95% of day-to-day programming
situations, the above code snippet is **not correct**, as it fails to account
for the fact that **in JavaScript, 32bit characters occupy two `String` positions**. We can easily test this:
|
322 |
|
323 | var show_codepoints = function( text ) {
|
324 | var chr;
|
325 | var chr_count = text.length;
|
326 | for ( var idx = 0; idx < chr_count; idx += 1 ) {
|
327 | chr = text[ idx ];
|
328 | log( chr, idx, '0x' + chr.charCodeAt( 0 ).toString( 16 ) ); } };
|
329 |
|
330 | var text = '自強不息';
|
331 | show_codepoints( text );
|
332 |
|
333 | gives
|
334 |
|
335 | 自 0 0x81ea
|
336 | 強 1 0x5f37
|
337 | 不 2 0x4e0d
|
338 | 息 3 0x606f
|
339 |
|
340 | which is correct, since all codepoints are below `0xffff`. However, should a text inadvertently include
|
341 | astral entities, the algorithm breaks down in nasty ways:
|
342 |
|
343 | var text = '𤕣古文龜';
|
344 | show_codepoints( text );
|
345 |
|
346 | gives
|
347 |
|
348 | � 0 0xd851
|
349 | � 1 0xdd63
|
350 | 古 2 0x53e4
|
351 | 文 3 0x6587
|
352 | 龜 4 0x9f9c
|
353 |
|
Like in an X-ray, we see that the string now holds *five* positions, in spite of there being only *four*
characters, as in the previous example. The first two units are rendered with the Unicode Replacement
Character (which itself, confusingly, has a codepoint, `0xfffd`; in essence, all codepoints deemed illegal
are mapped to `0xfffd` at some point on their way through the output pipeline), as they are *not legal
codepoints* when occurring without a suitable partner.
|
359 |
|
> The astute reader will notice a conundrum here: we've just proven that the character `𤕣` is stored as two
> units in JavaScript, rather than one. Still, we managed to get it from a UTF-8 encoded sourcefile
> through the runtime and back out onto the console correctly. Shouldn't JavaScript's treatment have
> destroyed the character by splitting it up? The answer is: Yes, that should have happened, and it is
> indeed what did happen in NodeJS versions prior to 7.7 (i believe), and also in earlier versions of
> Chrome. Fortunately, this has been fixed by making it so that input and output routines perform
> 'ensurrogating' and 'desurrogating' on the character streams, with the effect that Joe the Programmer has
> one problem less now.
|
368 |
|
369 | However, the numbers shown (`0xd851` and `0xdd63`) indicate that JavaScript *can* deal with those positions,
|
370 | it just cannot print out a suitable glyph for them. Indeed, when we apply the formula for Unicode Surrogate
|
371 | Pair conversion to the first two codepoints in the critical text:
|
372 |
|
373 | H = 0xd851
|
374 | L = 0xdd63
|
375 | codepoint = ( H - 0xD800 ) * 0x400 + ( L - 0xDC00 ) + 0x10000
|
376 |
|
we find that the result `0x24563` does point to 𤕣 (an archaic variant of the modern Chinese character 龜). So
it becomes clear that JavaScript *can* deal with 'astral texts' without rendering them as what has become
known as *mojibake* 文字化け or *krüsel-krüsel*, provided that programmers do respect surrogates.
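
The conversion formula translates directly into code. The following is a minimal sketch with made-up
function names; it shows how a loop over `String` positions can be made surrogate-aware so that each
character yields exactly one codepoint.

```javascript
// Combine a high and a low surrogate into the codepoint they stand for.
function codepointFromSurrogates(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

// List the codepoints of a text, treating surrogate pairs as one character.
function codepointsOf(text) {
  const result = [];
  for (let idx = 0; idx < text.length; idx += 1) {
    const unit = text.charCodeAt(idx);
    const next = text.charCodeAt(idx + 1); // NaN when idx is the last position
    const isPair = unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) {
      result.push(codepointFromSurrogates(unit, next));
      idx += 1; // skip the low surrogate, it has been consumed
    } else {
      result.push(unit);
    }
  }
  return result;
}

console.log(codepointFromSurrogates(0xd851, 0xdd63).toString(16)); // 24563
console.log(codepointsOf('𤕣古文龜').map((cid) => cid.toString(16)));
// [ '24563', '53e4', '6587', '9f9c' ]
```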
|
380 |
|
This leaves code wranglers with a truly confusing and sometimes frustrating situation: First, we had to
swallow that **a byte is not a character** (anyone who spent their time between 2000 and 2010 trying to
'make everything work' in Python versions that had *both* an 8bit `str` type *and* a 16bit `unicode` type
will know what i'm talking about). Next, we have to swallow that (in Java, in JavaScript, and in some
versions of Python, depending on compilation flags) even **a single character may or may not be a single
codepoint**; instead, **characters with codepoints above `0xffff` are represented as two 'Code Units'**, one
more term to learn here.
|
388 |
|
389 | <!--
|
390 | ## UTF-8 & CESU-8
|
391 |
|
UTF-8 (an abbreviation for 'UCS Transformation Format, 8-bit', UCS being the Universal Coded Character Set) was invented in 1992 to
|
393 | solve the problem of transmitting and storing Unicode texts using the available octet-(byte-)based file and
|
394 | wire formats of the time. It has since become the undisputed standard encoding for web pages, URLs, and
|
395 | text files.
|
396 |
|
397 | Since codepoints are represented as positive integers, and octets (bytes) can be used to represent numbers
|
398 | between zero and 255, the problem that UTF-8 attempts to solve boils down to representing 'big' integers
|
399 | (those greater than 255) as series of octets.
|
400 |
|
I won't go into detail here; there's a good [Wikipedia](http://en.wikipedia.org/wiki/UTF-8) article
|
402 | on how UTF-8 mangles bits. The one important thing to understand when it comes to JavaScript and UTF-8
|
403 | is that
|
404 |
|
405 |
|
406 | -->
|
407 |
|
408 | # Glossary
|
409 |
|
410 | ## Character Representations
|
411 |
|
It is often necessary to convert between different character representations. For example, the character `強`
may be represented as `&#x5f37;` in HTML, which can help when the source text must be edited in a
Unicode-unfriendly environment. Likewise, the Unicode Consortium identifies codepoints using the `U+`
notation, e.g. `U+5F37`.
|
416 |
|
417 | ### Numeric Character Representation (NCR)
|
418 |
|
The Numeric(al) Character Representation (NCR) format was invented to represent 'difficult' characters in
SGML, a markup format that grew out of IBM's GML of the 1960s, was standardized in 1986, and became the
ancestor of both XML and HTML. An NCR consists of an ampersand `&` followed by a hash `#`, followed by the
numerical codepoint identifier (CID, otherwise known simply as 'a codepoint') of the character in question,
and closed with a semicolon `;`. The CID may be written out in decimal or hexadecimal, upper- or lowercase,
and with optional leading zeros. If written in hexadecimal, an `x` must be placed between the hash `#` and
the CID: thus, `ਵ`, named GURMUKHI LETTER VA, may be represented as `&#x0a35;`, `&#xa35;`, `&#xA35;`,
`&#2613;`, `&#02613;`. (The flexibility of these rules and the plethora of possible variants is somewhat of
a hallmark of earlier computing standards; other examples for this phenomenon are email- and IP-addresses.)
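
The rules above are simple enough to sketch in a few lines of JavaScript. The function names are made up
for this example, and the choice of lowercase hexadecimal without padding for the output is an assumption,
one of the many variants the format allows.

```javascript
// Write a CID as an NCR (here: lowercase hexadecimal, no leading zeros).
function ncrFromCid(cid) {
  return '&#x' + cid.toString(16) + ';';
}

// Read the CID back from a decimal or hexadecimal NCR; null if the text is
// not a well-formed NCR.
function cidFromNcr(ncr) {
  const match = /^&#(?:x([0-9a-fA-F]+)|([0-9]+));$/.exec(ncr);
  if (match === null) return null;
  return match[1] !== undefined ? parseInt(match[1], 16) : parseInt(match[2], 10);
}

console.log(ncrFromCid(0x0a35));        // &#xa35;
console.log(cidFromNcr('&#x0A35;'));    // 2613
console.log(cidFromNcr('&#02613;'));    // 2613
console.log(cidFromNcr('&jzr#xe100;')); // null, not a standard NCR
```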
|
428 |
|
429 | ### Unicode Character Representation (UCR)
|
430 |
|
431 | The Unicode Consortium's Character Representation (UCR) format is used by the Unicode Consortium in its
|
432 | publications. It consists of an uppercase `U`, followed by a plus sign `+`, followed by the CID of
|
433 | the character in question. The CID is invariably written out in uppercase hexadecimal; it is padded with
|
434 | zeros when shorter than four digits; otherwise, it consists of five or six digits as needed. For example,
|
435 | `ਵ` is represented as `U+0A35`.
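
As a small illustration, the padding rule can be written down as follows (a sketch only; the function name
`ucrFromCid` is made up):

```javascript
// Format a CID in the U+ notation: uppercase hexadecimal, at least four digits.
function ucrFromCid(cid) {
  let hex = cid.toString(16).toUpperCase();
  while (hex.length < 4) hex = '0' + hex;
  return 'U+' + hex;
}

console.log(ucrFromCid(0x0a35));  // U+0A35
console.log(ucrFromCid(0x24563)); // U+24563
```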
|
436 |
|
437 | ### Friendly NCRs
|
438 |
|
When having to reference and identify characters in explanatory texts, i personally like to write out the
codepoint in a fashion that is both somewhat less 'formal' than NCRs and somewhat more readable, flexible
and informative than UCRs; i call this format FNCR for Friendly Numeric Character Representation. It is
mainly intended for use in publications such as character references, where a note should tell the
reader how to decode the constituent parts of the notation.
|
444 |
|
445 | The general format of an FNCR is as follows: first, the character set is indicated by a short string of
|
446 | letters, `u` being reserved for Unicode; this part is called the Character Set siGil (CSG). Then, the
|
447 | relevant subset or block of the position of the codepoint in the character set is identified by a so-called
|
448 | Range SiGil (RSG). Last, the CID is written out in lowercase hexadecimal. The parts of the FNCR are joined
|
449 | by hyphens `-`. Here are a few examples:
|
450 |
|
451 | ਵ u-guru-a35 # guru: (ISO code for) 'Gurmukhi'
|
452 | 強 u-cjk-5f37 # cjk: Unicode block 'CJK Unified Ideographs'
|
453 | 𤕣 u-cjkxb-24563 # cjkxb: Unicode block 'CJK Ideograph Extension B'
|
454 | € u-cur-20ac # cur: Unicode block 'Currency Symbols'
|
455 |
|
RSGs are important for big character sets such as Unicode, where tens of thousands of characters are
distributed over hundreds of blocks, so that it is easy to lose orientation. Since FNCRs include a character
set sigil, codepoints from multiple character sets may be identified; for example, here we use `l9` to stand
for Latin-9, otherwise known as ISO 8859-15, and `cp1252` for Windows Codepage 1252:
|
460 |
|
461 | € = u-cur-20ac
|
462 | = l9-a4
|
463 | = cp1252-80
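
Putting the notation together is a matter of joining its parts with hyphens. The sketch below uses a
made-up function name and takes the sigils as given; looking up the right RSG for a codepoint is a separate
task that is not shown here.

```javascript
// Join CSG, optional RSG, and CID (lowercase hexadecimal) with hyphens.
// Pass null as `rsg` to leave the range sigil out.
function friendlyNcr(csg, rsg, cid) {
  const parts = rsg === null ? [csg] : [csg, rsg];
  parts.push(cid.toString(16));
  return parts.join('-');
}

console.log(friendlyNcr('u', 'guru', 0x0a35)); // u-guru-a35
console.log(friendlyNcr('u', 'cur', 0x20ac));  // u-cur-20ac
console.log(friendlyNcr('l9', null, 0xa4));    // l9-a4
console.log(friendlyNcr('cp1252', null, 0x80)); // cp1252-80
```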
|
464 |
|
465 | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
|
466 |
|
467 | ### Extended NCRs
|
468 |
|
469 | ## Other Terms
|
470 |
|
471 | ### Character Identifier (CID)
|
472 |
|
473 | Probably differs from the concept as used by [Adobe](http://en.wikipedia.org/wiki/PostScript_fonts#CID)
|
474 |
|
475 | ### Codepoint (Codeposition)
|
476 |
|
477 | ### Codeunit
|
478 |
|
479 | ### Character Set Sigil (CSG)
|
480 |
|
481 |
|
482 |
|
483 |
|