# chardet

_Chardet_ is a character encoding detection module written in pure JavaScript (TypeScript). The module uses occurrence analysis to determine the most probable encoding.

- Packed size is only **22 KB**
- Works in all environments: Node / Browser / Native
- Works on all platforms: Linux / Mac / Windows
- No dependencies
- No native code / bindings
- 100% written in TypeScript
- Extensive code coverage

## Installation

```
npm i chardet
```

## Usage

To return the encoding with the highest confidence:

```javascript
import chardet from 'chardet';

// From an in-memory buffer
const encoding = chardet.detect(Buffer.from('hello there!'));
// or, asynchronously from a file
const encodingFromFile = await chardet.detectFile('/path/to/file');
// or, synchronously from a file
const encodingFromFileSync = chardet.detectFileSync('/path/to/file');
```
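
Once an encoding name has been detected, a common next step is to decode the bytes into a string. The sketch below uses the standard `TextDecoder` API. `decodeWith` is a hypothetical helper, not part of chardet, and it assumes the detected name is a label the runtime's `TextDecoder` recognises (this is not guaranteed for every supported encoding), so it falls back to UTF-8 otherwise.

```javascript
// Hypothetical helper (not part of chardet): decode bytes with a detected
// encoding name, falling back when the name is missing or unrecognised.
function decodeWith(bytes, encodingName, fallback = 'utf-8') {
  let decoder;
  try {
    // `encodingName` may be null or undefined if nothing was detected.
    decoder = new TextDecoder(encodingName || fallback);
  } catch (err) {
    // TextDecoder throws for labels it does not know.
    decoder = new TextDecoder(fallback);
  }
  return decoder.decode(bytes);
}

// Usage, assuming `encoding` came from chardet.detect(bytes):
//   const text = decodeWith(bytes, encoding);
```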

To return the full list of possible encodings, use the `analyse` method:

```javascript
import chardet from 'chardet';
chardet.analyse(Buffer.from('hello there!'));
```

The returned value is an array of objects sorted by confidence in descending order:

```javascript
[
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' },
];
```
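
Because the result is a plain array, it can be post-processed with ordinary array methods. As a minimal sketch, the hypothetical helper `pickCandidates` below (not part of chardet) keeps only the names at or above a confidence threshold; the default of 50 is an arbitrary assumption, and the input reuses the sample result shown above.

```javascript
// Hypothetical helper (not part of chardet): names of all candidates whose
// confidence is at least `minConfidence`, preserving the original order.
function pickCandidates(matches, minConfidence = 50) {
  return matches
    .filter((match) => match.confidence >= minConfidence)
    .map((match) => match.name);
}

// Sample data copied from the result shape shown above.
const sampleMatches = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' },
];

console.log(pickCandidates(sampleMatches)); // [ 'UTF-8' ]
console.log(pickCandidates(sampleMatches, 10)); // [ 'UTF-8', 'windows-1252' ]
```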

In the browser, you can use a [Uint8Array](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8Array) instead of a `Buffer`:

```javascript
import chardet from 'chardet';
chardet.analyse(new Uint8Array([0x68, 0x65, 0x6c, 0x6c, 0x6f]));
```
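
Browser APIs often hand back an `ArrayBuffer` or some other typed-array view rather than a `Uint8Array`. The sketch below shows one way to normalise such inputs before analysis; `toUint8Array` is a hypothetical helper, not part of chardet.

```javascript
// Hypothetical helper (not part of chardet): view common binary inputs as a
// Uint8Array without copying the underlying bytes.
function toUint8Array(input) {
  if (input instanceof Uint8Array) return input;
  if (input instanceof ArrayBuffer) return new Uint8Array(input);
  if (ArrayBuffer.isView(input)) {
    // Respect the view's window into its backing buffer.
    return new Uint8Array(input.buffer, input.byteOffset, input.byteLength);
  }
  throw new TypeError('Expected an ArrayBuffer or a typed array');
}

// Usage, assuming `file` is a File taken from an <input type="file">:
//   const bytes = toUint8Array(await file.arrayBuffer());
//   chardet.analyse(bytes);
```

The views share memory with the original buffer, so later changes to that buffer are visible through the returned array.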

## Working with large data sets

When the data set is large and you want to trade some accuracy for performance, you can sample only the first N bytes of the buffer:

```javascript
chardet
  .detectFile('/path/to/file', { sampleSize: 32 })
  .then((encoding) => console.log(encoding));
```

You can also specify where to begin reading from in the buffer:

```javascript
chardet
  .detectFile('/path/to/file', { sampleSize: 32, offset: 128 })
  .then((encoding) => console.log(encoding));
```
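
To make the sampling window concrete, here is a minimal sketch of the same idea applied to bytes already in memory. `sampleBytes` is a hypothetical helper that only illustrates which bytes an `offset` and `sampleSize` pair selects; it is not a description of chardet's internals.

```javascript
// Hypothetical helper (not part of chardet): return the window of `bytes`
// that starts at `offset` and spans at most `sampleSize` bytes.
function sampleBytes(bytes, { sampleSize, offset = 0 } = {}) {
  const end = sampleSize === undefined ? bytes.length : offset + sampleSize;
  // subarray clamps out-of-range indices and does not copy the data.
  return bytes.subarray(offset, end);
}

const data = Uint8Array.from([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]);
console.log(sampleBytes(data, { sampleSize: 3, offset: 2 })); // bytes 2, 3, 4
```

A window that extends past the end of the data is simply shortened, so a small input with a large `sampleSize` still yields whatever bytes are available.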

## Supported Encodings

- UTF-8
- UTF-16 LE
- UTF-16 BE
- UTF-32 LE
- UTF-32 BE
- ISO-2022-JP
- ISO-2022-KR
- ISO-2022-CN
- Shift_JIS
- Big5
- EUC-JP
- EUC-KR
- GB18030
- ISO-8859-1
- ISO-8859-2
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- windows-1250
- windows-1251
- windows-1252
- windows-1253
- windows-1254
- windows-1255
- windows-1256
- KOI8-R

Currently only these encodings are supported.

## TypeScript?

Yes. Type definitions are included.

### References

- ICU project http://site.icu-project.org/