# chardet

*Chardet* is a character encoding detection module written in pure JavaScript (TypeScript). It uses occurrence analysis to determine the most probable encoding.

- Packed size is only **22 KB**
- Works in all environments: Node / Browser / Native
- Works on all platforms: Linux / Mac / Windows
- No dependencies
- No native code / bindings
- 100% written in TypeScript
- Extensive code coverage

## Installation

```
npm i chardet
```

## Usage

To return the encoding with the highest confidence:

```javascript
const chardet = require('chardet');

chardet.detect(Buffer.from('hello there!'));
// or
chardet.detectFile('/path/to/file').then(encoding => console.log(encoding));
// or
chardet.detectFileSync('/path/to/file');
```

To return the full list of possible encodings, use the `analyse` method:

```javascript
const chardet = require('chardet');
chardet.analyse(Buffer.from('hello there!'));
```

The returned value is an array of objects sorted by confidence in descending order:

```javascript
[
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];
```
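
Because the result is a plain array, it can be post-processed with ordinary array methods. The following is a minimal sketch that keeps only candidates at or above a minimum confidence. The helper name `pickCandidates` and the threshold of 50 are illustrative assumptions and not part of chardet; the input is the sample result shown above.

```javascript
// Hypothetical helper, not part of chardet: keep only the candidates whose
// confidence is at least `minConfidence`.
function pickCandidates(matches, minConfidence) {
  return matches.filter(match => match.confidence >= minConfidence);
}

// Sample data copied from the result shown above; in real use this would be
// the return value of chardet.analyse(buffer).
const matches = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];

const likely = pickCandidates(matches, 50);
console.log(likely.map(match => match.name)); // [ 'UTF-8' ]
```

Filtering preserves the original order, so the first remaining entry is still the most confident one.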

## Working with large data sets

When the data set is huge and you want to optimize performance (at the cost of some accuracy),
you can sample only the first N bytes of the buffer:

```javascript
chardet
  .detectFile('/path/to/file', { sampleSize: 32 })
  .then(encoding => console.log(encoding));
```
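
To illustrate what sampling means for data that is already in memory, the sketch below takes the first N bytes of a buffer by hand. It illustrates the idea only and does not describe how chardet implements `sampleSize` internally; the helper name `firstBytes` is hypothetical.

```javascript
// Hypothetical helper, not part of chardet: return a view of the first
// `sampleSize` bytes of `buffer` (or the whole buffer if it is shorter).
function firstBytes(buffer, sampleSize) {
  return buffer.subarray(0, sampleSize);
}

// 13 bytes repeated 1000 times stands in for a large input.
const data = Buffer.from('hello there! '.repeat(1000));
const sample = firstBytes(data, 32);

console.log(sample.length); // 32
// The sample could then be analysed instead of the full buffer, e.g.
// chardet.detect(sample);
```

`Buffer.prototype.subarray` returns a view on the same memory rather than a copy, so taking the sample is cheap even for very large buffers.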

## Supported Encodings:

- UTF-8
- UTF-16 LE
- UTF-16 BE
- UTF-32 LE
- UTF-32 BE
- ISO-2022-JP
- ISO-2022-KR
- ISO-2022-CN
- Shift_JIS
- Big5
- EUC-JP
- EUC-KR
- GB18030
- ISO-8859-1
- ISO-8859-2
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- windows-1250
- windows-1251
- windows-1252
- windows-1253
- windows-1254
- windows-1255
- windows-1256
- KOI8-R

Currently only these encodings are supported.

## TypeScript?

Yes. Type definitions are included.

### References

- ICU project http://site.icu-project.org/