# BudouX JavaScript module

BudouX is a standalone, small, and language-neutral phrase segmenter tool that
provides beautiful and legible line breaks.

For more details about the project, please refer to the [project README](https://github.com/google/budoux/).

## Demo

<https://google.github.io/budoux>

## Install

```shellsession
$ npm install budoux
```

## Usage

### Simple usage

You can get a list of phrases by feeding a sentence to the parser.
The easiest way to get a parser is to load the default parser for each language.

**Japanese:**

```javascript
import { loadDefaultJapaneseParser } from 'budoux';
const parser = loadDefaultJapaneseParser();
console.log(parser.parse('今日は天気です。'));
// ['今日は', '天気です。']
```

**Simplified Chinese:**

```javascript
import { loadDefaultSimplifiedChineseParser } from 'budoux';
const parser = loadDefaultSimplifiedChineseParser();
console.log(parser.parse('是今天的天气。'));
// ['是', '今天', '的', '天气。']
```

**Traditional Chinese:**

```javascript
import { loadDefaultTraditionalChineseParser } from 'budoux';
const parser = loadDefaultTraditionalChineseParser();
console.log(parser.parse('是今天的天氣。'));
// ['是', '今天', '的', '天氣。']
```

**Thai:**

```javascript
import { loadDefaultThaiParser } from 'budoux';
const parser = loadDefaultThaiParser();
console.log(parser.parse('วันนี้อากาศดี'));
// ['วัน', 'นี้', 'อากาศ', 'ดี']
```
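
Once you have a phrase list, you can decide where lines are allowed to break. The sketch below is one possible use and is not part of the BudouX API: `wrapPhrases` is a hypothetical helper that packs phrases into lines of at most `maxChars` characters and breaks only between phrases. It assumes every character has the same width and measures length in UTF-16 code units, which is a simplification.

```javascript
// Hypothetical helper (not part of BudouX): packs phrases into lines of at
// most `maxChars` characters, breaking only at phrase boundaries.
// A phrase longer than `maxChars` is kept whole on its own line.
function wrapPhrases(phrases, maxChars) {
  const lines = [];
  let current = '';
  for (const phrase of phrases) {
    // Start a new line if adding this phrase would exceed the limit.
    if (current !== '' && current.length + phrase.length > maxChars) {
      lines.push(current);
      current = '';
    }
    current += phrase;
  }
  if (current !== '') lines.push(current);
  return lines;
}

// The phrase list is copied from the Japanese example above.
const phrases = ['今日は', '天気です。'];
console.log(wrapPhrases(phrases, 5));
// ['今日は', '天気です。']
console.log(wrapPhrases(phrases, 8));
// ['今日は天気です。']
```

In a browser you would normally let the layout engine do this work, as the following sections show; a helper like this is mainly useful for plain-text output.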

### Translating an HTML string

You can also translate an HTML string to mark phrase boundaries with zero-width
spaces (U+200B). The result is wrapped in a `<span>` styled with
`word-break: keep-all; overflow-wrap: anywhere;`.

```javascript
// `parser` is a parser loaded as in the examples above.
console.log(parser.translateHTMLString('今日は<b>とても天気</b>です。'));
// <span style="word-break: keep-all; overflow-wrap: anywhere;">今日は<b>\u200bとても\u200b天気</b>です。</span>
```

Note that the separators are shown as `\u200b` in the example above for
illustrative purposes; in the actual output they are invisible zero-width space
characters.
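
Because the separators are invisible, it can help to make them visible while debugging. The following is a minimal sketch; `showZeroWidthSpaces` is a hypothetical helper and not part of BudouX, and the sample string stands in for a value returned by `translateHTMLString`.

```javascript
// Hypothetical helper (not part of BudouX): makes zero-width spaces visible
// by replacing each U+200B character with the literal text "\u200b".
function showZeroWidthSpaces(text) {
  return text.replace(/\u200b/g, '\\u200b');
}

// Stand-in for part of a string returned by translateHTMLString.
const output = '今日は<b>\u200bとても\u200b天気</b>です。';
console.log(showZeroWidthSpaces(output));
// 今日は<b>\u200bとても\u200b天気</b>です。
```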

### Applying to an HTML element

You can also pass an HTML element to the parser to apply the same processing in place.

```javascript
const ele = document.querySelector('p.budou-this');
console.log(ele.outerHTML);
// <p class="budou-this">今日は<b>とても天気</b>です。</p>
parser.applyToElement(ele);
console.log(ele.outerHTML);
// <p class="budou-this" style="word-break: keep-all; overflow-wrap: anywhere;">今日は<b>\u200bとても\u200b天気</b>です。</p>
```

Internally, `applyToElement` calls the [`HTMLProcessor`]'s `applyToElement`
function with the zero-width space as the separator.

You can use the [`HTMLProcessor`] class directly if desired.
For example:

```javascript
import { HTMLProcessor } from 'budoux';
// `parser` is a parser loaded as in the examples above.
const ele = document.querySelector('p.budou-this');
const htmlProcessor = new HTMLProcessor(parser, {
  separator: ' '
});
htmlProcessor.applyToElement(ele);
```

[`HTMLProcessor`]: https://github.com/google/budoux/blob/main/javascript/src/html_processor.ts

### Loading a custom model

You can load your own custom model as follows.

```javascript
import { Parser } from 'budoux';
const model = JSON.parse('{"UW4": {"a": 133}}'); // Content of the custom model JSON file.
const parser = new Parser(model);
parser.parse('xyzabc'); // ['xyz', 'abc']
```
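
To see why this toy model splits `xyzabc` before `a`, the sketch below mimics the scoring under stated assumptions: that `UW4` scores the character immediately after a candidate boundary, that every candidate starts from a base score of minus half the sum of all weights in the model, and that a boundary is placed wherever the total is positive. `toySegment` is a hypothetical name, only the `UW4` feature is handled, and the real parser may differ in detail.

```javascript
// Illustrative sketch only; this is not BudouX's implementation.
// Assumption: model.UW4 maps the character right after a candidate boundary
// to a score, and a boundary is placed when the total score is positive.
function toySegment(model, text) {
  if (text === '') return [];
  const uw4 = model.UW4 || {};
  // Assumed base score: minus half the sum of all weights in the model.
  const baseScore = -0.5 * Object.values(uw4).reduce((sum, w) => sum + w, 0);
  const phrases = [];
  let start = 0;
  for (let i = 1; i < text.length; i++) {
    const score = baseScore + (uw4[text[i]] || 0);
    if (score > 0) {
      phrases.push(text.slice(start, i));
      start = i;
    }
  }
  phrases.push(text.slice(start));
  return phrases;
}

const toyModel = JSON.parse('{"UW4": {"a": 133}}');
console.log(toySegment(toyModel, 'xyzabc'));
// ['xyz', 'abc']
```

With this model the base score is -66.5, so only a position followed by `a` (score 133) ends up positive.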

## Web components

BudouX also offers Web components to integrate the parser with your website quickly.
All you have to do is wrap sentences with:

- `<budoux-ja>` for Japanese
- `<budoux-zh-hans>` for Simplified Chinese
- `<budoux-zh-hant>` for Traditional Chinese
- `<budoux-th>` for Thai

```html
<budoux-ja>今日は天気です。</budoux-ja>
<budoux-zh-hans>今天是晴天。</budoux-zh-hans>
<budoux-zh-hant>今天是晴天。</budoux-zh-hant>
<budoux-th>วันนี้อากาศดี</budoux-th>
```

To enable a custom element, add the script tag that loads the bundle for its language.

```html
<!-- For Japanese -->
<script src="https://unpkg.com/budoux/bundle/budoux-ja.min.js"></script>

<!-- For Simplified Chinese -->
<script src="https://unpkg.com/budoux/bundle/budoux-zh-hans.min.js"></script>

<!-- For Traditional Chinese -->
<script src="https://unpkg.com/budoux/bundle/budoux-zh-hant.min.js"></script>

<!-- For Thai -->
<script src="https://unpkg.com/budoux/bundle/budoux-th.min.js"></script>
```

Alternatively, if you want to bundle the component with the rest of your source code,
you can import the component as shown below.

```javascript
// For Japanese
import 'budoux/module/webcomponents/budoux-ja';

// For Simplified Chinese
import 'budoux/module/webcomponents/budoux-zh-hans';

// For Traditional Chinese
import 'budoux/module/webcomponents/budoux-zh-hant';

// For Thai
import 'budoux/module/webcomponents/budoux-th';
```

## CLI

You can also format inputs on your terminal with the `budoux` command.

```shellsession
$ budoux 本日は晴天です。
本日は
晴天です。
```

```shellsession
$ echo $'本日は晴天です。\n明日は曇りでしょう。' | budoux
本日は
晴天です。
---
明日は
曇りでしょう。
```
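
If you capture this text-mode output in a script, you may want to turn it back into one phrase list per sentence. The following is a minimal sketch rather than a BudouX feature: `parseCliOutput` is a hypothetical helper, and it assumes that phrases are separated by newlines and that no phrase is identical to the delimiter.

```javascript
// Hypothetical helper (not part of BudouX): splits text-mode CLI output into
// one phrase list per input sentence. `delim` should match the -d option.
function parseCliOutput(output, delim = '---') {
  const sentences = [[]];
  for (const line of output.split('\n')) {
    if (line === delim) {
      // A delimiter line starts the next sentence.
      sentences.push([]);
    } else if (line !== '') {
      sentences[sentences.length - 1].push(line);
    }
  }
  // Drop empty groups, for example after a trailing delimiter.
  return sentences.filter((phrases) => phrases.length > 0);
}

// Output copied from the example above.
const cliOutput = '本日は\n晴天です。\n---\n明日は\n曇りでしょう。\n';
console.log(parseCliOutput(cliOutput));
// [['本日は', '晴天です。'], ['明日は', '曇りでしょう。']]
```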

```shellsession
$ budoux 本日は晴天です。 -H
<span style="word-break: keep-all; overflow-wrap: anywhere;">本日は\u200b晴天です。</span>
```

Note that the separators are shown as `\u200b` in the example above for
illustrative purposes; in the actual output they are invisible zero-width space
characters.

To see the help message, run `budoux -h`.

```shellsession
$ budoux -h
Usage: budoux [-h] [-H] [-d STR] [-m JSON] [-V] [TXT]

BudouX is the successor to Budou, the machine learning powered line break organizer tool.

Arguments:
  txt                 text

Options:
  -H, --html          HTML mode (default: false)
  -d, --delim <str>   output delimiter in TEXT mode (default: "---")
  -m, --model <json>  custom model file path
  -V, --version       output the version number
  -h, --help          display help for command
```

## Caveat

BudouX accepts HTML input and returns HTML strings with markup applied to wrap
phrases, but it's not meant to be used as an HTML sanitizer.
**BudouX doesn't sanitize any inputs.**
Malicious HTML inputs yield malicious HTML outputs.
Please use it with an appropriate sanitizer library if you don't trust the input.
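
When the untrusted input is meant to be plain text rather than HTML, one option is to escape it before passing it on. The sketch below assumes that use case only: `escapeHtml` is a hypothetical helper, not part of BudouX, and it does not replace a sanitizer library when the input has to keep its markup.

```javascript
// Hypothetical helper (not part of BudouX): escapes characters that are
// special in HTML so untrusted plain text is treated as text, not markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;') // Must run first so later entities stay intact.
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const untrusted = '<img src=x onerror=alert(1)>今日は天気です。';
console.log(escapeHtml(untrusted));
// &lt;img src=x onerror=alert(1)&gt;今日は天気です。
```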

## Author

[Shuhei Iitsuka](https://tushuhei.com)

## Disclaimer

This is not an officially supported Google product.