# BudouX JavaScript module

BudouX is a standalone, small, and language-neutral phrase segmenter tool that
provides beautiful and legible line breaks.

For more details about the project, please refer to the [project README](https://github.com/google/budoux/).

## Demo

<https://google.github.io/budoux>

## Install

```shellsession
$ npm install budoux
```

## Usage

### Simple usage

You can get a list of phrases by feeding a sentence to the parser.
The easiest way to get a parser is to load the default parser for each language.

**Japanese:**

```javascript
import { loadDefaultJapaneseParser } from 'budoux';
const parser = loadDefaultJapaneseParser();
console.log(parser.parse('今日は天気です。'));
// ['今日は', '天気です。']
```

**Simplified Chinese:**

```javascript
import { loadDefaultSimplifiedChineseParser } from 'budoux';
const parser = loadDefaultSimplifiedChineseParser();
console.log(parser.parse('是今天的天气。'));
// ['是', '今天', '的', '天气。']
```

**Traditional Chinese:**

```javascript
import { loadDefaultTraditionalChineseParser } from 'budoux';
const parser = loadDefaultTraditionalChineseParser();
console.log(parser.parse('是今天的天氣。'));
// ['是', '今天', '的', '天氣。']
```

**Thai:**

```javascript
import { loadDefaultThaiParser } from 'budoux';
const parser = loadDefaultThaiParser();
console.log(parser.parse('วันนี้อากาศดี'));
// ['วัน', 'นี้', 'อากาศ', 'ดี']
```
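
If you want to mark the phrase boundaries yourself, one option is to join the returned phrases with a zero-width space (U+200B), the same separator used elsewhere in this README. The following is a minimal sketch; `joinPhrases` is a hypothetical helper, not part of BudouX, and the input array is copied from the Japanese example above.

```javascript
// Hypothetical helper (not part of BudouX): joins phrases with a separator
// to mark each phrase boundary as a potential break point.
const ZERO_WIDTH_SPACE = '\u200b';

function joinPhrases(phrases, separator = ZERO_WIDTH_SPACE) {
  return phrases.join(separator);
}

// Phrase list taken from the Japanese example above.
const phrases = ['今日は', '天気です。'];
const joined = joinPhrases(phrases);

console.log(joined.length); // 9: eight visible characters plus one U+200B
console.log(joined.split(ZERO_WIDTH_SPACE)); // splits back into the two phrases
```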

### Translating an HTML string

You can also translate an HTML string to wrap phrases with non-breaking markup.
Phrases are separated by zero-width spaces (U+200B), and the result is wrapped
in a styled `<span>`.

```javascript
// `parser` is the Japanese parser loaded in the example above.
console.log(parser.translateHTMLString('今日は<b>とても天気</b>です。'));
// <span style="word-break: keep-all; overflow-wrap: anywhere;">今日は<b>\u200bとても\u200b天気</b>です。</span>
```

Note that separators are shown as `\u200b` in the example above for
illustration only; the actual output contains zero-width spaces, which are
invisible.
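
Because the separators are invisible, it can help to make them visible while debugging. The sketch below shows one way to do that; `showSeparators` is a hypothetical helper, not part of BudouX, and the sample string simply mirrors the inner markup of the output above.

```javascript
// Hypothetical helper (not part of BudouX): replaces each zero-width space
// with the literal text "\u200b" so separators become visible in logs.
function showSeparators(text) {
  return text.replace(/\u200b/g, '\\u200b');
}

// A string containing real zero-width spaces, mirroring the output above.
const output = '今日は<b>\u200bとても\u200b天気</b>です。';
console.log(showSeparators(output));
// 今日は<b>\u200bとても\u200b天気</b>です。
```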

### Applying to an HTML element

You can also pass an HTML element to the parser to apply the same process in place.

```javascript
const ele = document.querySelector('p.budou-this');
console.log(ele.outerHTML);
// <p class="budou-this">今日は<b>とても天気</b>です。</p>
parser.applyToElement(ele);
console.log(ele.outerHTML);
// <p class="budou-this" style="word-break: keep-all; overflow-wrap: anywhere;">今日は<b>\u200bとても\u200b天気</b>です。</p>
```

Internally, the parser's `applyToElement` calls [`HTMLProcessor`]'s `applyToElement`
method with the zero-width space as the separator.

You can use the [`HTMLProcessor`] class directly if desired, for example to use
a different separator:

```javascript
import { HTMLProcessor } from 'budoux';
const ele = document.querySelector('p.budou-this');
const htmlProcessor = new HTMLProcessor(parser, {
  separator: ' '
});
htmlProcessor.applyToElement(ele);
```

[`HTMLProcessor`]: https://github.com/google/budoux/blob/main/javascript/src/html_processor.ts

### Loading a custom model

You can load your own custom model as follows.

```javascript
import { Parser } from 'budoux';
const model = JSON.parse('{"UW4": {"a": 133}}'); // Content of the custom model JSON file.
const parser = new Parser(model);
parser.parse('xyzabc'); // ['xyz', 'abc']
```
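
In practice the model JSON usually lives in a file rather than an inline string. The sketch below reads it with Node's `fs` module; `loadModelFromFile` and the path `./my-model.json` are hypothetical names for illustration, and the block uses CommonJS `require` so that it runs as a plain Node script.

```javascript
const fs = require('node:fs');

// Hypothetical helper (not part of BudouX): reads a model from a JSON file
// and returns the parsed object, rejecting anything that is not a JSON object.
function loadModelFromFile(modelPath) {
  const text = fs.readFileSync(modelPath, 'utf8');
  const model = JSON.parse(text);
  if (model === null || typeof model !== 'object' || Array.isArray(model)) {
    throw new TypeError(`Expected a JSON object in ${modelPath}`);
  }
  return model;
}

// Usage, with `Parser` imported from 'budoux' as in the example above:
//   const parser = new Parser(loadModelFromFile('./my-model.json'));
```

The object check is only a basic sanity test; it does not verify that the file is a valid BudouX model.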

## Web components

BudouX also offers Web components to integrate the parser with your website quickly.
All you have to do is wrap sentences with:

- `<budoux-ja>` for Japanese
- `<budoux-zh-hans>` for Simplified Chinese
- `<budoux-zh-hant>` for Traditional Chinese
- `<budoux-th>` for Thai

```html
<budoux-ja>今日は天気です。</budoux-ja>
<budoux-zh-hans>今天是晴天。</budoux-zh-hans>
<budoux-zh-hant>今天是晴天。</budoux-zh-hant>
<budoux-th>วันนี้อากาศดี</budoux-th>
```

To enable a custom element, load the bundle for the corresponding language.

```html
<!-- For Japanese -->
<script src="https://unpkg.com/budoux/bundle/budoux-ja.min.js"></script>

<!-- For Simplified Chinese -->
<script src="https://unpkg.com/budoux/bundle/budoux-zh-hans.min.js"></script>

<!-- For Traditional Chinese -->
<script src="https://unpkg.com/budoux/bundle/budoux-zh-hant.min.js"></script>

<!-- For Thai -->
<script src="https://unpkg.com/budoux/bundle/budoux-th.min.js"></script>
```

Alternatively, if you prefer to bundle the component with the rest of your source code,
import it as shown below.

```javascript
// For Japanese
import 'budoux/module/webcomponents/budoux-ja';

// For Simplified Chinese
import 'budoux/module/webcomponents/budoux-zh-hans';

// For Traditional Chinese
import 'budoux/module/webcomponents/budoux-zh-hant';

// For Thai
import 'budoux/module/webcomponents/budoux-th';
```

## CLI

You can also format inputs on your terminal with the `budoux` command.

```shellsession
$ budoux 本日は晴天です。
本日は
晴天です。
```

```shellsession
$ echo $'本日は晴天です。\n明日は曇りでしょう。' | budoux
本日は
晴天です。
---
明日は
曇りでしょう。
```
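
If you consume this text-mode output from another script, you may want to group the phrases by input sentence. The sketch below splits on the delimiter line, assuming one phrase per line as in the example above; `groupCliOutput` is a hypothetical helper, not part of BudouX.

```javascript
// Hypothetical helper (not part of BudouX): groups text-mode CLI output into
// one phrase list per input sentence, splitting on the delimiter line.
function groupCliOutput(output, delimiter = '---') {
  const groups = [[]];
  for (const line of output.split('\n')) {
    if (line === delimiter) {
      groups.push([]); // a delimiter line starts the next sentence
    } else if (line !== '') {
      groups[groups.length - 1].push(line);
    }
  }
  // Drop empty groups, e.g. when the output is empty or ends with a delimiter.
  return groups.filter(group => group.length > 0);
}

// Output copied from the piped example above.
const output = '本日は\n晴天です。\n---\n明日は\n曇りでしょう。\n';
console.log(JSON.stringify(groupCliOutput(output)));
// [["本日は","晴天です。"],["明日は","曇りでしょう。"]]
```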

```shellsession
$ budoux 本日は晴天です。 -H
<span style="word-break: keep-all; overflow-wrap: anywhere;">本日は\u200b晴天です。</span>
```

Note that separators are shown as `\u200b` in the example above for
illustration only; the actual output contains zero-width spaces, which are
invisible.

To see the help message, run `budoux -h`.

```shellsession
$ budoux -h
Usage: budoux [-h] [-H] [-d STR] [-m JSON] [-V] [TXT]

BudouX is the successor to Budou, the machine learning powered line break organizer tool.

Arguments:
  txt                 text

Options:
  -H, --html          HTML mode (default: false)
  -d, --delim <str>   output delimiter in TEXT mode (default: "---")
  -m, --model <json>  custom model file path
  -V, --version       output the version number
  -h, --help          display help for command
```

## Caveat

BudouX supports HTML inputs and outputs HTML strings with markup applied to wrap
phrases, but it's not meant to be used as an HTML sanitizer.
**BudouX doesn't sanitize any inputs.**
Malicious HTML inputs yield malicious HTML outputs.
Please use it with an appropriate sanitizer library if you don't trust the input.
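
When the untrusted input is meant to be plain text rather than markup, one simple precaution is to escape it before it is placed into HTML. The sketch below illustrates that idea only; `escapeHtml` is a hypothetical helper, not part of BudouX, and it is not a substitute for a sanitizer library when some markup must be preserved.

```javascript
// Hypothetical helper (not part of BudouX): escapes the five characters that
// are significant in HTML so untrusted text is rendered literally.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtml(text) {
  return text.replace(/[&<>"']/g, ch => HTML_ESCAPES[ch]);
}

const untrusted = '<img src=x onerror="alert(1)">今日は天気です。';
console.log(escapeHtml(untrusted));
// &lt;img src=x onerror=&quot;alert(1)&quot;&gt;今日は天気です。
```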

## Author

[Shuhei Iitsuka](https://tushuhei.com)

## Disclaimer

This is not an officially supported Google product.