<!-- markdownlint-disable MD014 -->
# BudouX JavaScript module

BudouX is a standalone, small, and language-neutral phrase segmenter tool that
provides beautiful and legible line breaks.

For more details about the project, please refer to the [project README](https://github.com/google/budoux/).

## Demo

<https://google.github.io/budoux>

## Install

```shellsession
$ npm install budoux
```

## Usage

### Simple usage

You can get a list of phrases by feeding a sentence to the parser.
The easiest way to get a parser is to load the default parser for each language.

**Japanese:**

```javascript
import { loadDefaultJapaneseParser } from 'budoux';
const parser = loadDefaultJapaneseParser();
console.log(parser.parse('今日は天気です。'));
// ['今日は', '天気です。']
```

**Simplified Chinese:**

```javascript
import { loadDefaultSimplifiedChineseParser } from 'budoux';
const parser = loadDefaultSimplifiedChineseParser();
console.log(parser.parse('是今天的天气。'));
// ['是', '今天', '的', '天气。']
```

**Traditional Chinese:**

```javascript
import { loadDefaultTraditionalChineseParser } from 'budoux';
const parser = loadDefaultTraditionalChineseParser();
console.log(parser.parse('是今天的天氣。'));
// ['是', '今天', '的', '天氣。']
```

**Thai:**

```javascript
import { loadDefaultThaiParser } from 'budoux';
const parser = loadDefaultThaiParser();
console.log(parser.parse('วันนี้อากาศดี'));
// ['วัน', 'นี้', 'อากาศ', 'ดี']
```

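Because `parse` returns a plain array of phrases, you can join them with any separator you like. The following is a minimal sketch: `joinPhrases` is a hypothetical helper name, not part of BudouX, and the phrase array is copied from the Japanese example output above rather than computed.

```javascript
// Hypothetical helper (not part of BudouX): joins phrases with a separator.
// Defaults to the zero-width space (U+200B).
function joinPhrases(phrases, separator = '\u200b') {
  return phrases.join(separator);
}

// Phrases copied from the Japanese example above, i.e. the documented
// result of parser.parse('今日は天気です。').
const phrases = ['今日は', '天気です。'];
const joined = joinPhrases(phrases); // phrases separated by U+200B
const visible = joinPhrases(phrases, '|');
console.log(visible); // 今日は|天気です。
```

A visible separator such as `|` is handy for checking where the parser placed the phrase boundaries.
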
### Translating an HTML string

You can also translate an HTML string so that it is wrapped with non-breaking
markup and its phrases are separated by zero-width spaces (U+200B).

```javascript
console.log(parser.translateHTMLString('今日は<b>とても天気</b>です。'));
// <span style="word-break: keep-all; overflow-wrap: anywhere;">今日は<b>\u200bとても\u200b天気</b>です。</span>
```

Please note that separators are shown as `\u200b` in the example above for
illustrative purposes; in the actual output the separator is an invisible
zero-width space.

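Since the separators are invisible, it can help to make them visible while debugging. The sketch below uses a hypothetical helper, `revealZeroWidthSpaces`, which is not part of BudouX. The sample string only stands in for real `translateHTMLString` output.

```javascript
// Hypothetical debugging helper (not part of BudouX): replaces each
// zero-width space (U+200B) with the visible six-character text "\u200b".
function revealZeroWidthSpaces(text) {
  return text.replace(/\u200b/g, '\\u200b');
}

// Sample string standing in for parser.translateHTMLString(...) output.
const output = '今日は<b>\u200bとても\u200b天気</b>です。';
console.log(revealZeroWidthSpaces(output));
// 今日は<b>\u200bとても\u200b天気</b>です。
```
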
### Applying to an HTML element

You can also feed an HTML element to the parser to apply the process.

```javascript
const ele = document.querySelector('p.budou-this');
console.log(ele.outerHTML);
// <p class="budou-this">今日は<b>とても天気</b>です。</p>
parser.applyToElement(ele);
console.log(ele.outerHTML);
// <p class="budou-this" style="word-break: keep-all; overflow-wrap: anywhere;">今日は<b>\u200bとても\u200b天気</b>です。</p>
```

Internally, `applyToElement` calls [`HTMLProcessor`]'s `applyToElement`
function with the zero-width space as the separator.

You can use the [`HTMLProcessor`] class directly if desired.
For example:

```javascript
import { HTMLProcessor } from 'budoux';
const ele = document.querySelector('p.budou-this');
const htmlProcessor = new HTMLProcessor(parser, {
  separator: ' '
});
htmlProcessor.applyToElement(ele);
```

[`HTMLProcessor`]: https://github.com/google/budoux/blob/main/javascript/src/html_processor.ts

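The examples above handle a single element. To process every matching element on a page, one possible approach is a small loop such as the sketch below. `applyToAll` is a hypothetical helper, not part of BudouX. It only assumes that `parser` exposes the `applyToElement` method shown above and that `root` supports `querySelectorAll`.

```javascript
// Hypothetical helper (not part of BudouX): applies a parser to every
// element under `root` that matches `selector` and returns how many
// elements were processed.
// `parser` is any object with an applyToElement(element) method;
// `root` is a Document or Element.
function applyToAll(parser, root, selector) {
  const elements = Array.from(root.querySelectorAll(selector));
  for (const element of elements) {
    parser.applyToElement(element);
  }
  return elements.length;
}
```

In a browser this could be called as `applyToAll(parser, document, 'p.budou-this')`.
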
### Loading a custom model

You can load your own custom model as follows.

```javascript
import { Parser } from 'budoux';
const model = JSON.parse('{"UW4": {"a": 133}}'); // Content of the custom model JSON file.
const parser = new Parser(model);
parser.parse('xyzabc'); // ['xyz', 'abc']
```

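Before handing a custom model to `Parser`, you may want a quick sanity check on the parsed JSON. The sketch below assumes a model is an object whose values are themselves objects mapping feature strings to numeric scores. That shape is inferred only from the inline example above and is not a documented schema. `looksLikeModel` is a hypothetical helper, not part of BudouX.

```javascript
// Hypothetical check (not part of BudouX). Assumes a model is an object whose
// values are objects mapping feature strings to numeric scores, as in the
// inline example {"UW4": {"a": 133}}.
function looksLikeModel(model) {
  const isPlainObject = (value) =>
    typeof value === 'object' && value !== null && !Array.isArray(value);
  if (!isPlainObject(model)) return false;
  return Object.values(model).every(
    (group) =>
      isPlainObject(group) &&
      Object.values(group).every((score) => Number.isFinite(score))
  );
}

console.log(looksLikeModel(JSON.parse('{"UW4": {"a": 133}}'))); // true
console.log(looksLikeModel(JSON.parse('{"UW4": "oops"}'))); // false
```
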
## Web components

BudouX also offers Web components to integrate the parser with your website quickly.
All you have to do is wrap sentences with:

- `<budoux-ja>` for Japanese
- `<budoux-zh-hans>` for Simplified Chinese
- `<budoux-zh-hant>` for Traditional Chinese
- `<budoux-th>` for Thai

```html
<budoux-ja>今日は天気です。</budoux-ja>
<budoux-zh-hans>今天是晴天。</budoux-zh-hans>
<budoux-zh-hant>今天是晴天。</budoux-zh-hant>
<budoux-th>วันนี้อากาศดี</budoux-th>
```

To enable a custom element, add the corresponding line below to load its bundle.

```html
<!-- For Japanese -->
<script src="https://unpkg.com/budoux/bundle/budoux-ja.min.js"></script>

<!-- For Simplified Chinese -->
<script src="https://unpkg.com/budoux/bundle/budoux-zh-hans.min.js"></script>

<!-- For Traditional Chinese -->
<script src="https://unpkg.com/budoux/bundle/budoux-zh-hant.min.js"></script>

<!-- For Thai -->
<script src="https://unpkg.com/budoux/bundle/budoux-th.min.js"></script>
```

Otherwise, if you wish to bundle the component with the rest of your source code,
you can import the component as shown below.

```javascript
// For Japanese
import 'budoux/module/webcomponents/budoux-ja';

// For Simplified Chinese
import 'budoux/module/webcomponents/budoux-zh-hans';

// For Traditional Chinese
import 'budoux/module/webcomponents/budoux-zh-hant';

// For Thai
import 'budoux/module/webcomponents/budoux-th';
```

### CLI

You can also format inputs on your terminal with the `budoux` command.

```shellsession
$ budoux 本日は晴天です。
本日は
晴天です。
```

```shellsession
$ echo $'本日は晴天です。\n明日は曇りでしょう。' | budoux
本日は
晴天です。
---
明日は
曇りでしょう。
```

```shellsession
$ budoux 本日は晴天です。 -H
<span style="word-break: keep-all; overflow-wrap: anywhere;">本日は\u200b晴天です。</span>
```

Please note that separators are shown as `\u200b` in the example above for
illustrative purposes; in the actual output the separator is an invisible
zero-width space.

To see the help message, run `budoux -h`.

```shellsession
$ budoux -h
Usage: budoux [-h] [-H] [-d STR] [-m JSON] [-V] [TXT]

BudouX is the successor to Budou, the machine learning powered line break organizer tool.

Arguments:
  txt                 text

Options:
  -H, --html          HTML mode (default: false)
  -d, --delim <str>   output delimiter in TEXT mode (default: "---")
  -m, --model <json>  custom model file path
  -V, --version       output the version number
  -h, --help          display help for command
```

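If you capture the CLI's TEXT-mode output in a script, the delimiter line lets you recover which phrases belong to which input sentence. The sketch below is one way to do that, based only on the output format shown in the examples above. `groupCliOutput` is a hypothetical helper, not part of BudouX.

```javascript
// Hypothetical helper (not part of BudouX): groups the lines of the CLI's
// TEXT-mode output into one phrase array per input sentence.
function groupCliOutput(output, delimiter = '---') {
  const groups = [[]];
  for (const line of output.split('\n')) {
    if (line === delimiter) {
      groups.push([]); // A delimiter line starts the next sentence.
    } else if (line !== '') {
      groups[groups.length - 1].push(line);
    }
  }
  // Drop empty groups, e.g. from a leading or trailing delimiter.
  return groups.filter((group) => group.length > 0);
}

// Sample text copied from the piped example above.
const sample = '本日は\n晴天です。\n---\n明日は\n曇りでしょう。\n';
console.log(JSON.stringify(groupCliOutput(sample)));
// [["本日は","晴天です。"],["明日は","曇りでしょう。"]]
```

A phrase that happens to equal the delimiter would be misread as a boundary, so pick a delimiter with `-d` that cannot occur in your text.
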
## Caveat

BudouX supports HTML inputs and outputs HTML strings with markup applied to wrap
phrases, but it's not meant to be used as an HTML sanitizer.
**BudouX doesn't sanitize any inputs.**
Malicious HTML inputs yield malicious HTML outputs.
Please use it with an appropriate sanitizer library if you don't trust the input.

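When the untrusted input is plain text rather than HTML, escaping it before it reaches the parser's HTML functions is one possible mitigation. The sketch below shows such an escape step. `escapeHTML` is a hypothetical helper, not part of BudouX, and it is not a substitute for a sanitizer library when the input is meant to contain markup.

```javascript
// Hypothetical helper (not part of BudouX): escapes HTML metacharacters so
// that untrusted plain text is treated as text rather than markup.
function escapeHTML(text) {
  const entities = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
  };
  return text.replace(/[&<>"']/g, (char) => entities[char]);
}

console.log(escapeHTML('<img src=x onerror="alert(1)">今日は天気です。'));
// &lt;img src=x onerror=&quot;alert(1)&quot;&gt;今日は天気です。
```
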
## Author

[Shuhei Iitsuka](https://tushuhei.com)

## Disclaimer

This is not an officially supported Google product.