# parse-domain

**Splits a hostname into subdomains, domain and (effective) top-level domains.**

[![Version on NPM](https://img.shields.io/npm/v/parse-domain?style=for-the-badge)](https://www.npmjs.com/package/parse-domain)
[![Semantically released](https://img.shields.io/badge/%20%20%F0%9F%93%A6%F0%9F%9A%80-semantic--release-e10079.svg?style=for-the-badge)](https://github.com/semantic-release/semantic-release)
[![Monthly downloads on NPM](https://img.shields.io/npm/dm/parse-domain?style=for-the-badge)](https://www.npmjs.com/package/parse-domain)<br>
[![NPM Bundle size minified](https://img.shields.io/bundlephobia/min/parse-domain?style=for-the-badge)](https://bundlephobia.com/result?p=parse-domain)
[![NPM Bundle size minified and gzipped](https://img.shields.io/bundlephobia/minzip/parse-domain?style=for-the-badge)](https://bundlephobia.com/result?p=parse-domain)<br>
[![License](https://img.shields.io/npm/l/parse-domain?style=for-the-badge)](./LICENSE)

Since domain name registrars organize their namespaces in different ways, it's not straightforward to split a hostname into subdomains, the domain and top-level domains. In order to do that **parse-domain** uses a [large list of known top-level domains](https://publicsuffix.org/list/public_suffix_list.dat) from [publicsuffix.org](https://publicsuffix.org/):

```javascript
import { parseDomain, ParseResultType } from "parse-domain";

const parseResult = parseDomain(
  // This should be a string with basic latin letters only.
  // More information below.
  "www.some.example.co.uk"
);

// Check if the domain is listed in the public suffix list
if (parseResult.type === ParseResultType.Listed) {
  const { subDomains, domain, topLevelDomains } = parseResult;

  console.log(subDomains); // ["www", "some"]
  console.log(domain); // "example"
  console.log(topLevelDomains); // ["co", "uk"]
} else {
  // Read more about other parseResult types below...
}
```

This package has been designed for modern Node and browser environments, supporting both CommonJS and ECMAScript modules. It assumes an ES2015 environment with [`Symbol()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Symbol), [`URL()`](https://developer.mozilla.org/en-US/docs/Web/API/URL) and [`TextDecoder()`](https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder) globally available. You need to transpile it down to ES5 (e.g. by using [Babel](https://babeljs.io/)) if you need to support older environments.

The list of top-level domains is stored in a [trie](https://en.wikipedia.org/wiki/Trie) data structure, with a matching serialization format, to keep lookups fast and the library size small. The library is side-effect free (this is important for proper [tree-shaking](https://webpack.js.org/guides/tree-shaking/)).

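To make the trie idea concrete, here is a minimal sketch of a suffix lookup. It is not the library's actual data structure or serialization format: the five-entry `SUFFIXES` list and the names `buildTrie` and `findPublicSuffix` are made up for illustration, and the wildcard and exception rules of the real public suffix list are ignored.

```javascript
// Hypothetical, tiny stand-in for the public suffix list.
const SUFFIXES = ["com", "uk", "co.uk", "io", "github.io"];

// Build a trie keyed by labels in reverse order (top-level label first).
function buildTrie(suffixes) {
  const root = { children: new Map(), isSuffix: false };

  for (const suffix of suffixes) {
    let node = root;

    for (const label of suffix.split(".").reverse()) {
      if (!node.children.has(label)) {
        node.children.set(label, { children: new Map(), isSuffix: false });
      }
      node = node.children.get(label);
    }
    node.isSuffix = true;
  }

  return root;
}

// Walk the hostname's labels from right to left and remember the
// longest path that ends on a listed suffix.
function findPublicSuffix(trie, hostname) {
  const labels = hostname.split(".");
  let node = trie;
  let longest = 0;

  for (let i = labels.length - 1; i >= 0; i--) {
    node = node.children.get(labels[i]);
    if (node === undefined) break;
    if (node.isSuffix) longest = labels.length - i;
  }

  return labels.slice(labels.length - longest);
}

const trie = buildTrie(SUFFIXES);

console.log(findPublicSuffix(trie, "www.some.example.co.uk")); // ["co", "uk"]
console.log(findPublicSuffix(trie, "parse-domain.github.io")); // ["github", "io"]
console.log(findPublicSuffix(trie, "this.is.not-listed")); // []
```

Because the labels are stored in reverse order, a lookup visits at most one node per label of the hostname, no matter how many suffixes the list contains.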
<br />

## Installation

```sh
npm install parse-domain
```

## Updates

💡 **Please note:** [publicsuffix.org](https://publicsuffix.org/) is updated several times per month. This package comes with a prebuilt list that has been downloaded at the time of `npm publish`. In order to get an up-to-date list, you should run `npx parse-domain-update` every time you start or build your application. This will download the latest list from `https://publicsuffix.org/list/public_suffix_list.dat`.

<br />

## Expected input

**⚠️ [`parseDomain`](#api-js-parseDomain) does not parse whole URLs**. You should only pass the [puny-encoded](https://en.wikipedia.org/wiki/Punycode) hostname section of the URL:

| ❌ Wrong                                       | ✅ Correct           |
| ---------------------------------------------- | -------------------- |
| `https://user@www.example.com:8080/path?query` | `www.example.com`    |
| `münchen.de`                                   | `xn--mnchen-3ya.de`  |
| `食狮.com.cn?query`                            | `xn--85x722f.com.cn` |

There is the utility function [`fromUrl`](#api-js-fromUrl) which tries to extract the hostname from a (partial) URL and puny-encodes it:

```javascript
import { parseDomain, fromUrl } from "parse-domain";

const { subDomains, domain, topLevelDomains } = parseDomain(
  fromUrl("https://www.münchen.de?query")
);

console.log(subDomains); // ["www"]
console.log(domain); // "xn--mnchen-3ya"
console.log(topLevelDomains); // ["de"]

// You can use the 'punycode' NPM package to decode the domain again
import { toUnicode } from "punycode";

console.log(toUnicode(domain)); // "münchen"
```

[`fromUrl`](#api-js-fromUrl) parses the URL using [`new URL()`](https://developer.mozilla.org/en-US/docs/Web/API/URL). Depending on your target environments you need to make sure that there is a [polyfill](https://www.npmjs.com/package/whatwg-url) for it. It's globally available in [all modern browsers](https://caniuse.com/#feat=url) (no IE) and in [Node v10 and later](https://nodejs.org/api/url.html#url_class_url).

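As a rough illustration of what the global `URL` constructor contributes here, the sketch below extracts a hostname from a complete URL. `hostnameFromUrl` is a hypothetical helper, not the library's `fromUrl`: it does not handle partial URLs and returns `undefined` instead of the `NO_HOSTNAME` symbol.

```javascript
// Hypothetical helper, not the library's fromUrl implementation.
function hostnameFromUrl(input) {
  try {
    // The URL parser lowercases the hostname and converts
    // international labels to their puny-encoded form.
    return new URL(input).hostname;
  } catch {
    // new URL() throws a TypeError for strings it cannot parse.
    return undefined;
  }
}

console.log(hostnameFromUrl("https://www.münchen.de?query")); // "www.xn--mnchen-3ya.de"
console.log(hostnameFromUrl("https://user@www.example.com:8080/path?query")); // "www.example.com"
console.log(hostnameFromUrl("not a url")); // undefined
```

User info, port, path and query are all dropped by reading `hostname`, which is why only that part should be handed to `parseDomain`.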
## Expected output

When parsing a hostname there are 5 possible results:

- it is invalid
- it is an IP address
- it is formally correct and the domain is
  - reserved
  - not listed in the public suffix list
  - listed in the public suffix list

[`parseDomain`](#api-js-parseDomain) returns a [`ParseResult`](#api-ts-ParseResult) with a `type` property that allows you to distinguish these cases.

### 👉 Invalid domains

The given input is first validated against [RFC 3696](https://datatracker.ietf.org/doc/html/rfc3696#section-2) (the domain labels are limited to basic latin letters, numbers and hyphens). If the validation fails, `parseResult.type` will be `ParseResultType.Invalid`:

```javascript
import { parseDomain, ParseResultType } from "parse-domain";

const parseResult = parseDomain("münchen.de");

console.log(parseResult.type === ParseResultType.Invalid); // true
```

Check out the [API](#api-ts-ValidationError) if you need more information about the validation error.

If you don't want the characters to be validated (e.g. because you need to allow underscores in hostnames), there's also a more relaxed validation mode (according to [RFC 2181](https://www.rfc-editor.org/rfc/rfc2181#section-11)).

```javascript
import { parseDomain, ParseResultType, Validation } from "parse-domain";

const parseResult = parseDomain("_jabber._tcp.gmail.com", {
  validation: Validation.Lax,
});

console.log(parseResult.type === ParseResultType.Listed); // true
```

See also [#134](https://github.com/peerigon/parse-domain/issues/134) for the discussion.

### 👉 IP addresses

If the given input is an IP address, `parseResult.type` will be `ParseResultType.Ip`:

```javascript
import { parseDomain, ParseResultType } from "parse-domain";

const parseResult = parseDomain("192.168.2.1");

console.log(parseResult.type === ParseResultType.Ip); // true
console.log(parseResult.ipVersion); // 4
```

It's debatable if a library for parsing domains should also accept IP addresses. In fact, you could argue that [`parseDomain`](#api-js-parseDomain) should reject an IP address as invalid. While this is true from a technical point of view, we decided to report IP addresses in a special way because we assume that a lot of people are using this library to make sense of an arbitrary hostname (see [#102](https://github.com/peerigon/parse-domain/issues/102)).

### 👉 Reserved domains

There are 5 top-level domains that are not listed in the public suffix list but reserved according to [RFC 6761](https://tools.ietf.org/html/rfc6761) and [RFC 6762](https://tools.ietf.org/html/rfc6762):

- `localhost`
- `local`
- `example`
- `invalid`
- `test`

In these cases, `parseResult.type` will be `ParseResultType.Reserved`:

```javascript
import { parseDomain, ParseResultType } from "parse-domain";

const parseResult = parseDomain("pecorino.local");

console.log(parseResult.type === ParseResultType.Reserved); // true
console.log(parseResult.labels); // ["pecorino", "local"]
```

### 👉 Domains that are not listed

If the given hostname is valid, but not listed in the downloaded public suffix list, `parseResult.type` will be `ParseResultType.NotListed`:

```javascript
import { parseDomain, ParseResultType } from "parse-domain";

const parseResult = parseDomain("this.is.not-listed");

console.log(parseResult.type === ParseResultType.NotListed); // true
console.log(parseResult.labels); // ["this", "is", "not-listed"]
```

If a domain is not listed, it can be caused by an outdated list. Make sure to [update the list once in a while](#updates).

⚠️ **Do not treat parseDomain as an authoritative answer.** It cannot replace a real DNS lookup to validate if a given domain is known in a certain network.

### 👉 Effective top-level domains

Technically, the term _top-level domain_ describes the very last domain in a hostname (`uk` in `example.co.uk`). Most people, however, use the term _top-level domain_ for the _public suffix_ which is a namespace ["under which Internet users can directly register names"](https://publicsuffix.org/).

Some examples of public suffixes:

- `com` in `example.com`
- `co.uk` in `example.co.uk`
- `co` in `example.co`
- but also `com.co` in `example.com.co`

If the hostname is listed in the public suffix list, the `parseResult.type` will be `ParseResultType.Listed`:

```javascript
import { parseDomain, ParseResultType } from "parse-domain";

const parseResult = parseDomain("example.co.uk");

console.log(parseResult.type === ParseResultType.Listed); // true
console.log(parseResult.labels); // ["example", "co", "uk"]
```

Now `parseResult` will also provide the properties `subDomains`, `domain` and `topLevelDomains`:

```javascript
const { subDomains, domain, topLevelDomains } = parseResult;

console.log(subDomains); // []
console.log(domain); // "example"
console.log(topLevelDomains); // ["co", "uk"]
```

### 👉 Switch over `parseResult.type` to distinguish between different parse results

We recommend switching over the `parseResult.type`:

```javascript
switch (parseResult.type) {
  case ParseResultType.Listed: {
    const { hostname, topLevelDomains } = parseResult;

    console.log(`${hostname} belongs to ${topLevelDomains.join(".")}`);
    break;
  }
  case ParseResultType.Reserved:
  case ParseResultType.NotListed: {
    const { hostname } = parseResult;

    console.log(`${hostname} is a reserved or unknown domain`);
    break;
  }
  default:
    // The hostname may be the NO_HOSTNAME symbol here,
    // so convert it explicitly before interpolating it.
    throw new Error(
      `${String(parseResult.hostname)} is an ip address or invalid domain`
    );
}
```

### ⚠️ Effective TLDs !== TLDs acknowledged by ICANN

What's surprising to a lot of people is that the definition of public suffix means that regular user domains can become effective top-level domains:

```javascript
const { subDomains, domain, topLevelDomains } = parseDomain(
  "parse-domain.github.io"
);

console.log(subDomains); // []
console.log(domain); // "parse-domain"
console.log(topLevelDomains); // ["github", "io"] 🤯
```

In this case, `github.io` is nothing other than a private domain name registrar. `github.io` is the _effective_ top-level domain and browsers treat it as such (e.g. for setting [`document.domain`](https://developer.mozilla.org/en-US/docs/Web/API/Document/domain)).

If you want to deviate from the browser's understanding of a top-level domain and you're only interested in top-level domains acknowledged by [ICANN](https://en.wikipedia.org/wiki/ICANN), there's an `icann` property:

```javascript
const parseResult = parseDomain("parse-domain.github.io");
const { subDomains, domain, topLevelDomains } = parseResult.icann;

console.log(subDomains); // ["parse-domain"]
console.log(domain); // "github"
console.log(topLevelDomains); // ["io"]
```

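The two views differ only in how many trailing labels count as the suffix. The sketch below shows that relationship on plain arrays; `splitLabels` is a hypothetical helper and not part of the parse-domain API.

```javascript
// Hypothetical helper: split labels given the number of trailing labels
// that form the suffix. Not part of the parse-domain API.
function splitLabels(labels, suffixLength) {
  const cut = labels.length - suffixLength;
  const rest = labels.slice(0, cut);
  const topLevelDomains = labels.slice(cut);
  // pop() returns undefined when no label is left in front of the suffix.
  const domain = rest.pop();

  return { subDomains: rest, domain, topLevelDomains };
}

const labels = ["parse-domain", "github", "io"];

// Effective top-level domain "github.io" (two labels)
console.log(splitLabels(labels, 2));
// { subDomains: [], domain: "parse-domain", topLevelDomains: ["github", "io"] }

// ICANN top-level domain "io" (one label)
console.log(splitLabels(labels, 1));
// { subDomains: ["parse-domain"], domain: "github", topLevelDomains: ["io"] }
```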
### ⚠️ `domain` can also be `undefined`

```javascript
const { subDomains, domain, topLevelDomains } = parseDomain("co.uk");

console.log(subDomains); // []
console.log(domain); // undefined
console.log(topLevelDomains); // ["co", "uk"]
```

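Code that joins `domain` and `topLevelDomains` back together should account for this case. A minimal sketch, assuming an object shaped like a listed parse result; `registrableDomain` is a hypothetical helper name:

```javascript
// Hypothetical helper; expects an object shaped like ParseResultListed.
function registrableDomain({ domain, topLevelDomains }) {
  if (domain === undefined) {
    // The hostname is itself a public suffix, e.g. "co.uk".
    return undefined;
  }

  return [domain, ...topLevelDomains].join(".");
}

console.log(
  registrableDomain({ domain: "example", topLevelDomains: ["co", "uk"] })
); // "example.co.uk"
console.log(
  registrableDomain({ domain: undefined, topLevelDomains: ["co", "uk"] })
); // undefined
```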
### ⚠️ `""` is a valid (but reserved) domain

The empty string `""` represents the [DNS root](https://en.wikipedia.org/wiki/DNS_root_zone) and is considered to be valid. `parseResult.type` will be `ParseResultType.Reserved` in that case:

```javascript
const { type, subDomains, domain, topLevelDomains } = parseDomain("");

console.log(type === ParseResultType.Reserved); // true
console.log(subDomains); // []
console.log(domain); // undefined
console.log(topLevelDomains); // []
```

## API

🧩 = JavaScript export<br>
🧬 = TypeScript export

<h3 id="api-js-parseDomain">
🧩 <code>export parseDomain(hostname: string | typeof <a href="#api-js-NO_HOSTNAME">NO_HOSTNAME</a>, options?: <a href="#api-ts-ParseDomainOptions">ParseDomainOptions</a>): <a href="#api-ts-ParseResult">ParseResult</a></code>
</h3>

Takes a hostname (e.g. `"www.example.com"`) and returns a [`ParseResult`](#api-ts-ParseResult). The hostname must only contain basic latin letters, digits, hyphens and dots. International hostnames must be puny-encoded. Does not throw an error, even with invalid input.

```javascript
import { parseDomain } from "parse-domain";

const parseResult = parseDomain("www.example.com");
```

Use `Validation.Lax` if you want to allow all characters:

```javascript
import { parseDomain, Validation } from "parse-domain";

const parseResult = parseDomain("_jabber._tcp.gmail.com", {
  validation: Validation.Lax,
});
```

<h3 id="api-js-fromUrl">
🧩 <code>export fromUrl(input: string): string | typeof <a href="#api-js-NO_HOSTNAME">NO_HOSTNAME</a></code>
</h3>

Takes a URL-like string and tries to extract the hostname. Requires the global [`URL` constructor](https://developer.mozilla.org/en-US/docs/Web/API/URL) to be available on the platform. Returns the [`NO_HOSTNAME`](#api-js-NO_HOSTNAME) symbol if the input was not a string or the hostname could not be extracted. Take a look [at the test suite](/src/from-url.test.ts) for some examples. Does not throw an error, even with invalid input.

<h3 id="api-js-NO_HOSTNAME">
🧩 <code>export NO_HOSTNAME: unique symbol</code>
</h3>

`NO_HOSTNAME` is a symbol that is returned by [`fromUrl`](#api-js-fromUrl) when it was not able to extract a hostname from the given string. When passed to [`parseDomain`](#api-js-parseDomain), it will always yield a [`ParseResultInvalid`](#api-ts-ParseResultInvalid).

<h3 id="api-ts-ParseDomainOptions">
🧬 <code>export type ParseDomainOptions</code>
</h3>

```ts
export type ParseDomainOptions = {
  /**
   * If no validation is specified, Validation.Strict will be used.
   **/
  validation?: Validation;
};
```

<h3 id="api-js-Validation">
🧩 <code>export Validation</code>
</h3>

An object that holds all possible [Validation](#api-ts-Validation) `validation` values:

```javascript
export const Validation = {
  /**
   * Allows any octets as labels
   * but still restricts the length of labels and the overall domain.
   *
   * @see https://www.rfc-editor.org/rfc/rfc2181#section-11
   **/
  Lax: "LAX",

  /**
   * Only allows ASCII letters, digits and hyphens (aka LDH),
   * forbids hyphens at the beginning or end of a label
   * and requires top-level domain names not to be all-numeric.
   *
   * This is the default if no validation is configured.
   *
   * @see https://datatracker.ietf.org/doc/html/rfc3696#section-2
   */
  Strict: "STRICT",
};
```

<h3 id="api-ts-Validation">
🧬 <code>export Validation</code>
</h3>

This type represents all possible `validation` values.

<h3 id="api-ts-ParseResult">
🧬 <code>export ParseResult</code>
</h3>

A `ParseResult` is either a [`ParseResultInvalid`](#api-ts-ParseResultInvalid), [`ParseResultIp`](#api-ts-ParseResultIp), [`ParseResultReserved`](#api-ts-ParseResultReserved), [`ParseResultNotListed`](#api-ts-ParseResultNotListed) or [`ParseResultListed`](#api-ts-ParseResultListed).

All parse results have a `type` property that is either `"INVALID"`, `"IP"`, `"RESERVED"`, `"NOT_LISTED"` or `"LISTED"`. Use the exported [ParseResultType](#api-js-ParseResultType) to check for the type instead of checking against string literals.

All parse results also have a `hostname` property that provides access to the sanitized hostname that was passed to [`parseDomain`](#api-js-parseDomain).

<h3 id="api-js-ParseResultType">
🧩 <code>export ParseResultType</code>
</h3>

An object that holds all possible [ParseResult](#api-ts-ParseResult) `type` values:

```javascript
const ParseResultType = {
  Invalid: "INVALID",
  Ip: "IP",
  Reserved: "RESERVED",
  NotListed: "NOT_LISTED",
  Listed: "LISTED",
};
```

<h3 id="api-ts-ParseResultType">
🧬 <code>export ParseResultType</code>
</h3>

This type represents all possible [ParseResult](#api-ts-ParseResult) `type` values.

<h3 id="api-ts-ParseResultInvalid">
🧬 <code>export ParseResultInvalid</code>
</h3>

Describes the shape of the parse result that is returned when the given hostname does not adhere to [RFC 1034](https://tools.ietf.org/html/rfc1034):

- The hostname is not a string
- The hostname is longer than 253 characters
- A domain label is shorter than 1 character
- A domain label is longer than 63 characters
- A domain label contains a character that is not a basic latin character, digit or hyphen

```ts
type ParseResultInvalid = {
  type: ParseResultType.Invalid;
  hostname: string | typeof NO_HOSTNAME;
  errors: Array<ValidationError>;
};

// Example

{
  type: "INVALID",
  hostname: ".com",
  errors: [...]
}
```

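The checks listed above can be sketched in a few lines. This is only an illustration, not the library's validator: `validateStrict` is a hypothetical name, the mapping of each check to a documented `ValidationErrorType` value is an assumption, and the additional strict rules (hyphen placement, all-numeric top-level labels) are left out.

```javascript
// Hypothetical validator covering only the checks listed above.
// The returned strings reuse the documented ValidationErrorType values;
// which value the library reports for each check is an assumption.
function validateStrict(hostname) {
  if (typeof hostname !== "string") return ["NO_HOSTNAME"];
  // The empty string stands for the DNS root and is considered valid.
  if (hostname === "") return [];

  const errors = [];

  if (hostname.length > 253) errors.push("DOMAIN_MAX_LENGTH");

  for (const label of hostname.split(".")) {
    if (label.length < 1) {
      errors.push("LABEL_MIN_LENGTH");
    } else if (label.length > 63) {
      errors.push("LABEL_MAX_LENGTH");
    } else if (!/^[a-z0-9-]+$/i.test(label)) {
      errors.push("LABEL_INVALID_CHARACTER");
    }
  }

  return errors;
}

console.log(validateStrict("www.example.com")); // []
console.log(validateStrict(".com")); // ["LABEL_MIN_LENGTH"]
console.log(validateStrict("münchen.de")); // ["LABEL_INVALID_CHARACTER"]
```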
<h3 id="api-ts-ValidationError">
🧬 <code>export ValidationError</code>
</h3>

Describes the shape of a validation error as returned by [`parseDomain`](#api-js-parseDomain):

```ts
type ValidationError = {
  type: ValidationErrorType;
  message: string;
  column: number;
};

// Example

{
  type: "LABEL_MIN_LENGTH",
  message: `Label "" is too short. Label is 0 octets long but should be at least 1.`,
  column: 1,
}
```

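One way to surface these errors, for example in a log line, is sketched below. `formatErrors` is a hypothetical helper, and `invalidResult` is a hand-written object based on the example above rather than real library output.

```javascript
// Hypothetical helper; expects an object shaped like ParseResultInvalid.
function formatErrors(parseResult) {
  return parseResult.errors.map(
    ({ type, message, column }) => `[${type}] column ${column}: ${message}`
  );
}

// Hand-written result object based on the documented example.
const invalidResult = {
  type: "INVALID",
  hostname: ".com",
  errors: [
    {
      type: "LABEL_MIN_LENGTH",
      message: `Label "" is too short. Label is 0 octets long but should be at least 1.`,
      column: 1,
    },
  ],
};

for (const line of formatErrors(invalidResult)) {
  console.log(line);
}
```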
<h3 id="api-js-ValidationErrorType">
🧩 <code>export ValidationErrorType</code>
</h3>

An object that holds all possible [ValidationError](#api-ts-ValidationError) `type` values:

```javascript
const ValidationErrorType = {
  NoHostname: "NO_HOSTNAME",
  DomainMaxLength: "DOMAIN_MAX_LENGTH",
  LabelMinLength: "LABEL_MIN_LENGTH",
  LabelMaxLength: "LABEL_MAX_LENGTH",
  LabelInvalidCharacter: "LABEL_INVALID_CHARACTER",
  LastLabelInvalid: "LAST_LABEL_INVALID",
};
```

<h3 id="api-ts-ValidationErrorType">
🧬 <code>export ValidationErrorType</code>
</h3>

This type represents all possible `type` values of a [ValidationError](#api-ts-ValidationError).

<h3 id="api-ts-ParseResultIp">
🧬 <code>export ParseResultIp</code>
</h3>

This type describes the shape of the parse result that is returned when the given hostname was an IPv4 or IPv6 address.

```ts
type ParseResultIp = {
  type: ParseResultType.Ip;
  hostname: string;
  ipVersion: 4 | 6;
};

// Example

{
  type: "IP",
  hostname: "192.168.0.1",
  ipVersion: 4
}
```

According to [RFC 3986](https://tools.ietf.org/html/rfc3986#section-3.2.2), IPv6 addresses need to be surrounded by `[` and `]` in URLs. [`parseDomain`](#api-js-parseDomain) accepts IPv6 addresses both with and without square brackets:

```js
// Recognized as IPv4 address
parseDomain("192.168.0.1");
// Both are recognized as proper IPv6 addresses
parseDomain("::");
parseDomain("[::]");
```

<h3 id="api-ts-ParseResultReserved">
🧬 <code>export ParseResultReserved</code>
</h3>

This type describes the shape of the parse result that is returned when the given hostname

- is the root domain (the empty string `""`)
- belongs to the top-level domain `localhost`, `local`, `example`, `invalid` or `test`

```ts
type ParseResultReserved = {
  type: ParseResultType.Reserved;
  hostname: string;
  labels: Array<string>;
};

// Example

{
  type: "RESERVED",
  hostname: "pecorino.local",
  labels: ["pecorino", "local"]
}
```

⚠️ Reserved IPs, such as `127.0.0.1`, will not be reported as reserved, but as [`ParseResultIp`](#api-ts-ParseResultIp). See [#117](https://github.com/peerigon/parse-domain/issues/117).

<h3 id="api-ts-ParseResultNotListed">
🧬 <code>export ParseResultNotListed</code>
</h3>

Describes the shape of the parse result that is returned when the given hostname is valid and does not belong to a reserved top-level domain, but is not listed in the downloaded public suffix list.

```ts
type ParseResultNotListed = {
  type: ParseResultType.NotListed;
  hostname: string;
  labels: Array<string>;
};

// Example

{
  type: "NOT_LISTED",
  hostname: "this.is.not-listed",
  labels: ["this", "is", "not-listed"]
}
```

<h3 id="api-ts-ParseResultListed">
🧬 <code>export ParseResultListed</code>
</h3>

Describes the shape of the parse result that is returned when the given hostname belongs to a top-level domain that is listed in the public suffix list.

```ts
type ParseResultListed = {
  type: ParseResultType.Listed;
  hostname: string;
  labels: Array<string>;
  subDomains: Array<string>;
  domain: string | undefined;
  topLevelDomains: Array<string>;
  icann: {
    subDomains: Array<string>;
    domain: string | undefined;
    topLevelDomains: Array<string>;
  };
};

// Example

{
  type: "LISTED",
  hostname: "parse-domain.github.io",
  labels: ["parse-domain", "github", "io"],
  subDomains: [],
  domain: "parse-domain",
  topLevelDomains: ["github", "io"],
  icann: {
    subDomains: ["parse-domain"],
    domain: "github",
    topLevelDomains: ["io"]
  }
}
```

## License

MIT

## Sponsors

[<img src="https://assets.peerigon.com/peerigon/logo/peerigon-logo-flat-spinat.png" width="150" />](https://peerigon.com)