# Keyword Extractor

[![Tests Status](https://github.com/michaeldelorenzo/keyword-extractor/workflows/test/badge.svg)](https://github.com/michaeldelorenzo/keyword-extractor/actions)

A simple [NPM package](https://npmjs.org/package/keyword-extractor) for extracting _keywords_ from a string by
removing stopwords.

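To make the idea of "removing stopwords" concrete, here is a minimal sketch of the technique in plain JavaScript. It is not the package's implementation: `SKETCH_STOPWORDS` and `extractKeywordsSketch` are hypothetical names invented for illustration, and the stopword list is a tiny stand-in for the full per-language lists the package uses.

```javascript
// Illustrative sketch only: NOT the package's implementation.
// SKETCH_STOPWORDS is a tiny hypothetical list used for the demo.
const SKETCH_STOPWORDS = new Set(["a", "the", "is", "of", "and", "to", "in", "that", "by"]);

function extractKeywordsSketch(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    // strip leading/trailing punctuation from each token
    .map((token) => token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, ""))
    // drop empty tokens and anything found in the stopword list
    .filter((token) => token.length > 0 && !SKETCH_STOPWORDS.has(token));
}

console.log(extractKeywordsSketch("The quick brown fox is the king of the forest."));
// prints [ 'quick', 'brown', 'fox', 'king', 'forest' ]
```

Whatever survives the filter is treated as a keyword, so the quality of the result depends almost entirely on the stopword list.
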
## Installation

```sh
$ npm install keyword-extractor
```

## Running tests

To run the test suite, first install the development dependencies by running the following command within the package's
directory.

```sh
$ npm install
```

To execute the package's tests, run:

```sh
$ make test
```

## Usage of the Module

```javascript
// Include the Keyword Extractor
const keyword_extractor = require("keyword-extractor");

// Opening sentence of the NY Times article at:
/*
http://www.nytimes.com/2013/09/10/world/middleeast/
surprise-russian-proposal-catches-obama-between-putin-and-house-republicans.html
*/
const sentence =
  "President Obama woke up Monday facing a Congressional defeat that many in both parties believed could hobble his presidency.";

// Extract the keywords
const extraction_result = keyword_extractor.extract(sentence, {
  language: "english",
  remove_digits: true,
  return_changed_case: true,
  remove_duplicates: false
});

/*
  extraction_result is:

  [
    "president",
    "obama",
    "woke",
    "monday",
    "facing",
    "congressional",
    "defeat",
    "parties",
    "believed",
    "hobble",
    "presidency"
  ]
*/
```

### Options Parameters

The second argument of the _extract_ method is an Object of configuration/processing settings for the extraction.

Parameter Name | Description | Permitted Values
---------------|-------------|-----------------
language | The stopwords list to use. | _english_, _spanish_, _polish_, _german_, _french_, _italian_, _dutch_, _romanian_, _russian_, _portuguese_, _swedish_, _arabic_, _persian_
remove_digits | Removes all digits from the results if set to _true_ (Arabic and Persian digits are handled too). | _true_ or _false_
return_changed_case | The case of the extracted keywords. If _true_, the results are returned lower-cased; if _false_, they keep their original case. | _true_ or _false_
return_chained_words | Instead of returning each word separately, joins words that were adjacent in the original text. If _true_, adjacent words are joined; if _false_, each word is returned as a separate array element. | _true_ or _false_
remove_duplicates | Removes duplicate keywords. | _true_, _false_ (defaults to _false_)
return_max_ngrams | Returns keywords as n-grams of size up to _integer_. | _integer_, _false_ (defaults to _false_)

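The effect of `return_chained_words` combined with `remove_duplicates` may be easier to picture with a small sketch. The code below is not the package's implementation and makes simplifying assumptions: `DEMO_STOPWORDS`, `chainKeywordsSketch`, and the `removeDuplicates` option are hypothetical names, the stopword list is a tiny stand-in, and only stopwords (not punctuation) are treated as chain boundaries.

```javascript
// Illustrative sketch only: NOT the package's implementation.
// DEMO_STOPWORDS is a tiny hypothetical list used for the demo.
const DEMO_STOPWORDS = new Set(["the", "of", "is", "a", "in", "and"]);

function chainKeywordsSketch(text, { removeDuplicates = false } = {}) {
  // split into lower-cased word tokens, ignoring punctuation
  const tokens = text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
  const chains = [];
  let current = [];
  for (const token of tokens) {
    if (DEMO_STOPWORDS.has(token)) {
      // a stopword ends the current run of adjacent keywords
      if (current.length > 0) chains.push(current.join(" "));
      current = [];
    } else {
      current.push(token);
    }
  }
  // flush the final run, if any
  if (current.length > 0) chains.push(current.join(" "));
  // a Set keeps the first occurrence of each chain, in order
  return removeDuplicates ? [...new Set(chains)] : chains;
}

const text = "The white house and the white house of the president";
console.log(chainKeywordsSketch(text));
// prints [ 'white house', 'white house', 'president' ]
console.log(chainKeywordsSketch(text, { removeDuplicates: true }));
// prints [ 'white house', 'president' ]
```

With the package itself, the corresponding call would pass `return_chained_words: true` and `remove_duplicates: true` in the options object of `extract`; the exact output depends on the package's own stopword list and tokenization.
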
## Credits

The initial stopwords lists are taken from the following sources:

- English: <http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop>
- Spanish: <https://stop-words.googlecode.com/svn/trunk/stop-words/stop-words/stop-words-spanish.txt>