natural
=======

[![NPM version](https://img.shields.io/npm/v/natural.svg)](https://www.npmjs.com/package/natural)
[![Build Status](https://travis-ci.org/NaturalNode/natural.png?branch=master)](https://travis-ci.org/NaturalNode/natural)
[![Slack](https://slack.bri.im/badge.svg)](https://slack.bri.im)

"Natural" is a general natural language facility for nodejs. Tokenizing,
stemming, classification, phonetics, tf-idf, WordNet, string similarity,
and some inflections are currently supported.

It's still in the early stages, so we're very interested in bug reports,
contributions and the like.

Note that many algorithms from Rob Ellis's [node-nltools](https://github.com/NaturalNode/node-nltools) are
being merged into this project and will be maintained from here onward.

While most of the algorithms are English-specific, contributors have implemented support for other languages. Thanks to Polyakov Vladimir, Russian stemming has been added! Thanks to David Przybilla, Spanish stemming has been added! Thanks to [even more contributors](https://github.com/NaturalNode/natural/graphs/contributors), stemming and tokenizing in more languages have been added.

Aside from this README, the only documentation is [this DZone article](http://www.dzone.com/links/r/using_natural_a_nlp_module_for_nodejs.html), [this course on Egghead.io](https://egghead.io/courses/natural-language-processing-in-javascript-with-natural), and [here on my blog](http://www.chrisumbel.com/article/node_js_natural_language_porter_stemmer_lancaster_bayes_naive_metaphone_soundex). The README is up to date; the other sources are somewhat outdated.

### TABLE OF CONTENTS

* [Installation](#installation)
* [Tokenizers](#tokenizers)
* [String Distance](#string-distance)
* [Approximate String Matching](#approximate-string-matching)
* [Stemmers](#stemmers)
* [Classifiers](#classifiers)
  * [Bayesian and logistic regression](#bayesian-and-logistic-regression)
  * [Maximum Entropy Classifier](#maximum-entropy-classifier)
* [Sentiment Analysis](#sentiment-analysis)
* [Phonetics](#phonetics)
* [Inflectors](#inflectors)
* [N-Grams](#n-grams)
* [tf-idf](#tf-idf)
* [Tries](#tries)
* [EdgeWeightedDigraph](#edgeweighteddigraph)
* [ShortestPathTree](#shortestpathtree)
* [LongestPathTree](#longestpathtree)
* [WordNet](#wordnet)
* [Spellcheck](#spellcheck)
* [POS Tagger](#pos-tagger)
* [Development](#development)
* [License](#license)

## Installation

If you're just looking to use natural within your own node application,
you can install via NPM like so:

    npm install natural

If you're interested in contributing to natural, or just hacking on it, then by all
means fork away!

## Tokenizers

Word, Regexp, and [Treebank tokenizers](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html) are provided for breaking text up into
arrays of tokens:

```javascript
var natural = require('natural');
var tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize("your dog has fleas."));
// [ 'your', 'dog', 'has', 'fleas' ]
```

The other tokenizers follow a similar pattern:

```javascript
tokenizer = new natural.TreebankWordTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my', 'dog', 'has', 'n\'t', 'any', 'fleas', '.' ]

tokenizer = new natural.RegexpTokenizer({pattern: /\-/});
console.log(tokenizer.tokenize("flea-dog"));
// [ 'flea', 'dog' ]

tokenizer = new natural.WordPunctTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my', 'dog', 'hasn', '\'', 't', 'any', 'fleas', '.' ]

tokenizer = new natural.OrthographyTokenizer({language: "fi"});
console.log(tokenizer.tokenize("Mikä sinun nimesi on?"));
// [ 'Mikä', 'sinun', 'nimesi', 'on' ]
```

Overview of available tokenizers:

| Tokenizer               | Language    | Explanation |
|:------------------------|:------------|:------------------------------------------------------------------------|
| WordTokenizer           | Any         | Splits on anything except alphabetic characters, digits and underscore |
| WordPunctTokenizer      | Any         | Splits on anything except alphabetic characters, digits, punctuation and underscore |
| SentenceTokenizer       | Any         | Breaks a string up into parts based on punctuation and quotation marks |
| CaseTokenizer           | Any?        | If the lower- and upper-case forms of a character are identical, the character is assumed to be whitespace or punctuation |
| RegexpTokenizer         | Any         | Splits on a regular expression that either defines sequences of word characters or gap characters |
| OrthographyTokenizer    | Finnish     | Splits on anything except alphabetic characters, digits and underscore |
| TreebankWordTokenizer   | Any         | |
| AggressiveTokenizer     | English     | |
| AggressiveTokenizerFa   | Farsi       | |
| AggressiveTokenizerFr   | French      | |
| AggressiveTokenizerRu   | Russian     | |
| AggressiveTokenizerEs   | Spanish     | |
| AggressiveTokenizerIt   | Italian     | |
| AggressiveTokenizerPl   | Polish      | |
| AggressiveTokenizerPt   | Portuguese  | |
| AggressiveTokenizerNo   | Norwegian   | |
| AggressiveTokenizerSv   | Swedish     | |
| AggressiveTokenizerVi   | Vietnamese  | |
| AggressiveTokenizerId   | Indonesian  | |
| TokenizerJa             | Japanese    | |
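
For instance, text can be split into sentences with the SentenceTokenizer (a small sketch; the exact token boundaries may vary slightly between versions):

```javascript
var natural = require('natural');
var tokenizer = new natural.SentenceTokenizer();
console.log(tokenizer.tokenize("This is a sentence. Is this another one?"));
// [ 'This is a sentence.', 'Is this another one?' ]
```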

## String Distance

Natural provides implementations of four algorithms for calculating string distance: Hamming distance, Jaro-Winkler, Levenshtein distance, and the Dice coefficient.

[Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) measures the distance between two strings of equal length by counting the number of differing characters. The third parameter indicates whether case should be ignored. By default the algorithm is case sensitive.
```javascript
var natural = require('natural');
console.log(natural.HammingDistance("karolin", "kathrin", false));
console.log(natural.HammingDistance("karolin", "kerstin", false));
// If strings differ in length -1 is returned
console.log(natural.HammingDistance("short string", "longer string", false));
```

Output:
```javascript
3
3
-1
```

The [Jaro–Winkler](http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) string distance algorithm returns a number between 0 and 1 which tells how closely the strings match (0 = no match at all, 1 = exact match):

```javascript
var natural = require('natural');
console.log(natural.JaroWinklerDistance("dixon", "dicksonx"));
console.log(natural.JaroWinklerDistance('not', 'same'));
```

Output:

```javascript
0.7466666666666666
0
```

If the Jaro distance between the strings is already known, you can pass it as the third parameter. You can force the algorithm to ignore case by passing `true` as the fourth parameter:
```javascript
var natural = require('natural');
console.log(natural.JaroWinklerDistance("dixon", "dicksonx", undefined, true));
```

Natural also offers support for [Levenshtein](https://en.wikipedia.org/wiki/Levenshtein_distance) distances:

```javascript
var natural = require('natural');
console.log(natural.LevenshteinDistance("ones", "onez"));
console.log(natural.LevenshteinDistance('one', 'one'));
```

Output:

```javascript
1
0
```

The costs of the three edit operations are modifiable for Levenshtein:

```javascript
console.log(natural.LevenshteinDistance("ones", "onez", {
    insertion_cost: 1,
    deletion_cost: 1,
    substitution_cost: 1
}));
```

Output:

```javascript
1
```

Full Damerau-Levenshtein matching can be used if you want to consider character transpositions as a valid edit operation.

```javascript
console.log(natural.DamerauLevenshteinDistance("az", "za"));
```

Output:
```javascript
1
```

The transposition cost can be modified as well:

```javascript
console.log(natural.DamerauLevenshteinDistance("az", "za", { transposition_cost: 0 }));
```

Output:
```javascript
0
```

A restricted form of Damerau-Levenshtein (Optimal String Alignment) is available.

This form of matching is more space-efficient than unrestricted Damerau-Levenshtein, because it only considers a transposition if there are no characters between the transposed characters.

Comparison:

```javascript
// Optimal String Alignment
console.log(natural.DamerauLevenshteinDistance('ABC', 'ACB', { restricted: true }));
// 1
console.log(natural.DamerauLevenshteinDistance('CA', 'ABC', { restricted: true }));
// 2
// Unrestricted Damerau-Levenshtein
console.log(natural.DamerauLevenshteinDistance('CA', 'ABC', { restricted: false }));
// 1
```

And [Dice's co-efficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient):

```javascript
var natural = require('natural');
console.log(natural.DiceCoefficient('thing', 'thing'));
console.log(natural.DiceCoefficient('not', 'same'));
```

Output:

```javascript
1
0
```

## Approximate String Matching

Currently matching is supported via the Levenshtein algorithm.

```javascript
var natural = require('natural');
var source = 'The RainCoat BookStore';
var target = 'All the best books are here at the Rain Coats Book Store';

console.log(natural.LevenshteinDistance(source, target, {search: true}));
```

Output:

```javascript
{ substring: 'the Rain Coats Book Store', distance: 4 }
```

## Stemmers

Currently stemming is supported via the [Porter](http://tartarus.org/martin/PorterStemmer/index.html) and [Lancaster](http://www.comp.lancs.ac.uk/computing/research/stemming/) (Paice/Husk) algorithms. The Indonesian and Japanese stemmers do not follow a known algorithm.

```javascript
var natural = require('natural');
```

This example uses a Porter stemmer. "word" is returned.

```javascript
console.log(natural.PorterStemmer.stem("words")); // stem a single word
```

In Russian:

```javascript
console.log(natural.PorterStemmerRu.stem("падший"));
```

In Spanish:

```javascript
console.log(natural.PorterStemmerEs.stem("jugaría"));
```

The following stemmers are available:

| Language | Porter | Lancaster | Other | Module |
|:------------- |:-----------:|:---------:|:---------:|:------------------|
| Dutch | X | | | `PorterStemmerNl` |
| English | X | | | `PorterStemmer` |
| English | | X | | `LancasterStemmer` |
| Farsi (in progress) | X | | | `PorterStemmerFa` |
| French | X | | | `PorterStemmerFr` |
| Indonesian | | | X | `StemmerId` |
| Italian | X | | | `PorterStemmerIt` |
| Japanese | | | X | `StemmerJa` |
| Norwegian | X | | | `PorterStemmerNo` |
| Portuguese | X | | | `PorterStemmerPt` |
| Russian | X | | | `PorterStemmerRu` |
| Spanish | X | | | `PorterStemmerEs` |
| Swedish | X | | | `PorterStemmerSv` |
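
Each module in the table is used the same way. For example, the Dutch stemmer (a sketch; the exact stem returned depends on the algorithm's rules):

```javascript
console.log(natural.PorterStemmerNl.stem("lichamelijke")); // stems a single Dutch word
```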

`attach()` patches `stem()` and `tokenizeAndStem()` to String as a shortcut to
`PorterStemmer.stem(token)`. `tokenizeAndStem()` breaks text up into single words
and returns an array of stemmed tokens.

```javascript
natural.PorterStemmer.attach();
console.log("i am waking up to the sounds of chainsaws".tokenizeAndStem());
console.log("chainsaws".stem());
```

The same thing can be done with a Lancaster stemmer:

```javascript
natural.LancasterStemmer.attach();
console.log("i am waking up to the sounds of chainsaws".tokenizeAndStem());
console.log("chainsaws".stem());
```

## Classifiers

### Bayesian and logistic regression

Two classifiers are currently supported, [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [logistic regression](http://en.wikipedia.org/wiki/Logistic_regression).
The following examples use the BayesClassifier class, but the
LogisticRegressionClassifier class could be substituted instead.

```javascript
var natural = require('natural');
var classifier = new natural.BayesClassifier();
```

You can train the classifier on sample text. It will use reasonable defaults to
tokenize and stem the text.

```javascript
classifier.addDocument('i am long qqqq', 'buy');
classifier.addDocument('buy the q\'s', 'buy');
classifier.addDocument('short gold', 'sell');
classifier.addDocument('sell gold', 'sell');

classifier.train();
```

Outputs "sell":

```javascript
console.log(classifier.classify('i am short silver'));
```

Outputs "buy":

```javascript
console.log(classifier.classify('i am long copper'));
```

You have access to the set of matched classes and the associated values from the classifier:

```javascript
console.log(classifier.getClassifications('i am long copper'));
```

Outputs:

```javascript
[ { label: 'buy', value: 0.39999999999999997 },
  { label: 'sell', value: 0.19999999999999998 } ]
```

The classifier can also be trained with and can classify arrays of tokens, strings, or
any mixture of the two. Arrays let you use entirely custom data with your own
tokenization/stemming, if you choose to implement it.

```javascript
classifier.addDocument(['sell', 'gold'], 'sell');
```

The training process can be monitored by subscribing to the event `trainedWithDocument` that's emitted by the classifier. This event is emitted each time a document has finished being trained against:
```javascript
classifier.events.on('trainedWithDocument', function (obj) {
  console.log(obj);
  /* {
  *   total: 23 // There are 23 total documents being trained against
  *   index: 12 // The index/number of the document that's just been trained against
  *   doc: {...} // The document that has just been indexed
  * }
  */
});
```
A classifier can also be persisted and recalled so you can reuse a training run:

```javascript
classifier.save('classifier.json', function(err, classifier) {
    // the classifier is saved to the classifier.json file!
});
```

To recall from the classifier.json saved above:

```javascript
natural.BayesClassifier.load('classifier.json', null, function(err, classifier) {
    console.log(classifier.classify('long SUNW'));
    console.log(classifier.classify('short SUNW'));
});
```

A classifier can also be serialized and deserialized like so:

```javascript
var classifier = new natural.BayesClassifier();
classifier.addDocument(['sell', 'gold'], 'sell');
classifier.addDocument(['buy', 'silver'], 'buy');

// serialize
var raw = JSON.stringify(classifier);
// deserialize
var restoredClassifier = natural.BayesClassifier.restore(JSON.parse(raw));
console.log(restoredClassifier.classify('i should sell that'));
```

__Note:__ if using the classifier for languages other than English you may need
to pass in the stemmer to use. In fact, you can do this for any stemmer, including
alternate English stemmers. The default is the `PorterStemmer`.

```javascript
const PorterStemmerRu = require('./node_modules/natural/lib/natural/stemmers/porter_stemmer_ru');
var classifier = new natural.BayesClassifier(PorterStemmerRu);
```

### Maximum Entropy Classifier

This module provides a classifier based on maximum entropy modelling. The central idea of maximum entropy modelling is to estimate a probability distribution that has maximum entropy subject to the evidence that is available. This means that the distribution follows the data it has "seen" but does not make any assumptions beyond that.

The module is not specific to natural language processing, or any other application domain. There are few requirements with regard to the data structure it can be trained on. For training, it needs a sample that consists of elements. These elements have two parts:
* part a: the class of the element
* part b: the context of the element

The classifier will, once trained, return the most probable class for a particular context.

We start with an explanation of samples and elements. You have to create your own specialisation of the Element class. Your element class should implement the generateFeatures method for inferring feature functions from the sample.
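
A minimal sketch of such a specialisation, assuming the module's `Element` base class takes the class and context as constructor arguments and that `generateFeatures` fills an array with `Feature` objects (see the SE_Element example in the spec folder for the authoritative pattern):

```javascript
var Element = require('Element');
var Feature = require('Feature');

function MyElement(a, b) {
  Element.call(this, a, b);
}
MyElement.prototype = Object.create(Element.prototype);

// Derive feature functions from this element's context
MyElement.prototype.generateFeatures = function(featureFunctions) {
  featureFunctions.push(new Feature(function(x) {
    return (x.b.data === "0") ? 1 : 0;
  }, "is0", ["0"]));
};
```
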
#### Samples and elements

Elements and contexts are created as follows:

```javascript
var MyElement = require('MyElementClass');
var Context = require('Context');
var Sample = require('Sample');

var x = new MyElement("x", new Context("0"));
// A sample is created from an array of elements
var sample = new Sample();
sample.addElement(x);
```
A class is a string; contexts may be as complex as you want, as long as they can be serialised.

A sample can be saved to a file:
```javascript
sample.save('sample.json', function(error, sample) {
  ...
});
```
And read back from a file as follows:

```javascript
sample.load('sample.json', MyElementClass, function(err, sample) {
  ...
});
```
You have to pass the element class to the load method so that the right element objects can be created from the data.

#### Features and feature sets

Features are functions that map elements to zero or one. Features are defined as follows:
```javascript
var Feature = require('Feature');

function f(x) {
  if (x.b === "0") {
    return 1;
  }
  return 0;
}

var feature = new Feature(f, name, parameters);
```
<code>name</code> is a string for the name of the feature function, <code>parameters</code> is an array of strings for the parameters of the feature function. The combination of name and parameters should uniquely distinguish features from each other. Features that are added to a feature set are tested for uniqueness using these properties.

A feature set is created like this:
```javascript
var FeatureSet = require('FeatureSet');

var set = new FeatureSet();
set.addFeature(new Feature(f, "f", ["0"]));
```

In most cases you will generate feature functions using closures, for instance when you generate feature functions in a loop that iterates through an array:
```javascript
var FeatureSet = require('FeatureSet');
var Feature = require('Feature');

var listOfTags = ['NN', 'DET', 'PREP', 'ADJ'];
var featureSet = new FeatureSet();

listOfTags.forEach(function(tag) {
  function isTag(x) {
    if (x.b.data.tag === tag) {
      return 1;
    }
    return 0;
  }
  featureSet.addFeature(new Feature(isTag, "isTag", [tag]));
});
```
In this example you create feature functions that each have a different value for <code>tag</code> in their closure.

#### Setting up and training the classifier

A classifier needs the following parameters:
* Classes: an array of classes (strings)
* Features: an array of feature functions
* Sample: a sample of elements for training the classifier

A classifier can be created as follows:
```javascript
var Classifier = require('Classifier');
var classifier = new Classifier(classes, featureSet, sample);
```
And it starts training with:
```javascript
var maxIterations = 100;
var minImprovement = .01;
var p = classifier.train(maxIterations, minImprovement);
```
Training is finished when either <code>maxIterations</code> is reached or the improvement in likelihood (of the sample) becomes smaller than <code>minImprovement</code>. It returns a probability distribution that can be stored and retrieved for later usage:
```javascript
classifier.save('classifier.json', function(err, c) {
  if (err) {
    console.log(err);
  }
  else {
    // Continue using the classifier
  }
});

classifier.load('classifier.json', function(err, c) {
  if (err) {
    console.log(err);
  }
  else {
    // Use the classifier
  }
});
```

The training algorithm is based on Generalised Iterative Scaling.

#### Applying the classifier

The classifier can be used to classify contexts in two ways. To get the probabilities for all classes:
```javascript
var classifications = classifier.getClassifications(context);
classifications.forEach(function(classPlusProbability) {
  console.log('Class ' + classPlusProbability.label + ' has score ' + classPlusProbability.value);
});
```
This returns a list of classes with their probabilities.
To get the highest scoring class (note that <code>class</code> is a reserved word in JavaScript, so the result is assigned to <code>category</code> here):
```javascript
var category = classifier.classify(context);
console.log(category);
```

#### Simple example of maximum entropy modelling

A test is added to the spec folder based on simple elements that have contexts that are either "0" or "1", and classes "x" and "y".
```javascript
{
  "a": "x",
  "b": {
    "data": "0"
  }
}
```
In the SE_Element class, which inherits from Element, the method generateFeatures is implemented. It creates a feature function that tests for context "0".

After setting up your own element class, the classifier can be created and trained.

#### Application to POS tagging

A more elaborate example of maximum entropy modelling is provided for part-of-speech tagging. The following steps are taken to create a classifier and apply it to a test set:
* A new element class POS_Element is created that has a word window and a tag window around the word to be tagged.
* From the Brown corpus a sample is generated consisting of POS elements.
* Feature functions are generated from the sample.
* A classifier is created and trained.
* The classifier is applied to a test set. Results are compared to a simple lexicon-based tagger.

#### References

* Adwait Ratnaparkhi, Maximum Entropy Models For Natural Language Ambiguity Resolution, University of Pennsylvania, 1998, URL: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1061&context=ircs_reports
* Darroch, J.N.; Ratcliff, D. (1972). Generalized iterative scaling for log-linear models, The Annals of Mathematical Statistics, Institute of Mathematical Statistics, 43 (5): 1470–1480.

## Sentiment Analysis

This is a simple sentiment analysis algorithm based on a vocabulary that assigns polarity to words. The algorithm calculates the sentiment of a piece of text by summing the polarity of each word and normalizing with the length of the sentence. If a negation occurs, the result is made negative. It is used as follows:
```javascript
var Analyzer = require('natural').SentimentAnalyzer;
var stemmer = require('natural').PorterStemmer;
var analyzer = new Analyzer("English", stemmer, "afinn");
// getSentiment expects an array of strings
console.log(analyzer.getSentiment(["I", "like", "cherries"]));
// 0.6666666666666666
```
The constructor has three parameters:
* Language: see below for supported languages.
* Stemmer: to increase the coverage of the sentiment analyzer a stemmer may be provided. May be `null`.
* Vocabulary: sets the type of vocabulary; `"afinn"`, `"senticon"` and `"pattern"` are valid values.

Currently, the following languages are supported, with the type of vocabulary and availability of negations (in alphabetic order):

| Language      | AFINN       | Senticon  | Pattern   | Negations |
| ------------- |:-----------:|:---------:|:---------:|:---------:|
| Basque        |             | X         |           |           |
| Catalan       |             | X         |           |           |
| Dutch         |             |           | X         | X         |
| English       | X           | X         | X         | X         |
| French        |             |           | X         |           |
| Galician      |             | X         |           |           |
| Italian       |             |           | X         |           |
| Spanish       | X           | X         |           | X         |
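
For example, a Spanish analyzer over the Senticon vocabulary can be constructed in the same way (a sketch; the exact score depends on the vocabulary files shipped with your version):

```javascript
var Analyzer = require('natural').SentimentAnalyzer;
var stemmer = require('natural').PorterStemmerEs;
var analyzer = new Analyzer("Spanish", stemmer, "senticon");
// getSentiment expects an array of tokens
console.log(analyzer.getSentiment(["me", "gustan", "las", "cerezas"]));
```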

More languages can be added by adding vocabularies and extending the map `languageFiles` in `SentimentAnalyzer.js`. In the tools folder below `lib/natural/sentiment`, some tools are provided for transforming vocabularies in Senticon and Pattern format into a JSON format.

### Acknowledgements and References

Thanks to Domingo Martín Mancera for providing the basis for this sentiment analyzer in his repo [Lorca](https://github.com/dmarman/lorca).

AFINN is a list of English words rated for valence with an integer
between minus five (negative) and plus five (positive). The words have
been manually labeled by Finn Årup Nielsen in 2009-2011. A scientific reference can be found [here](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010). We used [afinn-165](https://github.com/words/afinn-165), which is available as a nodejs module.

The Senticon vocabulary is based on work by Fermín L. Cruz and others:
Cruz, Fermín L., José A. Troyano, Beatriz Pontes, F. Javier Ortega. Building layered, multilingual sentiment lexicons at synset and lemma levels, Expert Systems with Applications, 2014.

The Pattern vocabularies are from the [Pattern project](https://github.com/clips/pattern) of the CLiPS Research Center of University of Antwerpen. These have a PDDL license.

## Phonetics

Phonetic (sounds-like) matching can be done with the [SoundEx](http://en.wikipedia.org/wiki/Soundex),
[Metaphone](http://en.wikipedia.org/wiki/Metaphone) or [DoubleMetaphone](http://en.wikipedia.org/wiki/Metaphone#Double_Metaphone) algorithms:

```javascript
var natural = require('natural');
var metaphone = natural.Metaphone;
var soundEx = natural.SoundEx;

var wordA = 'phonetics';
var wordB = 'fonetix';
```

To test the two words to see if they sound alike:

```javascript
if(metaphone.compare(wordA, wordB))
    console.log('they sound alike!');
```

The raw phonetics are obtained with `process()`:

```javascript
console.log(metaphone.process('phonetics'));
```

A maximum code length can be supplied:

```javascript
console.log(metaphone.process('phonetics', 3));
```

`DoubleMetaphone` deals with two encodings returned in an array. This
feature is experimental and subject to change:

```javascript
var natural = require('natural');
var dm = natural.DoubleMetaphone;

var encodings = dm.process('Matrix');
console.log(encodings[0]);
console.log(encodings[1]);
```

Attaching will patch String with useful methods:

```javascript
metaphone.attach();
```

`soundsLike` is essentially a shortcut to `Metaphone.compare`:

```javascript
if(wordA.soundsLike(wordB))
    console.log('they sound alike!');
```

The raw phonetics are obtained with `phonetics()`:

```javascript
console.log('phonetics'.phonetics());
```

Full text strings can be tokenized into arrays of phonetics (much like how tokenization-to-arrays works for stemmers):

```javascript
console.log('phonetics rock'.tokenizeAndPhoneticize());
```

The same module operations apply with `SoundEx`:

```javascript
if(soundEx.compare(wordA, wordB))
    console.log('they sound alike!');
```

The same String patches apply with `soundEx`:

```javascript
soundEx.attach();

if(wordA.soundsLike(wordB))
    console.log('they sound alike!');

console.log('phonetics'.phonetics());
```

## Inflectors

### Nouns

Nouns can be pluralized/singularized with a `NounInflector`:

```javascript
var natural = require('natural');
var nounInflector = new natural.NounInflector();
```

To pluralize a word (outputs "radii"):

```javascript
console.log(nounInflector.pluralize('radius'));
```

To singularize a word (outputs "beer"):

```javascript
console.log(nounInflector.singularize('beers'));
```

Like many of the other features, String can be patched to perform the operations
directly. The "Noun" suffix on the methods is necessary, as verbs will be
supported in the future.

```javascript
nounInflector.attach();
console.log('radius'.pluralizeNoun());
console.log('beers'.singularizeNoun());
```

### Numbers

Ordinal strings such as "1st" and "111th" can be produced with a CountInflector:

```javascript
var countInflector = natural.CountInflector;
```

Outputs "1st":

```javascript
console.log(countInflector.nth(1));
```

Outputs "111th":

```javascript
console.log(countInflector.nth(111));
```

### Present Tense Verbs

Present tense verbs can be pluralized/singularized with a PresentVerbInflector.
This feature is still experimental as of 0.0.42, so use with caution, and please
provide feedback.

```javascript
var verbInflector = new natural.PresentVerbInflector();
```

Outputs "becomes":

```javascript
console.log(verbInflector.singularize('become'));
```

Outputs "become":

```javascript
console.log(verbInflector.pluralize('becomes'));
```

Like many other natural modules, `attach()` can be used to patch strings with
handy methods.

```javascript
verbInflector.attach();
console.log('walk'.singularizePresentVerb());
console.log('walks'.pluralizePresentVerb());
```

## N-Grams

n-grams can be obtained for either arrays or strings (which will be tokenized
for you):

```javascript
var NGrams = natural.NGrams;
```

### bigrams

```javascript
console.log(NGrams.bigrams('some words here'));
console.log(NGrams.bigrams(['some', 'words', 'here']));
```

Both of the above output: `[ [ 'some', 'words' ], [ 'words', 'here' ] ]`

### trigrams

```javascript
console.log(NGrams.trigrams('some other words here'));
console.log(NGrams.trigrams(['some', 'other', 'words', 'here']));
```

Both of the above output: `[ [ 'some', 'other', 'words' ], [ 'other', 'words', 'here' ] ]`

### arbitrary n-grams

```javascript
console.log(NGrams.ngrams('some other words here for you', 4));
console.log(NGrams.ngrams(['some', 'other', 'words', 'here', 'for',
    'you'], 4));
```

The above outputs: `[ [ 'some', 'other', 'words', 'here' ], [ 'other', 'words', 'here', 'for' ], [ 'words', 'here', 'for', 'you' ] ]`

### padding

n-grams can also be returned with left or right padding by passing a start and/or end symbol to the bigrams, trigrams or ngrams functions.

```javascript
console.log(NGrams.ngrams('some other words here for you', 4, '[start]', '[end]'));
```

The above will output:
```
[ [ '[start]', '[start]', '[start]', 'some' ],
  [ '[start]', '[start]', 'some', 'other' ],
  [ '[start]', 'some', 'other', 'words' ],
  [ 'some', 'other', 'words', 'here' ],
  [ 'other', 'words', 'here', 'for' ],
  [ 'words', 'here', 'for', 'you' ],
  [ 'here', 'for', 'you', '[end]' ],
  [ 'for', 'you', '[end]', '[end]' ],
  [ 'you', '[end]', '[end]', '[end]' ] ]
```

For only end symbols, pass `null` for the start symbol. For instance:
```javascript
console.log(NGrams.ngrams('some other words here for you', 4, null, '[end]'));
```

Will output:
```
[ [ 'some', 'other', 'words', 'here' ],
  [ 'other', 'words', 'here', 'for' ],
  [ 'words', 'here', 'for', 'you' ],
  [ 'here', 'for', 'you', '[end]' ],
  [ 'for', 'you', '[end]', '[end]' ],
  [ 'you', '[end]', '[end]', '[end]' ] ]
```

### NGramsZH

For Chinese-like languages, you can use NGramsZH to compute n-grams; all APIs are the same:

```javascript
var NGramsZH = natural.NGramsZH;
console.log(NGramsZH.bigrams('中文测试'));
console.log(NGramsZH.bigrams(['中', '文', '测', '试']));
console.log(NGramsZH.trigrams('中文测试'));
console.log(NGramsZH.trigrams(['中', '文', '测', '试']));
console.log(NGramsZH.ngrams('一个中文测试', 4));
console.log(NGramsZH.ngrams(['一', '个', '中', '文', '测',
    '试'], 4));
```

## tf-idf

[Term Frequency–Inverse Document Frequency (tf-idf)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is implemented to determine how important a word (or words) is to a
document relative to a corpus. The following formulas are used for calculating tf and idf:
* tf(t, d) is a so-called raw count: just the count of the term in the document.
* idf(t, D) uses the formula 1 + ln(N / (1 + n_t)), where N is the number of documents and n_t the number of documents in which the term appears. The `1 +` in the denominator handles the case where n_t is 0.

The following example will add four documents to
a corpus and determine the weight of the word "node" and then the weight of the
word "ruby" in each document.

```javascript
var natural = require('natural');
var TfIdf = natural.TfIdf;
var tfidf = new TfIdf();

tfidf.addDocument('this document is about node.');
tfidf.addDocument('this document is about ruby.');
tfidf.addDocument('this document is about ruby and node.');
tfidf.addDocument('this document is about node. it has node examples');

console.log('node --------------------------------');
tfidf.tfidfs('node', function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});

console.log('ruby --------------------------------');
tfidf.tfidfs('ruby', function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});
```

The above outputs:

```
node --------------------------------
document #0 is 1
document #1 is 0
document #2 is 1
document #3 is 2
ruby --------------------------------
document #0 is 0
document #1 is 1.2876820724517808
document #2 is 1.2876820724517808
document #3 is 0
```
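
These numbers can be checked against the formulas above. For instance, the score of "ruby" in document #1 (raw count 1, with 2 of the 4 documents containing "ruby"):

```javascript
// idf = 1 + ln(N / (1 + n_t)) with N = 4 documents, n_t = 2
var idf = 1 + Math.log(4 / (1 + 2));
// tf is the raw count of "ruby" in document #1, which is 1
console.log(1 * idf); // 1.2876820724517808
```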

This approach can also be applied to individual documents.

The following example measures the term "node" in the first and second documents.

```javascript
console.log(tfidf.tfidf('node', 0));
console.log(tfidf.tfidf('node', 1));
```

A TfIdf instance can also load documents from files on disk.

```javascript
var tfidf = new TfIdf();
tfidf.addFileSync('data_files/one.txt');
tfidf.addFileSync('data_files/two.txt');
```

Multiple terms can be measured as well, with their weights being added into
a single measure value. The following example determines that the last document
is the most relevant to the words "node" and "ruby".

```javascript
var natural = require('natural');
var TfIdf = natural.TfIdf;
var tfidf = new TfIdf();

tfidf.addDocument('this document is about node.');
tfidf.addDocument('this document is about ruby.');
tfidf.addDocument('this document is about ruby and node.');

tfidf.tfidfs('node ruby', function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});
```

The above outputs:

```
document #0 is 1
document #1 is 1
document #2 is 2
```

The examples above all use strings, which causes natural to automatically tokenize the input.
If you wish to perform your own tokenization or other kinds of processing, you
can do so, then pass in the resultant arrays later. This approach allows you to bypass natural's
default preprocessing.

```javascript
var natural = require('natural');
var TfIdf = natural.TfIdf;
var tfidf = new TfIdf();

tfidf.addDocument(['document', 'about', 'node']);
tfidf.addDocument(['document', 'about', 'ruby']);
tfidf.addDocument(['document', 'about', 'ruby', 'node']);
tfidf.addDocument(['document', 'about', 'node', 'node', 'examples']);

tfidf.tfidfs(['node', 'ruby'], function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});
```

It's possible to retrieve a list of all terms in a document, sorted by their
importance.

```javascript
tfidf.listTerms(0 /*document index*/).forEach(function(item) {
    console.log(item.term + ': ' + item.tfidf);
});
```

A TfIdf instance can also be serialized and deserialized for save and recall.

```javascript
var tfidf = new TfIdf();
tfidf.addDocument('document one', 'un');
tfidf.addDocument('document Two', 'deux');
var s = JSON.stringify(tfidf);
// save "s" to disk, database or otherwise

// assuming you pulled "s" back out of storage.
var tfidf = new TfIdf(JSON.parse(s));
```

## Tries

Tries are a very efficient data structure for prefix-based searches.
Natural comes packaged with a basic Trie implementation that supports match collection along a path,
existence search and prefix search.

### Building The Trie

You need to add words to build up the dictionary of the Trie. This is an example of basic Trie set up:

```javascript
var natural = require('natural');
var Trie = natural.Trie;

var trie = new Trie();

// Add one string at a time
trie.addString("test");

// Or add many strings
trie.addStrings(["string1", "string2", "string3"]);
```

### Searching

#### Contains

The most basic operation on a Trie is to see if a search string is marked as a word in the Trie.

```javascript
console.log(trie.contains("test")); // true
console.log(trie.contains("asdf")); // false
```

### Find Prefix

The find prefix search will find the longest prefix that is identified as a word in the trie.
It will also return the remaining portion of the string which it was not able to match.

```javascript
console.log(trie.findPrefix("tester")); // ['test', 'er']
console.log(trie.findPrefix("string4")); // [null, '4']
console.log(trie.findPrefix("string3")); // ['string3', '']
```

### All Prefixes on Path

This search will return all prefix matches along the search string path.

```javascript
trie.addString("tes");
trie.addString("est");
console.log(trie.findMatchesOnPath("tester")); // ['tes', 'test'];
```

### All Keys with Prefix

This search will return all of the words in the Trie with the given prefix, or `[]` if none are found.

```javascript
console.log(trie.keysWithPrefix("string")); // ["string1", "string2", "string3"]
```

### Case-Sensitivity

By default the trie is case-sensitive. You can use it in case-_in_sensitive mode by passing `false`
to the Trie constructor.

```javascript
trie.contains("TEST"); // false

var ciTrie = new Trie(false);
ciTrie.addString("test");
ciTrie.contains("TEsT"); // true
```
In the case of the searches which return strings, all strings returned will be in lower case if you are in case-_in_sensitive mode.
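
A quick sketch of that lower-casing behaviour, continuing the `ciTrie` from above (assuming results come back in insertion order):

```javascript
ciTrie.addString("TeSt2");
console.log(ciTrie.keysWithPrefix("tes")); // [ 'test', 'test2' ], keys come back lower-cased
```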

## EdgeWeightedDigraph

EdgeWeightedDigraph represents an edge-weighted digraph. You can add edges, query the number of vertices and edges, retrieve all edges, and use toString to print the digraph.

Initialize a digraph:

```javascript
var EdgeWeightedDigraph = natural.EdgeWeightedDigraph;
var digraph = new EdgeWeightedDigraph();
digraph.add(5,4,0.35);
digraph.add(5,1,0.32);
digraph.add(1,3,0.29);
digraph.add(6,2,0.40);
digraph.add(3,6,0.52);
digraph.add(6,4,0.93);
```
The API used is `add(from, to, weight)`.

Get the number of vertices:

```javascript
console.log(digraph.v());
```
You will get 7 (vertices are numbered from 0, so the highest vertex number, 6, implies 7 vertices).

Get the number of edges:

```javascript
console.log(digraph.e());
```
You will get 6.
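
To inspect the graph, toString prints all edges (a sketch; the exact output formatting depends on the implementation):

```javascript
console.log(digraph.toString());
```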

## ShortestPathTree

ShortestPathTree represents a data type for solving the single-source shortest paths problem in
edge-weighted directed acyclic graphs (DAGs).
The edge weights can be positive, negative, or zero. There are three APIs:
getDistTo(vertex),
hasPathTo(vertex), and
pathTo(vertex).

```javascript
var ShortestPathTree = natural.ShortestPathTree;
var spt = new ShortestPathTree(digraph, 5);
```
`digraph` is an instance of EdgeWeightedDigraph; the second parameter is the start vertex of the DAG.

### getDistTo(vertex)

Returns the distance to `vertex`:

```javascript
console.log(spt.getDistTo(4));
```
The output will be: 0.35

### pathTo(vertex)

Returns the shortest path:

```javascript
console.log(spt.pathTo(4));
```

Output will be:

```javascript
[5, 4]
```
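
### hasPathTo(vertex)

Reports whether any path from the start vertex reaches `vertex` (a sketch of the assumed behaviour, given the graph above):

```javascript
console.log(spt.hasPathTo(4)); // true, vertex 4 is reachable from 5
```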

## LongestPathTree

LongestPathTree represents a data type for solving the single-source longest paths problem in
edge-weighted directed acyclic graphs (DAGs).
The edge weights can be positive, negative, or zero. It has the same three APIs as ShortestPathTree:
getDistTo(vertex),
hasPathTo(vertex), and
pathTo(vertex).

```javascript
var LongestPathTree = natural.LongestPathTree;
var lpt = new LongestPathTree(digraph, 5);
```
`digraph` is an instance of EdgeWeightedDigraph; the second parameter is the start vertex of the DAG.

### getDistTo(vertex)

Returns the distance to `vertex`:

```javascript
console.log(lpt.getDistTo(4));
```
The output will be: 2.06

### pathTo(vertex)

Returns the longest path:

```javascript
console.log(lpt.pathTo(4));
```

Output will be:

```javascript
[5, 1, 3, 6, 4]
```

## WordNet

One of the newest and most experimental features in natural is WordNet integration. To use the WordNet module,
first install the WordNet database files using [wordnet-db](https://github.com/moos/wordnet-db):

    npm install wordnet-db

Keep in mind that the WordNet integration is to be considered experimental at this point,
and not production-ready. The API is also subject to change. For an implementation with vastly increased performance, as well as a command-line interface, see [wordpos](https://github.com/moos/wordpos).

Here's an example of looking up definitions for the word "node":

```javascript
var wordnet = new natural.WordNet();

wordnet.lookup('node', function(results) {
    results.forEach(function(result) {
        console.log('------------------------------------');
        console.log(result.synsetOffset);
        console.log(result.pos);
        console.log(result.lemma);
        console.log(result.synonyms);
        console.log(result.gloss);
    });
});
```

Given a synset offset and a part of speech, a definition can be looked up directly:

```javascript
var wordnet = new natural.WordNet();

wordnet.get(4424418, 'n', function(result) {
    console.log('------------------------------------');
    console.log(result.lemma);
    console.log(result.pos);
    console.log(result.gloss);
    console.log(result.synonyms);
});
```

If you have _manually_ downloaded the WordNet database files, you can pass the folder to the constructor:

```javascript
var wordnet = new natural.WordNet('/my/wordnet/dict');
```

As of v0.1.11, WordNet data files are no longer automatically downloaded.

Princeton University "About WordNet." WordNet. Princeton University. 2010. <http://wordnet.princeton.edu>

## Spellcheck

A probabilistic spellchecker based on http://norvig.com/spell-correct.html.

This is best constructed with an array of tokens from a corpus, but a simple list of words from a dictionary will work.

```javascript
var corpus = ['something', 'soothing'];
var spellcheck = new natural.Spellcheck(corpus);
```

It uses the trie data structure for fast boolean lookup of a word:

```javascript
spellcheck.isCorrect('cat'); // false
```

It suggests corrections (sorted by probability in descending order) that are up to a maximum edit distance away from the input word. According to Norvig, a max distance of 1 will cover 80% to 95% of spelling mistakes. After a distance of 2, it becomes very slow.

```javascript
spellcheck.getCorrections('soemthing', 1); // ['something']
spellcheck.getCorrections('soemthing', 2); // ['something', 'soothing']
```

## POS Tagger

This is a part-of-speech tagger based on Eric Brill's transformational
algorithm. It needs a lexicon and a set of transformation rules.

### Usage

First a lexicon is created. The first parameter is the language (<code>EN</code> for English and <code>DU</code> for Dutch), the second is the default category.
Optionally, a third parameter can be supplied that is the default category for capitalised words.
```javascript
var natural = require("natural");
const language = "EN";
const defaultCategory = 'N';
const defaultCategoryCapitalized = 'NNP';

var lexicon = new natural.Lexicon(language, defaultCategory, defaultCategoryCapitalized);
```

Then a ruleset is created, as follows. The parameter is the language.
```javascript
var ruleSet = new natural.RuleSet('EN');
```
Now a tagger can be created by passing lexicon and ruleset:
```javascript
var tagger = new natural.BrillPOSTagger(lexicon, ruleSet);
var sentence = ["I", "see", "the", "man", "with", "the", "telescope"];
console.log(tagger.tag(sentence));
```

This outputs the following:
```javascript
Sentence {
  taggedWords:
   [ { token: 'I', tag: 'NN' },
     { token: 'see', tag: 'VB' },
     { token: 'the', tag: 'DT' },
     { token: 'man', tag: 'NN' },
     { token: 'with', tag: 'IN' },
     { token: 'the', tag: 'DT' },
     { token: 'telescope', tag: 'NN' } ] }
```

### Lexicon

The lexicon is a JSON file that has the following structure:
```javascript
{
  "word1": ["cat1"],
  "word2": ["cat2", "cat3"],
  ...
}
```

Words may have multiple categories in the lexicon file. The tagger uses only
the first category specified.

### Specifying transformation rules

Transformation rules are specified in a JSON file as follows:
```javascript
{
  "rules": [
    "OLD_CAT NEW_CAT PREDICATE PARAMETER",
    ...
  ]
}
```
Such a rule means that if the category of the current position is OLD_CAT and the predicate is true, the category is replaced by NEW_CAT. The predicate
may use the parameter in different ways. Sometimes the parameter is used for
specifying the outcome of the predicate:
```
NN CD CURRENT-WORD-IS-NUMBER YES
```
This means that if the outcome of predicate CURRENT-WORD-IS-NUMBER is YES, the
category is replaced by <code>CD</code>.
The parameter can also be used to check the category of a word in the sentence:
```
VBD NN PREV-TAG DT
```
Here the category of the previous word must be <code>DT</code> for the rule to be applied.

### Algorithm

The tagger applies transformation rules that may change the category of words. The input sentence is a Sentence object with tagged words. The tagged sentence is processed from left to right. At each step all rules are applied once; rules are applied in the order in which they are specified. Algorithm:
```javascript
Brill_POS_Tagger.prototype.applyRules = function(sentence) {
  for (var i = 0, size = sentence.taggedWords.length; i < size; i++) {
    this.ruleSet.getRules().forEach(function(rule) {
      rule.apply(sentence, i);
    });
  }
  return sentence;
};
```
The output is a Sentence object just like the input sentence.

### Adding a predicate

Predicates are defined in module <code>lib/RuleTemplates.js</code>. In that file
predicate names are mapped to metadata for generating transformation rules. The following properties must be supplied:
* Name of the predicate
* A function that evaluates the predicate (should return a boolean)
* A window <code>[i, j]</code> that defines the span of the predicate in the
sentence relative to the current position
* The number of parameters the predicate needs: 0, 1 or 2
* If relevant, a function for parameter 1 that returns its possible values
at the current position in the sentence (for generating rules in training)
* If relevant, a function for parameter 2 that returns its possible values
at the current position in the sentence (for training)

A typical entry for a rule template looks like this:
```javascript
"NEXT-TAG": {
  // maps to the predicate function
  "function": next_tag_is,
  // Minimum required window before or after current position to be a relevant predicate
  "window": [0, 1],
  // The number of parameters the predicate takes
  "nrParameters": 1,
  // Function that returns relevant values for parameter 1
  "parameter1Values": nextTagParameterValues
}
```
A predicate function accepts a Sentence object, the current position in the
sentence that should be tagged, and the outcome(s) of the predicate.
An example of a predicate that checks the category of the next word:
```javascript
function next_tag_is(sentence, i, parameter) {
  if (i < sentence.taggedWords.length - 1) {
    return(sentence.taggedWords[i + 1][1] === parameter);
  }
  else {
    return(false);
  }
}
```

A values function for a parameter returns an array of all possible parameter
values given a location in a tagged sentence.
```javascript
function nextTagParameterValues(sentence, i) {
  if (i < sentence.length - 1) {
    return [sentence[i + 1].tag];
  }
  else {
    return [];
  }
}
```

### Training

The trainer allows you to learn a new set of transformation rules from a corpus.
It takes as input a tagged corpus and a set of rule templates. The algorithm
generates positive rules (rules that apply at some location in the corpus)
from the templates and iteratively extends and optimises the rule set.

First, a corpus should be loaded. Currently, the format of the Brown corpus is supported. Then a lexicon can be created from the corpus. The lexicon is needed for tagging the sentences before the learning algorithm is applied.
```javascript
var natural = require('natural');
const JSON_FLAG = 2;

// The Corpus class; brownCorpus should hold the corpus data itself
var Corpus = require('../lib/natural/brill_pos_tagger/lib/Corpus');
var corpus = new Corpus(brownCorpus, JSON_FLAG, natural.Sentence);
var lexicon = corpus.buildLexicon();
```
The next step is to create a set of rule templates from which the learning
algorithm can generate transformation rules. Rule templates are defined in
<code>PredicateMapping.js</code>.
```javascript
var templateNames = [
  "NEXT-TAG",
  "NEXT-WORD-IS-CAP",
  "PREV-1-OR-2-OR-3-TAG",
  "...",
];
var templates = templateNames.map(function(name) {
  return new natural.RuleTemplate(name);
});
```
Using the lexicon and rule templates we can now start the trainer as follows.
```javascript
var trainer = new natural.BrillPOSTrainer(/* optional threshold */);
var ruleSet = trainer.train(corpus, templates, lexicon);
```
A threshold value can be passed to the constructor. Transformation rules with
a score below the threshold are removed after training.
The train method returns a set of transformation rules that can be used to
create a POS tagger as usual. You can also output the rule set in the right
format for later use.
```javascript
console.log(ruleSet.prettyPrint());
```

### Testing

Now we can apply the lexicon and rule set to a test set.
```javascript
var tester = new natural.BrillPOSTester();
var tagger = new natural.BrillPOSTagger(lexicon, ruleSet);
var scores = tester.test(corpora[1], tagger);
```
The test method returns an array of two percentages: the first is the ratio of correct tags after tagging with the lexicon alone; the second is the ratio of correct tags after also applying the transformation rules.
```javascript
console.log("Test score lexicon " + scores[0] + "%");
console.log("Test score after applying rules " + scores[1] + "%");
```

### Acknowledgements and References

* Part of speech tagger by Percy Wegmann, https://code.google.com/p/jspos/
* Node.js version of jspos: https://github.com/neopunisher/pos-js
* A simple rule-based part of speech tagger, Eric Brill, Published in: Proceedings ANLC '92, the third conference on Applied natural language processing, Pages 152-155. http://dl.acm.org/citation.cfm?id=974526
* Exploring the Statistical Derivation of Transformational Rule Sequences for Part-of-Speech Tagging, Lance A. Ramshaw and Mitchell P. Marcus. http://acl-arc.comp.nus.edu.sg/archives/acl-arc-090501d4/data/pdf/anthology-PDF/W/W94/W94-0111.pdf
* Brown Corpus, https://en.wikipedia.org/wiki/Brown_Corpus

## Development

When developing, please:

+ Write unit tests for Jasmine
+ Make sure your unit tests pass
+ Do not use the file system <code>fs</code>. If you need to read files, use JSON and <code>require</code>.

The current configuration of the unit tests requires the following environment variable to be set:
```bash
export NODE_PATH=.
```

License
-------
Copyright (c) 2011, 2012 Chris Umbel, Rob Ellis, Russell Mull

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

WordNet License
---------------
This license is available as the file LICENSE in any downloaded version of WordNet.
WordNet 3.0 license:

WordNet Release 3.0 This software and database is being provided to you, the LICENSEE, by Princeton University under the following license. By obtaining, using and/or copying this software and database, you agree that you have read, understood, and will comply with these terms and conditions.: Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software, database and documentation, including modifications that you make for internal use or for distribution. WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved. THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS. The name of Princeton University or Princeton may not be used in advertising or publicity pertaining to distribution of the software and/or database. Title to copyright in this software, database and any associated documentation shall at all times remain with Princeton University and LICENSEE agrees to preserve same.