Apache OpenNLP


In this article we will explore how you might integrate a natural language processing (NLP) library into your product, focusing on Apache OpenNLP, a Java library. First off, let’s discuss how your product might use an NLP library. Typical cases are when your product consumes (1) textual information from a third-party system or (2) user-generated textual information.

An NLP library enables you to programmatically understand the text: the language it is written in (English, German, Italian, etc), the sentences that make up the text, and the part of speech of each word. Parts of speech are particularly interesting because you can pull out all the nouns or proper nouns in the text for use in labeling or tagging workflows.

Text normalization

Before parsing your text, it is important to “normalize” it, which means stripping out content that is not ordinary words, such as emojis, numbers, hashtags and URLs. OpenNLP includes a set of standard normalizers. The code below is essentially a normalization pipeline that applies them one after another:

import opennlp.tools.util.normalizer.*; // the standard normalizers live in this package

// Normalizer for emojis => removes emojis from text
text = EmojiCharSequenceNormalizer.getInstance().normalize(text).toString();
LOGGER.info("Emoji char normalized Text\n{}", text);

// Normalizer for numbers => removes numbers from text
text = NumberCharSequenceNormalizer.getInstance().normalize(text).toString();
LOGGER.info("Number char normalized Text\n{}", text);

// Normalizer to shrink repeated spaces / chars => collapses repeated spaces down to single
text = ShrinkCharSequenceNormalizer.getInstance().normalize(text).toString();
LOGGER.info("Shrink char normalized Text\n{}", text);

// Normalizer for Twitter character sequences => removes #hashtags and @usernames
text = TwitterCharSequenceNormalizer.getInstance().normalize(text).toString();
LOGGER.info("Twitter char normalized Text\n{}", text);

// Normalizer that removes URLs and email addresses
text = UrlCharSequenceNormalizer.getInstance().normalize(text).toString();
LOGGER.info("Url char normalized Text\n{}", text);

The same logic can be combined using the AggregateCharSequenceNormalizer, which runs the passed-in normalizers in sequential order:

text = new AggregateCharSequenceNormalizer(
       EmojiCharSequenceNormalizer.getInstance(),
       NumberCharSequenceNormalizer.getInstance(),
       ShrinkCharSequenceNormalizer.getInstance(),
       TwitterCharSequenceNormalizer.getInstance(),
       UrlCharSequenceNormalizer.getInstance()
   ).normalize(text).toString();

Pick the standard normalizers that suit your workflow, and add your own normalization steps where needed.
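Adding your own normalization step can be sketched as follows. This is a minimal, JDK-only illustration rather than OpenNLP's API: TextNormalizer is a hypothetical stand-in that mirrors the single normalize method of OpenNLP's CharSequenceNormalizer interface, and HtmlTagNormalizer, WhitespaceNormalizer and NormalizerPipeline are made-up names. In a real project the custom steps would presumably implement CharSequenceNormalizer instead, so they can be passed to AggregateCharSequenceNormalizer alongside the standard ones.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in mirroring the shape of OpenNLP's CharSequenceNormalizer
// (one normalize method), so this sketch compiles without the library.
interface TextNormalizer {
    CharSequence normalize(CharSequence text);
}

// Hypothetical custom step: replaces HTML-like tags such as <b> or </p> with a space.
class HtmlTagNormalizer implements TextNormalizer {
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    @Override
    public CharSequence normalize(CharSequence text) {
        return TAG.matcher(text).replaceAll(" ");
    }
}

// Hypothetical custom step: collapses runs of whitespace and trims both ends.
class WhitespaceNormalizer implements TextNormalizer {
    private static final Pattern SPACES = Pattern.compile("\\s+");

    @Override
    public CharSequence normalize(CharSequence text) {
        return SPACES.matcher(text).replaceAll(" ").trim();
    }
}

// Runs the given steps in order, feeding each step's output to the next.
class NormalizerPipeline implements TextNormalizer {
    private final List<TextNormalizer> steps;

    NormalizerPipeline(TextNormalizer... steps) {
        this.steps = List.of(steps);
    }

    @Override
    public CharSequence normalize(CharSequence text) {
        CharSequence current = text;
        for (TextNormalizer step : steps) {
            current = step.normalize(current);
        }
        return current;
    }
}
```

For example, new NormalizerPipeline(new HtmlTagNormalizer(), new WhitespaceNormalizer()).normalize("<p>Hello   <b>world</b></p>") yields "Hello world". Order matters: the whitespace step runs last so that it also cleans up the spaces left behind by the tag removal.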

Sentence parsing

For sentence parsing, two classes are used: SentenceModel and SentenceDetectorME, where ME stands for maximum entropy. Notice that the line below that creates the SentenceModel object passes in an InputStream. This InputStream reads a .bin file (the NLP_SENTENCE_MODEL path), which is a pre-trained NLP model. OpenNLP provides pre-trained models for a number of common languages, downloadable from the OpenNLP website.

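import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
// Resource and ClassPathResource are assumed to be Spring's classes
import org.springframework.core.io.ClassPathResource;
import org.springframework.core.io.Resource;

// Classpath location of the pre-trained English sentence model
String NLP_SENTENCE_MODEL = "/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin";
Resource sentenceResource = new ClassPathResource(NLP_SENTENCE_MODEL);
try (InputStream sentenceModelIS = sentenceResource.getInputStream()) {

   // Load the model, then wrap it in a detector
   SentenceModel model = new SentenceModel(sentenceModelIS);
   SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);

   // sentDetect returns one String per detected sentence
   for (String sentence : sentenceDetector.sentDetect(text)) {
       LOGGER.info("Sentence\n{}", sentence);
   }
}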

Parts of speech (POS) parsing

Parts of speech parsing takes sentences as its input, so the text is first split into sentences by OpenNLP as described above. It uses four classes: TokenizerModel and TokenizerME, and POSModel and POSTaggerME. As with the sentence model, the TokenizerModel and POSModel objects are created from InputStreams that read .bin files (the NLP_TOKEN_MODEL and NLP_POS_MODEL paths below). These are pre-trained NLP models, also downloadable from the OpenNLP website.

In the code below, we use the tokenizer to split each sentence into tokens (its discrete parts), then the parts of speech tagger to tag each token (noun, proper noun, verb, etc). The tagger also gives us a probability for each tag, which indicates how confident the model was in that assignment.

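import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// NLP_SENTENCE_MODEL is the constant from the sentence parsing example above
String NLP_TOKEN_MODEL = "/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin";
String NLP_POS_MODEL = "/models/opennlp-en-ud-ewt-pos-1.0-1.9.3.bin";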

Resource sentenceResource = new ClassPathResource(NLP_SENTENCE_MODEL);
Resource tokenResource = new ClassPathResource(NLP_TOKEN_MODEL);
Resource posResource = new ClassPathResource(NLP_POS_MODEL);
try (InputStream sentenceModelIS = sentenceResource.getInputStream();
    InputStream tokenModelIS = tokenResource.getInputStream();
    InputStream posModelIS = posResource.getInputStream()) {

   SentenceModel model = new SentenceModel(sentenceModelIS);
   SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);

   TokenizerModel tokenizerModel = new TokenizerModel(tokenModelIS);
   TokenizerME tokenizer = new TokenizerME(tokenizerModel);

   POSModel posModel = new POSModel(posModelIS);
   POSTaggerME posTagger = new POSTaggerME(posModel);

   for (String sentence : sentenceDetector.sentDetect(text)) {
       String[] tokens = tokenizer.tokenize(sentence);
       String[] tags = posTagger.tag(tokens);
       double[] tagProbs = posTagger.probs();

       for (int i = 0; i < tags.length; i++) {
           double tagProb = tagProbs[i];
           
           String token = tokens[i];
           String tag = tags[i];

           if ("NOUN".equals(tag)) {
               LOGGER.info("Noun token '{}', tag {} prob {}", token, tag, tagProb);
           }
           if ("PROPN".equals(tag)) {
               LOGGER.info("Proper noun token '{}', tag {} prob {}", token, tag, tagProb);
           }
           if ("VERB".equals(tag)) {
               LOGGER.info("Verb token '{}', tag {} prob {}", token, tag, tagProb);
           }
       }
   }
}
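
One way to use these tags in a labeling or tagging workflow is to keep only the nouns and proper nouns the model was reasonably confident about. The sketch below assumes the parallel tokens, tags and tagProbs arrays produced in the loop above. The LabelCandidates class, its collect method and the minProb threshold are hypothetical names for illustration, and a suitable threshold would need to be tuned on your own text.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class LabelCandidates {

    // Returns the distinct NOUN and PROPN tokens whose tag probability is at
    // least minProb, in order of first appearance. The three arrays are assumed
    // to be parallel: tokens[i] was tagged tags[i] with probability probs[i].
    static Set<String> collect(String[] tokens, String[] tags, double[] probs, double minProb) {
        Set<String> labels = new LinkedHashSet<>();
        for (int i = 0; i < tags.length; i++) {
            boolean nounLike = "NOUN".equals(tags[i]) || "PROPN".equals(tags[i]);
            if (nounLike && probs[i] >= minProb) {
                labels.add(tokens[i]);
            }
        }
        return labels;
    }
}
```

Inside the sentence loop this could be called as LabelCandidates.collect(tokens, tags, tagProbs, 0.8), where 0.8 is an arbitrary cutoff chosen for the example.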
