Day 14: Stanford NER--How to Set Up Your Own Named Entity Recognition Server in the Cloud

I am not a huge fan of machine learning or natural language processing (NLP), but I always have ideas in mind which require them. The idea that I will explore in this post is building a real-time job search engine using Twitter data. A tweet will contain the name of the company which is offering a job, the location of the job, and the name of the contact person at the company. This requires us to parse the tweet for Person, Location, and Organisation. This type of problem falls under Named Entity Recognition.

According to Wikipedia,

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organisations, locations, expressions of times, quantities, monetary values, percentages, etc.

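To make it clearer, let us take an example. Suppose we have the following tweet (the same text that is sent in the sample request at the end of this post):

Microsoft SCCM Windows Server 2012 Web Development Expert (SME3) at PSI Pax (Baltimore, MD)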

A human can easily figure out that an organisation named PSI Pax has an opening in Baltimore. But how can we do this programmatically? The easiest way is to maintain a list of all organisations and locations and search through it. However, this solution will not scale: the list would have to contain every organisation and location that could ever appear in a tweet, and it would need constant updating.
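To see the limitation concretely, here is a minimal sketch of the list-based approach in Java. The class name DictionaryTagger and the tiny hard-coded lists are hypothetical and exist only for illustration; the sample text is the one used in the request at the end of this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the "maintain a list and search it" approach.
final class DictionaryTagger {

    // Toy lists (assumptions); a real system would have to enumerate every known name.
    private static final List<String> ORGANISATIONS = Arrays.asList("Microsoft", "Red Hat");
    private static final List<String> LOCATIONS = Arrays.asList("Baltimore", "London");

    // Returns "name=TYPE" for every listed name that occurs in the text.
    static List<String> tag(String text) {
        List<String> found = new ArrayList<String>();
        for (String name : ORGANISATIONS) {
            if (text.contains(name)) {
                found.add(name + "=ORGANIZATION");
            }
        }
        for (String name : LOCATIONS) {
            if (text.contains(name)) {
                found.add(name + "=LOCATION");
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String tweet = "Microsoft SCCM Windows Server 2012 Web Development Expert (SME3) "
                + "at PSI Pax (Baltimore, MD)";
        // PSI Pax is missed because it is not in the list.
        System.out.println(tag(tweet)); // prints [Microsoft=ORGANIZATION, Baltimore=LOCATION]
    }
}
```

The lookup only finds names that someone has already added to the lists, which is exactly why a trained recognizer is a better fit for open-ended text such as tweets.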

Today, in this blog post, I will cover how we can set up our own NER server using the Stanford NER package.

What is Stanford NER?

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in text which are the names of things, such as person and company names, or gene and protein names.

Prerequisites

  1. Basic Java knowledge is required. Install the latest Java Development Kit (JDK) on your operating system. You can install either OpenJDK 7 or Oracle JDK 7. OpenShift supports OpenJDK 6 and 7.

  2. Download the Stanford NER package from the official website.

  3. Sign up for an OpenShift account. It is completely free, and Red Hat gives every user three free gears on which to run applications. At the time of this writing, the combined resources allocated to each user are 1.5 GB of memory and 3 GB of disk space.

  4. Install the rhc client tool on your machine. rhc is a Ruby gem, so you need to have Ruby 1.8.7 or above on your machine. To install rhc, just type:

sudo gem install rhc

If you already have rhc installed, make sure it is the latest version. To update rhc, execute the command shown below.

sudo gem update rhc

For additional assistance setting up the rhc command-line tool, see the following page: https://openshift.redhat.com/community/developers/rhc-client-tools-install

  5. Set up your OpenShift account using the rhc setup command. This command will help you create a namespace and upload your SSH keys to the OpenShift server.

Step 1 : Create a JBoss EAP application

We will start with creating the demo application. The name of the application is nerdemo.

$ rhc create-app nerdemo jbosseap

If you have access to medium gears, you can use the following command.

$ rhc create-app nerdemo jbosseap -g medium

This will create an application container for us, called a gear, and set up all of the required SELinux policies and cgroup configuration. OpenShift will also set up a private git repository for us and clone the repository to the local system. Finally, OpenShift will propagate the DNS to the outside world. The application will be accessible at http://nerdemo-{domain-name}.rhcloud.com/. Replace domain-name with your own unique OpenShift domain name (also sometimes called a namespace).

Step 2 : Add Maven dependency

In the pom.xml file add the following dependency:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.2.0</version>
</dependency>

Also update the maven project to Java 7 by updating a couple of properties in the pom.xml file:

<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>

Now update the Maven project: right-click the project > Maven > Update Project.

Step 3 : Enable CDI

We will be using CDI for dependency injection. CDI, or Contexts and Dependency Injection, is a Java EE 6 specification which enables dependency injection in a Java EE 6 project. CDI defines a type-safe dependency injection mechanism for Java EE. Almost any POJO can be injected as a CDI bean.

Create a new xml file named beans.xml in the src/main/webapp/WEB-INF folder. Replace the content of beans.xml with the following:

<beans xmlns="http://java.sun.com/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/beans_1_0.xsd">
 
</beans>

Step 4 : Application Scoped Classifier Bean

Now we can create an ApplicationScoped bean which will create the instance of CRFClassifier. This classifier is used to detect person, location, and organization names in text.

package com.nerdemo;
 
import javax.annotation.PostConstruct;
import javax.enterprise.context.ApplicationScoped;
import javax.enterprise.inject.Produces;
import javax.inject.Named;
 
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
 
@ApplicationScoped
public class ClassifierConfig {
 
    // Location of the serialized 3-class model copied into src/main/resources
    private String serializedClassifier = "classifiers/english.all.3class.distsim.crf.ser.gz";
    private CRFClassifier<CoreLabel> classifier;
 
    // Load the model once at startup so it is not reloaded on every request
    @PostConstruct
    public void postConstruct() {
        this.classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
    }
 
    // Expose the loaded classifier so other beans can receive it via @Inject
    @Produces
    @Named
    public CRFClassifier<CoreLabel> classifier() {
        return classifier;
    }
}

Copy the english.all.3class.distsim.crf.ser.gz classifier from the Stanford NER download package to the src/main/resources/classifiers folder.

Step 5 : Enable JAX-RS

To enable JAX-RS, create a class which extends javax.ws.rs.core.Application and specify the application path using the javax.ws.rs.ApplicationPath annotation, as shown below.

package com.nerdemo;
 
import javax.ws.rs.ApplicationPath;
import javax.ws.rs.core.Application;
 
@ApplicationPath("/api/v1")
public class JaxrsInitializer extends Application {
    // Intentionally empty: the annotation sets the base path for all REST resources
}

Step 6 : Create ClassifierRestResource

Now we will create our ClassifierRestResource, which will return a NER result. Create a new class ClassifierRestResource and replace the code with the contents shown below:

package com.nerdemo;
 
import java.util.ArrayList;
import java.util.List;
 
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
 
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
 
@Path("/classify")
public class ClassifierRestResource {
 
    @Inject
    private CRFClassifier<CoreLabel> classifier;
 
    @GET
    @Path(value = "/{text}")
    @Produces(value = MediaType.APPLICATION_JSON)
    public List<Result> findNer(@PathParam("text") String text) {
        // One inner list per sentence, one CoreLabel per token
        List<List<CoreLabel>> classify = classifier.classify(text);
        List<Result> results = new ArrayList<>();
        for (List<CoreLabel> coreLabels : classify) {
            for (CoreLabel coreLabel : coreLabels) {
                String word = coreLabel.word();
                String answer = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);
                // "O" marks tokens that are not part of any named entity; skip them
                if (!"O".equals(answer)) {
                    results.add(new Result(word, answer));
                }
            }
        }
        return results;
    }
}
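The resource above returns a list of Result objects, but the Result class itself is not shown in this post. Here is a minimal sketch, assuming a plain JavaBean whose two properties match the word and answer fields of the JSON response shown later. The no-argument constructor and the getters and setters are included on the assumption that the JSON provider serialises bean properties; treat this as an illustration rather than the original class.

```java
// Hypothetical value object for one recognized token; not taken from the original post.
// In the project, place it in package com.nerdemo and declare it public (Result.java).
class Result {

    private String word;
    private String answer;

    // No-argument constructor, assumed to be required by the JSON provider.
    public Result() {
    }

    public Result(String word, String answer) {
        this.word = word;
        this.answer = answer;
    }

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public String getAnswer() {
        return answer;
    }

    public void setAnswer(String answer) {
        this.answer = answer;
    }

    @Override
    public String toString() {
        return word + "/" + answer;
    }
}
```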

Deploy to OpenShift

Finally, deploy the changes to OpenShift:

$ git add .
$ git commit -am "NER demo app"
$ git push

After the code is pushed and the WAR is successfully deployed, we can view the application running at http://nerdemo-{domain-name}.rhcloud.com. My sample application is running at http://nerdemo-t20.rhcloud.com.
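Because the text to classify travels in the URL path, spaces and other reserved characters have to be percent-encoded before the request is sent. A minimal sketch of building such a URL in Java follows; the class and method names (ClassifyUrlBuilder, encodePathSegment, buildUrl) are hypothetical, and the base URL is the sample application mentioned above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a request URL for the /api/v1/classify endpoint.
final class ClassifyUrlBuilder {

    // Percent-encodes text so it can be used as a single URL path segment.
    static String encodePathSegment(String text) {
        try {
            // URLEncoder produces form encoding, which writes a space as '+'.
            // In a URL path a space must be %20, so the '+' signs are replaced.
            // A literal '+' in the input is already encoded as %2B at this point.
            return URLEncoder.encode(text, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    // Appends the endpoint path used in this post and the encoded text to the base URL.
    static String buildUrl(String baseUrl, String text) {
        return baseUrl + "/api/v1/classify/" + encodePathSegment(text);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://nerdemo-t20.rhcloud.com", "PSI Pax (Baltimore, MD)"));
    }
}
```

This sketch also encodes the parentheses and the comma, which the hand-written URL below leaves as they are; both forms should decode to the same text on the server.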

Now make a request to http://nerdemo-t20.rhcloud.com/api/v1/classify/Microsoft%20SCCM%20Windows%20Server%202012%20Web%20Development%20Expert%20(SME3)%20at%20PSI%20Pax%20(Baltimore,%20MD)

and you will get a JSON response like this:

[
{"word":"Microsoft","answer":"ORGANIZATION"},
{"word":"PSI","answer":"ORGANIZATION"},
{"word":"Pax","answer":"ORGANIZATION"},
{"word":"Baltimore","answer":"LOCATION"}
]
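Note that the response lists PSI and Pax as separate tokens, although together they form one organisation name. One possible post-processing step is to merge consecutive tokens that carry the same label. The sketch below is an illustration under stated assumptions, not part of the application above: the class EntityMerger is hypothetical, the sample tokens are written by hand, and the method expects the full token sequence including tokens labelled O. Once the O tokens are filtered out, as the resource above does, Microsoft and PSI would look adjacent and be merged by mistake.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing step: joins consecutive tokens that share a label.
final class EntityMerger {

    // Each token is a {word, label} pair; the label "O" marks a non-entity token.
    static List<String> merge(List<String[]> tokens) {
        List<String> entities = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        String currentLabel = "O";
        for (String[] token : tokens) {
            String word = token[0];
            String label = token[1];
            if (label.equals(currentLabel) && !"O".equals(label)) {
                // Same entity type as the previous token: extend the current entity.
                current.append(' ').append(word);
            } else {
                // Label changed: emit the finished entity, if any, and start over.
                if (!"O".equals(currentLabel)) {
                    entities.add(current + "=" + currentLabel);
                }
                current.setLength(0);
                current.append(word);
                currentLabel = label;
            }
        }
        // Emit the entity that was still open at the end of the input.
        if (!"O".equals(currentLabel)) {
            entities.add(current + "=" + currentLabel);
        }
        return entities;
    }

    public static void main(String[] args) {
        // Hand-written sample tokens (an assumption, not real classifier output).
        List<String[]> tokens = new ArrayList<String[]>();
        tokens.add(new String[] {"Microsoft", "ORGANIZATION"});
        tokens.add(new String[] {"SCCM", "O"});
        tokens.add(new String[] {"at", "O"});
        tokens.add(new String[] {"PSI", "ORGANIZATION"});
        tokens.add(new String[] {"Pax", "ORGANIZATION"});
        tokens.add(new String[] {"(", "O"});
        tokens.add(new String[] {"Baltimore", "LOCATION"});
        // prints [Microsoft=ORGANIZATION, PSI Pax=ORGANIZATION, Baltimore=LOCATION]
        System.out.println(merge(tokens));
    }
}
```

A known limitation of this simple rule is that two different entities of the same type that appear directly next to each other would be merged into one.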

That's it for today. Keep giving feedback.
