Day 18: BoilerPipe--Article Extraction for Java Developers

Today for my 30 day challenge, I decided to learn how to do text and image extraction from web links using the Java programming language. This is a very common requirement in most of the content discovery websites like Prismatic. In this blog, we will learn how we can use a Java library called boilerpipe to accomplish this task.

Prerequisite

  1. Basic Java knowledge is required. Install the latest Java Development Kit (JDK) on your operating system. You can either install OpenJDK 7 or Oracle JDK 7. OpenShift supports both OpenJDK 6 and 7.

  2. Sign up for an OpenShift Account.Today for my 30 day challenge, I decided to learn how to do text and image extraction from web links using the Java programming language. This is a very common requirement in most of the content discovery websites like Prismatic. In this blog, we will learn how we can use a Java library called boilerpipe to accomplish this task.

Prerequisite

  1. Basic Java knowledge is required. Install the latest Java Development Kit (JDK) on your operating system. You can either install OpenJDK 7 or Oracle JDK 7. OpenShift supports both OpenJDK 6 and 7.

  2. Sign up for an OpenShift Account. It is completely free and Red Hat gives every user three free Gears on which to run your applications. At the time of this writing, the combined resources allocated for each user is 1.5 GB of memory and 3 GB of disk space.

  3. Install the rhc client tool on your machine. RHC is a ruby gem so you need to have ruby 1.8.7 or above on your machine. To install rhc, just typesudo gem install rhc If you already have one, make sure it is the latest one. To update your rhc, execute the command sudo gem update rhc For additional assistance setting up the rhc command-line tool, see the following page: https://www.openshift.com/developers/rhc-client-tools-install

  4. Setup your OpenShift account using the rhc setup command. This command will help you create a namespace and upload your ssh keys to OpenShift server.

Step1 : Create a JBoss EAP application

We will start with creating the demo application. The name of the application is newsapp.

$ rhc create-app newsapp jbosseap

If you have access to medium gears then you can use following command.

$ rhc create-app newsapp jbosseap -g medium

This will create an application container for us, called a gear, and setup all of the required SELinux policies and cgroup configuration. OpenShift will also setup a private git repository for us and clone the repository to the local system. Finally, OpenShift will propagate the DNS to the outside world. The application will be accessible at http://newsapp-{domain-name}.rhcloud.com/. Replace domain-name with your own unique OpenShift domain name (also sometimes called a namespace).

Step 2 : Add Maven dependencies

In the pom.xml file add the following dependency:

<dependency>
    <groupId>de.l3s.boilerpipe</groupId>
    <artifactId>boilerpipe</artifactId>
    <version>1.2.0</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.9.1</version>
</dependency>
 
<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>1.9.13</version>
</dependency>

You will also need to add a new repository

<repository>
    <id>boilerpipe-m2-repo</id>
    <url>http://boilerpipe.googlecode.com/svn/repo/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>

Also update the maven project to Java 7 by updating a couple of properties in the pom.xml file:

<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>

Now update the Maven project Right click > Maven > Update Project.

Step 3 : Enable CDI

We will be using CDI for dependency injection. CDI or Context and Dependency injection is a Java EE 6 specification which enables dependency injection in a Java EE 6 project. CDI defines type-safe dependency injection mechanism for Java EE. Almost any POJO can be injected as a CDI bean.

Create a new xml file named beans.xml in the src/main/webapp/WEB-INF folder. Replace the content of beans.xml with the following:

<beans xmlns="http://java.sun.com/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/beans_1_0.xsd">
 
</beans>

Step 4 : Create BoilerpipeContentExtractionService

Now we can create an BoilerpipeContentExtractionService service class which will take a url and find the title and article text from it.

import java.net.URL;
import java.util.Collections;
import java.util.List;
 
import com.newsapp.boilerpipe.image.Image;
import com.newsapp.boilerpipe.image.ImageExtractor;
 
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;
 
public class BoilerpipeContentExtractionService {
 
    public Content content(String url) {
        try {
            final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(url));
            final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
            String title = doc.getTitle();
 
            String content = ArticleExtractor.INSTANCE.getText(doc);
 
            final BoilerpipeExtractor extractor = CommonExtractors.KEEP_EVERYTHING_EXTRACTOR;
            final ImageExtractor ie = ImageExtractor.INSTANCE;
 
            List<Image> images = ie.process(new URL(url), extractor);
 
            Collections.sort(images);
            String image = null;
            if (!images.isEmpty()) {
                image = images.get(0).getSrc();
            }
 
            return new Content(title, content.substring(0, 200), image);
        } catch (Exception e) {
            return null;
        }
 
    }
}

The above mentioned code does the following :

  1. It first fetches the document at the given url.
  2. Then it parses the HTML document and return TextDocument.
  3. Next we get the title from the text document.
  4. Finally, we extract the content from the text and return a new instance of the application value object.

Step 5 : Enable JAX-RS

To enable JAX-RS, create a class which extends javax.ws.rs.core.Application and specify the application path using the javax.ws.rs.ApplicationPath annotation as shown below.

import javax.ws.rs.ApplicationPath;
import javax.ws.rs.core.Application;
 
@ApplicationPath("/api/v1")
public class JaxrsInitializer extends Application{
 
 
}

Step 6 : Create ContentExtractionResource

Now we will create our ContentExtractionResource class which will return a content object as JSON. Create a new class named ContentExtractionResource and replace the code with the contents shown below:

import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
 
import com.newsapp.service.BoilerpipeContentExtractionService;
import com.newsapp.service.Content;
 
@Path("/content")
public class ContentExtractionResource {
 
    @Inject
    private BoilerpipeContentExtractionService boilerpipeContentExtractionService;
 
    @GET
    @Produces(value = MediaType.APPLICATION_JSON)
    public Content extractContent(@QueryParam("url") String url) {
        return boilerpipeContentExtractionService.content(url);
    }
}

Deploy to OpenShift

Finally, deploy the changes to OpenShift

$ git add .
$ git commit -am "NewApp"
$ git push

After the code is pushed and the war is successfully deployed, we can view the application running at http://newsapp-{domain-name}.rhcloud.com. My sample application is running at http://newsapp-t20.rhcloud.com.

Now you can test by submitting a link in the application ui.

That's it for today. Keep giving feedback.

What's Next

Tags:

hi Shekhar

Very nice article.

  1. Can we get the complete source-code of this application.

  2. Use seem to use some ImageExtractor, but it's code is no where visible on this page.

Regards Nitin