Day 16: Goose Extractor--An Article Extractor That Just Works

Today, for my 30-day challenge, I decided to learn how to do article extraction using the Python programming language. I have been interested in article extraction for a few months, ever since I wanted to write a Prismatic clone. Prismatic creates a news feed based on user interests. Extracting an article's main content, images, and other meta information is a very common requirement in content discovery websites like Prismatic. In this blog post, we will learn how to use a Python package called goose-extractor to accomplish this task. We will first cover some basics, and then we will develop a simple Flask application that uses the Goose Extractor API.

What is Goose Extractor?

Goose Extractor is an open source article extraction library written in Python. It can be used to extract an article's main text, main image, embedded videos, meta description, and meta tags. Goose was originally written in Java by Gravity.com and was most recently converted to a Scala project.

From the Goose Extractor website:

Goose Extractor is a complete rewrite in python. The aim of the software is to take any news article or article type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.
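To make the quote concrete, here is a minimal sketch of reading those fields from an extracted article. The attribute names title, meta_description, cleaned_text, and top_image are the ones shown in the python-goose README. The names summarize, FakeArticle, and FakeImage are hypothetical stand-ins introduced only so the snippet runs offline, and the sketch assumes top_image can be None when no image is found.

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic the attributes read from a goose article,
# so the sketch runs without the library or network access.
FakeImage = namedtuple("FakeImage", ["src"])
FakeArticle = namedtuple(
    "FakeArticle", ["title", "meta_description", "cleaned_text", "top_image"])


def summarize(article, limit=250):
    """Collect the commonly used fields of an extracted article into a dict."""
    image = article.top_image
    return {
        "title": article.title,
        "description": article.meta_description,
        "text": article.cleaned_text[:limit],
        # Assumption: top_image may be None when no image is found.
        "image": image.src if image is not None else None,
    }


sample = FakeArticle(
    title="Hello",
    meta_description="A short post",
    cleaned_text="x" * 300,
    top_image=None,
)
summary = summarize(sample)
print("%s | %d chars | image=%s"
      % (summary["title"], len(summary["text"]), summary["image"]))
```

With the real library, the object returned by Goose().extract(url=...) would be passed to summarize in place of FakeArticle.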

Why should I care?

The reasons I decided to learn Goose Extractor are as follows:

  1. I wanted to develop applications which require article extraction. Goose Extractor stands on the strong shoulders of NLTK and Beautiful Soup, which are the leading libraries for text processing and HTML parsing.

  2. I wanted to learn how article extraction can be done in Python.

Install Goose Extractor

Before we can get started with Goose Extractor, we need to install Python and virtualenv on the machine. The Python version I am using in this blog post is 2.7.

We will use pip to install Goose Extractor. For developers unaware of pip, it is the Python package manager. We can install pip from the official website. Go to any convenient directory on your file system, and run the following commands.

$ mkdir myapp
$ cd myapp
$ virtualenv venv --python=python2.7
$ . venv/bin/activate
$ pip install goose-extractor

The commands above create a myapp directory on the local machine, create and activate a virtualenv with Python 2.7, and then install the goose-extractor package.
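A quick way to confirm the install worked is to check that the package can be imported inside the activated virtualenv. The sketch below is only an illustration: is_installed is a hypothetical helper, and "goose" is the import name used later in this post.

```python
import importlib


def is_installed(module_name):
    """Return True if the module can be imported in the current environment."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


# "goose" is the module name that the goose-extractor package provides.
status = "ok" if is_installed("goose") else "missing"
print("goose: %s" % status)
```

If the script reports the module as missing, the virtualenv is most likely not activated in the current shell.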

GitHub Repository

The code for today's demo application is available on GitHub: day16-goose-extractor-demo.

Application

The demo application is running on OpenShift at http://gooseextractor-t20.rhcloud.com/. It is a very simple example of using the Goose Extractor API. Users can submit a link, and the application will show the title, main image, and first 250 characters of the main text.

Goose Extractor Demo app running on OpenShift

We will develop a simple Flask application which will expose a REST API. If you are not familiar with Flask, you can refer to my earlier post on it.

Next we will install the Flask framework. To do so, we will first activate the virtualenv and then use pip to install Flask.

$ . venv/bin/activate
$ pip install flask

As I mentioned in my earlier blog post on Flask, it is awesome for writing REST based web services. Create a new file called app.py under the myapp folder.

$ touch app.py

Copy the following code and paste it in the app.py source file:

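from flask import Flask, request, render_template, jsonify
from goose import Goose

app = Flask(__name__)


@app.route('/')
@app.route('/index')
def index():
    # Serve the form page from templates/index.html.
    return render_template('index.html')


@app.route('/api/v1/extract')
def extract():
    url = request.args.get('url')
    if not url:
        # Without a url query parameter there is nothing to extract.
        return jsonify({'error': 'missing url parameter'}), 400
    g = Goose()
    article = g.extract(url=url)
    # top_image can be None when Goose finds no suitable image.
    image = article.top_image.src if article.top_image else None
    response = {'title': article.title,
                'text': article.cleaned_text[:250],
                'image': image}
    return jsonify(response)


if __name__ == "__main__":
    # debug=True enables the in-browser debugger and automatic reloading.
    app.run(debug=True)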

The code shown above does the following:

  1. It imports the Flask class, request object, jsonify function, and render_template function from the flask package.

  2. It imports the Goose class from the goose package.

  3. It defines a route for the '/' and '/index' URLs. So, if a user makes a GET request to either '/' or '/index', then index.html will be rendered.

  4. It defines a route for the '/api/v1/extract' URL. We first get the 'url' query parameter from the request object. Then, we create an instance of the Goose class. Next, we extract the article, and finally, we create a JSON object and return it. The JSON object contains the title, the first 250 characters of the cleaned text, and the main image of the article.

  5. Finally, the development server starts when we run the application with the python app.py command. We also enable debugging by passing debug=True. Debug mode provides an interactive debugger in the browser when an unexpected exception occurs. Another benefit of debug mode is that the server automatically reloads the application when the code changes. We can keep the server running in the background while we work on the application. This provides a highly productive environment.
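To try the endpoint without a browser, a small client can build the request URL and decode the JSON reply. This is a sketch under a few assumptions: the development server is listening on Flask's default address http://127.0.0.1:5000, and build_extract_url and fetch_summary are hypothetical helper names. The important detail is that the article URL is percent-encoded, so any query string it carries stays inside the single url parameter.

```python
import json

try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import urlopen
except ImportError:  # Python 2.7, the version used in this post
    from urllib import urlencode
    from urllib2 import urlopen

# Assumption: the Flask development server runs on its default host and port.
BASE_URL = "http://127.0.0.1:5000"


def build_extract_url(article_url, base_url=BASE_URL):
    """Return the extract endpoint URL with article_url percent-encoded."""
    return base_url + "/api/v1/extract?" + urlencode({"url": article_url})


def fetch_summary(article_url, base_url=BASE_URL):
    """Call the endpoint and decode its JSON body (needs the server running)."""
    response = urlopen(build_extract_url(article_url, base_url))
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()


# Building the URL needs no network access, so it is safe to run anywhere.
print(build_extract_url("http://example.com/post?id=1&lang=en"))
```

Calling fetch_summary with an article URL while python app.py is running should return a dictionary with the title, text, and image keys described above.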

The index() function renders an HTML file. Create a new folder called templates in the myapp directory and then create a new file named index.html.

$ mkdir templates
$ touch templates/index.html

Copy the following content to the index.html source file, which uses Twitter Bootstrap to add style. We also use jQuery to make a REST call when the form is submitted.

<!DOCTYPE html>
<html>
<head>
    <title>Extract Title, Text, and Image from URL</title>
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="stylesheet" type="text/css" href="static/css/bootstrap.css">
    <style type="text/css">
    body {
      padding-top:60px;
      padding-bottom: 60px;
    }
  </style>
</head>
<body>
 
<div class="navbar navbar-inverse navbar-fixed-top">
      <div class="container">
        <div class="navbar-header">
          <button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
          </button>
          <a class="navbar-brand" href="#">TextExtraction</a>
        </div>
 
    </div>
  </div>
 
<div id="main" class="container">
    <form class="form-horizontal" role="form" id="myform">
        <div class="form-group">
            <div class="col-lg-4">
                <input type="url" id="url" name="url"  class="form-control" placeholder="Url you want to parse" required>
            </div>
        </div>
        <div class="form-group">
            <input type="submit" value="Extract" id="submitUrl" class="btn btn-success">
        </div>
    </form>
</div>
 
<div id="loading" style="display:none;" class="container">
    <img src="/static/images/loader.gif" alt="Please wait.." />
</div>
 
<div id="result" class="container">
 
</div>
 
<script type="text/javascript" src="static/js/jquery.js"></script>
<script type="text/javascript">
    $("#myform").on("submit", function(event){
        $("#result").empty();
        event.preventDefault();
        $('#loading').show();
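        var url = $("#url").val();
        // Encode the URL so its own query string stays inside the url parameter.
        $.get('/api/v1/extract?url='+encodeURIComponent(url),function(result){
            $('#loading').hide();
            $("#result").append("<h4>"+result.title+"</h4>");
            // Only add the image when the API returned one.
            if (result.image) {
                $("#result").append("<img src='"+result.image+"' height='300' width='300'>");
            }
            $("#result").append("<p class='lead'>"+result.text+"</p>");
        });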
 
 
    });
 
</script>
</body>
</html>

You can copy the JS and CSS files from my GitHub repository.

In the HTML file shown above, we make a REST call on form submission. After we receive the response, we append it to the result div.

Deploy to the cloud

Before we can deploy the application to our cloud environment, we'll have to do a few setup tasks:

  1. Sign up for an OpenShift account. It is completely free, and Red Hat gives every user three free Gears on which to run applications. At the time of this writing, the combined resources allocated for each user are 1.5 GB of memory and 3 GB of disk space.

  2. Install the rhc client tool on your machine. rhc is a Ruby gem, so you need Ruby 1.8.7 or above on your machine. To install rhc, just type:

     $ sudo gem install rhc

     If you already have rhc installed, make sure it is the latest version. To update rhc, execute the command shown below.

     $ sudo gem update rhc

     For additional assistance setting up the rhc command-line tool, see the following page: https://www.openshift.com/developers/rhc-client-tools-install

  3. Set up your OpenShift account using the rhc setup command. This command will help you create a namespace and upload your SSH keys to the OpenShift server.

To deploy the application on OpenShift just type the command shown below.

$ rhc create-app day16demo python-2.7 --from-code https://github.com/shekhargulati/day16-goose-extractor-demo.git --timeout 180

This command does everything: it creates the application, sets up public DNS, creates a private Git repository, and finally deploys the application using code from my GitHub repository. The application will be deployed at http://day16demo-{domain-name}.rhcloud.com. Please replace {domain-name} with your account domain name. The app is running here: http://gooseextractor-t20.rhcloud.com/

That's it for today. Keep giving feedback.

I tried the demo, it returns a 500 error when you submit a URL.

http://gooseextractor-t20.rhcloud.com/ seems to be working for me.

Working fine for me as well.