Getting Started with Lucene and Full Text Indexing

Greetings Shifters, today I am going to give you a quick introduction to Lucene. For those of you who haven't heard of it, Lucene is probably one of the earliest NoSQL datastores, and its sweet spot is full-text indexing. Today's post covers just the basics of how to index content you already have. The next set of posts will cover searching the index and adding spatial capabilities.

Today we are going to take the JSON files I created for the national park web service and index the name field. I am going to use Jackson to parse the JSON in Java. Then we are going to index it with Lucene 4.4. Finally, I am going to use Luke (actually a custom version that supports Lucene 4.4) to look at the contents of the index.

I used NetBeans 7.4 Beta to work on the code for this project, but it is simple to run this code in any IDE. All the code (including the necessary jar file and data file) can be found in a GitHub repo I put together.

What is Lucene?

Lucene is a set of libraries and file formats for excellent full-text searching. It includes a bunch of libraries that have really interesting ways to analyze text (even in multiple languages). It is also the basis for more complete indexing and searching solutions such as ElasticSearch and Solr. You can think of Lucene as the file format and the lower-level libraries that work with it, while Solr and ElasticSearch add a lot of functionality on top. As a rough analogy, you can think of Lucene like TIFF and LibTIFF versus GIMP or Photoshop.

The basic idea with Lucene is that you take data and place it in fields that are stored, indexed, or both indexed and stored. Indexed means you can search against that field. Stored means you can retrieve its contents; a field that is stored but not indexed cannot be searched. There are also non-stored and non-indexed fields, but they are primarily used for the storage of metadata.
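To make the indexed versus stored distinction concrete, here is a toy sketch in plain Java (assuming Java 8 or later). It is only a mental model and not how Lucene is implemented; the class name ToyIndex, the field names, and the sample values are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: shows what "indexed" and "stored" each buy you. */
class ToyIndex {
    // "field:term" -> ids of the documents containing that term (the indexed side)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // per document: field name -> original value (the stored side)
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    /** Creates an empty document and returns its id. */
    int newDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    /** Adds a field; the two flags decide whether it is searchable, retrievable, or both. */
    void addField(int docId, String field, String value, boolean indexed, boolean stored) {
        if (indexed) {
            // Crude "analysis": lowercase the value and split on non-word characters.
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        if (stored) {
            storedFields.get(docId).put(field, value);
        }
    }

    /** Returns the ids of documents whose indexed field contains the term. */
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    /** Returns the stored value of a field, or null if it was not stored. */
    String stored(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int doc = index.newDocument();
        index.addField(doc, "name", "Acadia National Park", true, true); // indexed and stored
        index.addField(doc, "position", "[-68.21, 44.35]", false, true); // stored only
        System.out.println(index.search("name", "acadia"));  // finds the document
        System.out.println(index.search("position", "44"));  // finds nothing: not indexed
        System.out.println(index.stored(doc, "position"));   // still retrievable
    }
}
```

The name field can be both searched and read back, while the position field can only be read back, which mirrors how the name and position are treated later in this post.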

You take the information, put it into fields, put those fields into a "document", and then add the document to the index. The index is a set of files on disk (or in memory, such as on a RAM disk). An index consists of multiple files, and the files are platform independent. You can build an index on a Windows machine and then use it on a Linux machine, or vice versa. When moving the index, you move all the files in the directory.

The information to be put in the index can come from any source you can read into Java (there is also a Python port called PyLucene, but I have no experience with it). For example, you could use PDFBox to extract information from PDF files or POI to read information from Microsoft Office files and put it into Lucene. I used both libraries when I worked with Lucene on the Yale Economic Growth Center Digital Library, which took paper statistical books, scanned them to PDF and Excel, and then made them searchable and retrievable in a format to be used in research.

Let's get to some code:

Today's post covers only Lucene, and only the bare minimum needed to make an index.

If you look at my source, there are a lot of comments, so I am just going to list the classes and then pull out some of the highlights.

Main.java:

The executable class I run to index all the contents. It does not take any parameters, as the path to the JSON file and the location to place the index are hard-coded in the application. The flow is basically:

  1. Set up some auxiliary classes we are going to use
  2. Try to open the directory where we want to write the Lucene index
  3. Open the JSON file
  4. Read each line of the JSON file
  5. Map the contents of the line to the Park class using the Jackson ObjectMapper
  6. Write the Park to the Lucene index
  7. Close the index

Park.java:

This is just a POJO that is used for mapping the Parks out of the JSON file. Notice the Jackson annotation on the name field. Jackson expects the names in the POJO to match the names in the JSON, including capitalization. Since Name is capitalized in the JSON, it does not match the Java convention that variable names start with a lowercase letter. Therefore we add the annotation to say, "Yeah, we really mean the name in the JSON is uppercased".
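As a rough sketch of that mapping, here is what such a POJO might look like. This is not the Park class from the repo: it assumes Jackson 2.x on the classpath (the com.fasterxml.jackson packages), it assumes the annotation in question is @JsonProperty, and the class name ParkSketch, the "pos" key, and the sample line are made up for illustration.

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical stand-in for Park.java; ignores any JSON keys it does not map.
@JsonIgnoreProperties(ignoreUnknown = true)
class ParkSketch {
    // The JSON key is "Name" (capitalized), while the Java field is "name".
    @JsonProperty("Name")
    private String name;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample line in the one-JSON-object-per-line layout described above.
        String line = "{\"Name\": \"Acadia National Park\", \"pos\": [-68.21, 44.35]}";
        ParkSketch park = new ObjectMapper().readValue(line, ParkSketch.class);
        System.out.println(park.getName());
    }
}
```

Without the annotation, Jackson would look for a lowercase "name" key and the capitalized "Name" value would not be mapped onto the field.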

FileOpener.java:

I like clean separation of concerns in my classes, so I wrote a small class that is responsible for opening the JSON file and then passing a reader back to the calling class. To make this cleaner, I think I would actually read each line of the file into an ArrayList and pass that back to the calling class. That way Main.java would have no file handling classes in it. Oh well, that is what version 2.0 is for, right ;).
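The "version 2.0" idea could look something like the sketch below. It is not code from the repo: the class name LineFileReader and the sample lines are made up, and it assumes the file is UTF-8 with one JSON object per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical replacement for FileOpener: returns the lines instead of a reader. */
class LineFileReader {

    /** Reads every non-blank line of the file into a list, closing the file when done. */
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write a small made-up sample file so the sketch runs on its own.
        Path sample = Files.createTempFile("parks", ".json");
        Files.write(sample,
                Arrays.asList("{\"Name\": \"Park One\"}", "", "{\"Name\": \"Park Two\"}"),
                StandardCharsets.UTF_8);
        for (String line : readLines(sample)) {
            System.out.println(line);
        }
        Files.delete(sample);
    }
}
```

The trade-off is that the whole file is held in memory at once, which is fine for a small data file but would need rethinking for very large inputs.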

LuceneIndexer.java:

Finally we cover the piece that does the Lucene work. The openIndex method opens the directory so a Lucene index can be written into it. We set the analyzer, which determines how to parse and index any indexable fields. We also set the open mode to OpenMode.CREATE, which overwrites any index contents that are already there. Finally we set the class instance of the writer equal to the one we created.

The addPark method takes in a Park POJO and creates a document. It uses TextField to create an indexable field that contains the name, and we also store the name so we can retrieve it later. Then we use StoredField on the String representation of the position array, so that field is stored but not searchable. Finally we add the document to the index.

When we are all done with the index, we use the finish method to commit our changes to the index and then release the lock and close the file handle to the index.
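Put together, the three methods might look roughly like the sketch below. It is not the LuceneIndexer class from the repo: it assumes the Lucene 4.4 core and analyzers-common jars are on the classpath, it picks StandardAnalyzer as an assumption since the analyzer is not named here, the field names "name" and "position" are guesses, and addPark takes the name and position directly rather than a Park object to keep the example self-contained.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/** Hypothetical stand-in for LuceneIndexer.java, written against the Lucene 4.4 API. */
class LuceneIndexerSketch {
    private IndexWriter writer;

    /** Opens the directory for writing and replaces any index already there. */
    void openIndex(File indexDir) throws IOException {
        Directory dir = FSDirectory.open(indexDir);
        // Assumption: StandardAnalyzer; the analyzer decides how indexed text is tokenized.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, analyzer);
        config.setOpenMode(OpenMode.CREATE); // overwrite existing contents
        writer = new IndexWriter(dir, config);
    }

    /** Adds one park: the name is indexed and stored, the position is stored only. */
    void addPark(String name, double[] position) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new StoredField("position", Arrays.toString(position)));
        writer.addDocument(doc);
    }

    /** Commits pending changes, then closes the writer, which releases the write lock. */
    void finish() throws IOException {
        writer.commit();
        writer.close();
    }

    public static void main(String[] args) throws IOException {
        LuceneIndexerSketch indexer = new LuceneIndexerSketch();
        indexer.openIndex(new File("indexDir"));
        indexer.addPark("Acadia National Park", new double[] {-68.21, 44.35}); // made-up sample
        indexer.finish();
    }
}
```

Note that Version.LUCENE_44 and the File-based FSDirectory.open are specific to the 4.x line, so this sketch would need changes to compile against later Lucene releases.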

Viewing Results

At the end of this process, you have a directory with a Lucene index inside it. If you cloned the git repository for Luke from above, you can now fire up Luke. It turns out that the version you need is lukeall-4.4.0.jar, since this also contains all the dependencies to run Luke. I also found that I could not get it to open any directory other than the one it was housed in, which is why you will see it in the indexDir of the git repository for my code.

To run it just type:

    java -jar lukeall-4.4.0.jar

Then just press the OK button on the opening screen and you will see the contents of your index.

Screenshot of Luke, the Lucene index viewer

In the picture, I marked where you can see the fields, see some of the top indexed terms, switch tabs and look at each document, or even try out searches using Lucene search syntax.

The index I built is included in the git repo, so if you clone it, you can start playing with the index right away.

What's Next?

Well, I hope this short introduction helped you to understand Lucene. I will be giving a talk on using Lucene for spatial applications at FOSS4G 2013 and hope to do a follow-up blog post on how to use it. For now, you should try indexing some of your own content and then look for great applications to build off of it. Happy indexing...
