Store Large Files in the Cloud With MongoDB GridFS

One of the nice features of MongoDB is its support for storing large files using GridFS. With GridFS, you can store files larger than 16 MB (the maximum size of a single document) in MongoDB. It stores large objects by splitting them into small chunks, usually 256 KB in size. Each chunk is then stored as a separate document in a chunks collection. Metadata about the file, including the filename and content type, is stored as a document in a files collection. If your files are smaller than the document limit, you can also use the BinData type supported by MongoDB.
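To make the chunking idea concrete, here is a minimal plain-Java sketch. It is not the MongoDB driver's implementation; the names ChunkSplitter and Chunk are hypothetical and exist only for illustration. It reads a stream one chunk at a time and numbers the chunks in order. A real GridFS client would write each chunk to the chunks collection as it goes instead of collecting them in a list.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: mimics how a file is split into numbered chunks. */
final class ChunkSplitter {

    /** The chunk size discussed in this post: 256 KB. */
    static final int DEFAULT_CHUNK_SIZE = 256 * 1024;

    /** One chunk: its sequence number and its bytes. */
    static final class Chunk {
        final int n;
        final byte[] data;

        Chunk(int n, byte[] data) {
            this.n = n;
            this.data = data;
        }
    }

    /** Reads the stream chunk by chunk; only the last chunk may be shorter. */
    static List<Chunk> split(InputStream in, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Chunk> chunks = new ArrayList<Chunk>();
        byte[] buffer = new byte[chunkSize];
        int n = 0;
        try {
            while (true) {
                // Fill the buffer completely unless the stream ends first.
                int filled = 0;
                while (filled < chunkSize) {
                    int read = in.read(buffer, filled, chunkSize - filled);
                    if (read == -1) {
                        break;
                    }
                    filled += read;
                }
                if (filled == 0) {
                    break; // nothing left to store
                }
                chunks.add(new Chunk(n++, Arrays.copyOf(buffer, filled)));
                if (filled < chunkSize) {
                    break; // a short chunk means the stream has ended
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] file = new byte[600 * 1024]; // a 600 KB stand-in for an uploaded file
        for (Chunk c : split(new ByteArrayInputStream(file), DEFAULT_CHUNK_SIZE)) {
            System.out.println("chunk n=" + c.n + " bytes=" + c.data.length);
        }
    }
}
```

With a 600 KB input, the sketch yields two full 256 KB chunks followed by one shorter chunk holding the remainder.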

GridFS has a lot of advantages, and you should give it a try if you need to store big files such as images or videos. Some of the advantages of using GridFS are:

  1. You can add metadata to the objects and run queries against these attributes.
  2. You can replicate your binary data to get backups, failover, and read scalability.
  3. You can shard your binary data to achieve write scalability.

In this blog post, I will show how you can use GridFS to store user-uploaded files in MongoDB running on OpenShift. We will be using the Spring MongoDB GridFS support in our Java application.

Step 1: Sign up for an OpenShift account

If you don’t already have an OpenShift account, head on over to the website and sign up. It is completely free, and Red Hat gives every user three free Gears on which to run applications. At the time of this writing, the combined resources allocated for each user are 1.5 GB of memory and 3 GB of disk space.

Step 2: Install the client tools on your machine

The OpenShift client tools are written in Ruby, a very popular programming language. On OS X 10.6 or later and on most Linux distributions, Ruby is installed by default, so installing the client tools is a snap. Simply run the following command in your terminal:

sudo gem install rhc

Step 3: Setting up OpenShift

The rhc client tool makes it very easy to set up your OpenShift instance with SSH keys, git, and your application namespace. The namespace is a unique name per user that becomes part of your application URL. For example, if your namespace is cix and the application name is gridfs, then the URL of the application will be https://gridfs-cix.rhcloud.com/. The command is shown below.

rhc setup -l <openshift_login_email>
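The URL pattern described above can be sketched in a few lines of Java. The class and method names here (OpenShiftUrl, forApp) are hypothetical helpers for illustration, not part of any OpenShift API; the pattern itself is the one given in the example.

```java
/** Illustrative helper: builds the public URL from an application name and a namespace. */
final class OpenShiftUrl {

    /** Returns https://<appName>-<namespace>.rhcloud.com/ as described in the text. */
    static String forApp(String appName, String namespace) {
        if (appName == null || appName.isEmpty()
                || namespace == null || namespace.isEmpty()) {
            throw new IllegalArgumentException("appName and namespace are required");
        }
        return "https://" + appName + "-" + namespace + ".rhcloud.com/";
    }

    public static void main(String[] args) {
        // Uses the application name and namespace from this post.
        System.out.println(forApp("gridfs", "cix"));
    }
}
```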

Step 4: Creating JBoss AS7 MongoDB application

With the mandatory setup done, let's create an application named "gridfs". To create the Java application, type the command shown below.

rhc app create gridfs jbossas-7 mongodb-2.2

This will create an application container for us, called a gear, and set up all of the required SELinux policies and cgroup configuration. The gear will have JBoss AS7 and MongoDB installed. OpenShift will also set up a private git repository for you and clone the repository to your local system. Finally, OpenShift will propagate the DNS entry to the outside world.

Step 5: Using the mongofiles utility in the cloud

MongoDB comes with a utility called mongofiles, which you can find in the bin folder of the MongoDB installation. According to the MongoDB documentation:

The mongofiles utility makes it possible to manipulate files stored in your MongoDB instance in GridFS objects from the command line. It is particularly useful as it provides an interface between objects stored in your file system and GridFS.

To try out the mongofiles utility, let's SSH into the OpenShift gear. You can read more about how to access an application gear at https://www.openshift.com/faq/can-i-access-my-applications-gear.

$ cd gridfs
$ rhc app ssh

Go to the directory that $OPENSHIFT_DATA_DIR points to and create a new test file. The data directory is writable, so you can download or create files there, as shown below.

[gridfs-cix.rhcloud.com data]\>  cd $OPENSHIFT_DATA_DIR 
[gridfs-cix.rhcloud.com data]\>  echo "hello world" > test.txt

Next, run the mongofiles put command shown below to insert the test.txt file into MongoDB.

[gridfs-cix.rhcloud.com data]\>  mongofiles -d demo put test.txt -u $OPENSHIFT_MONGODB_DB_USERNAME -p $OPENSHIFT_MONGODB_DB_PASSWORD -h $OPENSHIFT_MONGODB_DB_HOST --port $OPENSHIFT_MONGODB_DB_PORT

This produces output similar to the following.

connected to: 127.4.151.1:27017
added file: { _id: ObjectId('5139830339cbe83ddb502323'), filename: "test.txt", chunkSize: 262144, uploadDate: new Date(1362723587698), md5: "6f5902ac237024bdd0c176cb93063dc4", length: 12 }
done!
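The length and md5 values in this output can be checked locally. The sketch below is a minimal plain-Java example (the class name Md5Check is hypothetical) that hashes the same bytes the echo command wrote: the text "hello world" followed by a newline, which is why the length is 12 rather than 11.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: recomputes the length and md5 reported for test.txt. */
final class Md5Check {

    /** Returns the MD5 digest of the given bytes as a lowercase hex string. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // echo appends a trailing newline, so the file holds 12 bytes.
        byte[] content = "hello world\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("length: " + content.length);
        System.out.println("md5: " + md5Hex(content));
    }
}
```

If the printed digest matches the md5 field in the mongofiles output, the file stored in GridFS has the same content as the local file.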

At the beginning of this post, however, I said that GridFS is meant for storing large files, so let's download one. I am using the Sintel animated movie, which is about 101 MB in size. You can download this file into $OPENSHIFT_DATA_DIR using the wget command shown below.

[gridfs-cix.rhcloud.com data]\>  wget http://peach.themazzone.com/durian/movies/sintel-1024-stereo.mp4

Let's now run the mongofiles put command to push the sintel movie to MongoDB.

[gridfs-cix.rhcloud.com data]\>  mongofiles -d demo put sintel-1024-stereo.mp4 -u $OPENSHIFT_MONGODB_DB_USERNAME -p $OPENSHIFT_MONGODB_DB_PASSWORD -h $OPENSHIFT_MONGODB_DB_HOST --port $OPENSHIFT_MONGODB_DB_PORT
 
------ output -------
 
connected to: 127.4.151.1:27017
added file: { _id: ObjectId('513984329dd40888612087dc'), filename: "sintel-1024-stereo.mp4", chunkSize: 262144, uploadDate: new Date(1362723893311), md5: "f3cc2a1e97d271a00663faafcb138c97", length: 101257792 }
done!
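Given the length and chunkSize values in the output, the number of chunk documents needed for a file is simple arithmetic. The following sketch (the class ChunkMath and its methods are hypothetical, written only for this illustration) works it out for the two files uploaded so far.

```java
/** Illustrative only: chunk arithmetic for a file of a given length. */
final class ChunkMath {

    /** Number of chunks needed: length divided by chunkSize, rounded up. */
    static long chunkCount(long length, long chunkSize) {
        if (length < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and chunkSize > 0");
        }
        return (length + chunkSize - 1) / chunkSize;
    }

    /** Size of the final chunk, which holds whatever is left over. */
    static long lastChunkSize(long length, long chunkSize) {
        if (length < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and chunkSize > 0");
        }
        if (length == 0) {
            return 0;
        }
        long remainder = length % chunkSize;
        return remainder == 0 ? chunkSize : remainder;
    }

    public static void main(String[] args) {
        long chunkSize = 262144L; // chunkSize from the mongofiles output
        long[] lengths = {12L, 101257792L}; // test.txt and the Sintel movie
        for (long length : lengths) {
            System.out.println(length + " bytes -> " + chunkCount(length, chunkSize)
                    + " chunk(s), last chunk " + lastChunkSize(length, chunkSize) + " bytes");
        }
    }
}
```

By this arithmetic, test.txt fits in a single chunk, while the movie needs 387 chunks, the last of which holds 70,208 bytes.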

Finally, you can list the files using the mongofiles utility, as shown below.

[gridfs-cix.rhcloud.com data]\> mongofiles -d demo list -u $OPENSHIFT_MONGODB_DB_USERNAME -p $OPENSHIFT_MONGODB_DB_PASSWORD -h $OPENSHIFT_MONGODB_DB_HOST --port $OPENSHIFT_MONGODB_DB_PORT
 
------ output -------
 
connected to: 127.4.151.1:27017
test.txt    12
sintel-1024-stereo.mp4  101257792
[gridfs-cix.rhcloud.com data]\> 

Step 6: Pulling the code

I have created a demo Spring MVC MongoDB application that uploads data to MongoDB using GridFS. To pull the code from my GitHub repository, execute the git commands shown below.

$ git rm -rf src/ pom.xml
$ git commit -am "deleted template files"
$ git remote add upstream -m master git://github.com/shekhargulati/gridfs-openshift-demo.git
$ git pull -s recursive -X theirs upstream master

Step 7: Pushing the code

After pulling the code, you can push it to OpenShift using the push command shown below.

git push

Step 8: Play with the application

Finally, you can upload your files using the demo application at http://gridfs-cix.rhcloud.com/upload. Upload a file, then log in to the MongoDB instance to see the stored files.

Step 9: Code Walkthrough - Under the hood

Let's now look at the code. There are only a couple of classes in the code: MongoDBConfig and UploadController. The config class declares the Spring beans, and the controller is a simple Spring MVC upload controller. Let's look at the two classes one by one.

MongoDBConfig is a Spring configuration class that contains the definitions for the mongoDbFactory, gridFsTemplate, and mongoConverter beans. GridFsTemplate is a Spring MongoDB template class that implements operations (store, list, delete, etc.) on GridFS. The configuration class is shown below.

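import com.mongodb.Mongo;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.authentication.UserCredentials;
import org.springframework.data.mongodb.MongoDbFactory;
import org.springframework.data.mongodb.core.SimpleMongoDbFactory;
import org.springframework.data.mongodb.core.convert.MappingMongoConverter;
import org.springframework.data.mongodb.core.convert.MongoConverter;
import org.springframework.data.mongodb.core.mapping.MongoMappingContext;
import org.springframework.data.mongodb.gridfs.GridFsTemplate;

@Configuration
public class MongoDBConfig {

    // Connects to MongoDB using the environment variables that OpenShift sets on the gear.
    @Bean
    public MongoDbFactory mongoDbFactory() throws Exception {
        String openshiftMongoDbHost = System.getenv("OPENSHIFT_MONGODB_DB_HOST");
        int openshiftMongoDbPort = Integer.parseInt(System
                .getenv("OPENSHIFT_MONGODB_DB_PORT"));
        String username = System.getenv("OPENSHIFT_MONGODB_DB_USERNAME");
        String password = System.getenv("OPENSHIFT_MONGODB_DB_PASSWORD");
        Mongo mongo = new Mongo(openshiftMongoDbHost, openshiftMongoDbPort);
        UserCredentials userCredentials = new UserCredentials(username,
                password);
        // The database is named after the application.
        String databaseName = System.getenv("OPENSHIFT_APP_NAME");
        return new SimpleMongoDbFactory(mongo, databaseName, userCredentials);
    }

    // Template used by the controller to store files in GridFS.
    @Bean
    public GridFsTemplate gridFsTemplate() throws Exception {
        MongoDbFactory dbFactory = mongoDbFactory();
        MongoConverter converter = mongoConverter();
        return new GridFsTemplate(dbFactory, converter);
    }

    // Converter that GridFsTemplate uses to map objects to MongoDB documents.
    @Bean
    public MongoConverter mongoConverter() throws Exception {
        MongoMappingContext mappingContext = new MongoMappingContext();
        return new MappingMongoConverter(mongoDbFactory(), mappingContext);
    }
}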
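The configuration above reads the connection details straight from the environment. If a variable such as OPENSHIFT_MONGODB_DB_PORT is unset, Integer.parseInt fails with a NumberFormatException that does not name the missing variable. The sketch below shows one possible way to validate the settings up front. It is not part of the demo application; the MongoEnvSettings class is hypothetical, and only the environment variable names come from the code above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: reads and validates the OpenShift MongoDB settings. */
final class MongoEnvSettings {
    final String host;
    final int port;
    final String username;
    final String password;
    final String database;

    private MongoEnvSettings(String host, int port, String username,
                             String password, String database) {
        this.host = host;
        this.port = port;
        this.username = username;
        this.password = password;
        this.database = database;
    }

    /** Builds the settings from a map such as System.getenv(). */
    static MongoEnvSettings fromEnv(Map<String, String> env) {
        String portText = require(env, "OPENSHIFT_MONGODB_DB_PORT");
        int port;
        try {
            port = Integer.parseInt(portText.trim());
        } catch (NumberFormatException e) {
            throw new IllegalStateException(
                    "OPENSHIFT_MONGODB_DB_PORT is not a number: " + portText, e);
        }
        return new MongoEnvSettings(
                require(env, "OPENSHIFT_MONGODB_DB_HOST"),
                port,
                require(env, "OPENSHIFT_MONGODB_DB_USERNAME"),
                require(env, "OPENSHIFT_MONGODB_DB_PASSWORD"),
                require(env, "OPENSHIFT_APP_NAME"));
    }

    /** Fails fast with a message that names the missing variable. */
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        // Sample values for running outside OpenShift; the credentials are made up.
        Map<String, String> env = new HashMap<String, String>();
        env.put("OPENSHIFT_MONGODB_DB_HOST", "127.4.151.1");
        env.put("OPENSHIFT_MONGODB_DB_PORT", "27017");
        env.put("OPENSHIFT_MONGODB_DB_USERNAME", "admin");
        env.put("OPENSHIFT_MONGODB_DB_PASSWORD", "secret");
        env.put("OPENSHIFT_APP_NAME", "gridfs");
        MongoEnvSettings settings = fromEnv(env);
        System.out.println(settings.host + ":" + settings.port + "/" + settings.database);
    }
}
```

On a gear, passing System.getenv() to fromEnv would pick up the real values, and the resulting fields could be handed to the Mongo and SimpleMongoDbFactory constructors in the same way as in the configuration class.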

UploadController is a Spring MVC controller that exposes two operations: the first renders the upload form, and the second stores the uploaded item in GridFS. The code for UploadController is shown below.

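import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.mongodb.gridfs.GridFsTemplate;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.validation.BindingResult;
import org.springframework.validation.ObjectError;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;

// UploadItem is the form-backing class from the demo application.
@Controller
@RequestMapping(value = "/upload")
public class UploadController {

    @Autowired
    GridFsTemplate gridFsTemplate;

    // Renders the upload form with an empty form-backing object.
    @RequestMapping(method = RequestMethod.GET)
    public String getUploadForm(Model model) {
        model.addAttribute(new UploadItem());
        return "upload/uploadForm";
    }

    // Handles the form submission and stores the uploaded file in GridFS.
    @RequestMapping(method = RequestMethod.POST)
    public String create(UploadItem uploadItem, BindingResult result) throws Exception {
        if (result.hasErrors()) {
            for (ObjectError error : result.getAllErrors()) {
                System.err.println("Error: " + error.getCode() + " - "
                        + error.getDefaultMessage());
            }
            return "upload/uploadForm";
        }

        // Store the uploaded stream under the original filename.
        gridFsTemplate.store(uploadItem.getFileData().getInputStream(),
                uploadItem.getFileData().getOriginalFilename());

        return "upload/uploadSuccess";
    }
}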

That's it for this blog post. Start using OpenShift and build great applications using MongoDB.
