Host Your Spatial Data in a PostGIS Database

Spatial data in space smoking a relaxing pipe

Okay Shifters, today we are going to move our spatial data out of shapefiles and into some more usable formats. Oh about the leading image - it is Data in Space, a little play on spatial data. Do you see what I did there? Pretty witty non?

This is the third in the spatial applications on a PaaS series I have been writing. The first post covered some of the basics of GIS/spatial data while the second post covered some Python code to process the data we were interested in using.

Now, we are going to convert our data to GeoJSON and also put it in into PostGIS, the spatial extension for PostgreSQL. We are making GeoJSON for use in D3, the excellent JavaScript library, since it can be imported and used without any further conversion.

Time to Remember (or learn) Some Shell Scripting

This time we are going to use OGR and command line PostGIS tool, shp2pgsql, to convert from our shapefiles into other formats. Since we have quite a few shapefiles to convert we write a small little bash script to convert them. Maybe some nice user will fork the repository and write a PowerShell or dos script for all the Windows users. Or if you really want to try and use bash on Windows, you can use MSYS for unix emulation

Take the code below and put it in a text file with a .sh extension and then run it at the command line inside the directory with all the shapefiles we created in blog 2:

#! /bin/bash
 
##This script takes our processed shapefiles and turns them into GeoJSON and SQL files to be imported into PostGIS
 
#this automatically does file listing of everything that ends in shp
for var in *shp
do
    #find "shp" in var and replace with "pg.sql" and then store it as a variable
    output_pg=${var//shp/pg.sql}
 
    #find "shp" in var and replace with ".json" and then store it as a variable
    output_gjson=${var//shp/json}
 
    #strip the file name of "_final.shp" and store it in a variable
    table_name=${var%_final.shp}
 
    #use OGR to make geojson and put the output into a different directory
    ogr2ogr -f "GeoJSON" ../gjson/$output_gjson $var
 
    #need the postgis client tools for this one to work
    #take the files and output the sql files to a different directory
    shp2pgsql -s 4326 -I -D $var  $table_name > ../psql/$output_pg 
 
    printf ":%s\n" $table_name 
 
done

If you stayed in bash then you can run either of the files (on *nix or Mac) one of two ways:

bash script.sh
 
--or--
 
chmod +x script.sh
./script.sh

The comments in the code (the lines beginning with #) should make the overall flow of the code apparent. Let's dig in on the two geospatial lines. First is ogr2ogr, which is the command line tool for OGR. The tool basically allows you to convert between all sorts of vector formats. In our case, the -f is followed by the format we want to export. Ogr2ogr is smart enough to know the input format based on the file extension. If you want to see the full list of supported export formats you can just type in:

ogr2ogr --long-usage

at the command prompt and see all the formats. We are writing the output into a directory titled gjson and we are naming the files based on the original name of the file.

The next command is shp2pgsql and is included with the PostGIS client tools and is specifically made for "importing" shapefiles into PostGIS. Basically the commands takes the shapefile and spits out the SQL equivalent for PostGIS, including the table structure. The -s flag is for the input EPSG code for the shapefiles projection ( a concept covered in the first blog post ). The -I specifies that the SQL should include building a spatial index on the geometry field. The geometry field will be created with a name of "geom" unless you specify something different. The -D tells the command to use PostgreSQL dump format rather than SQL insert statements, for a faster import and smaller file size. We send the output of the command to a new directory called psql where the table and file name is based off the the original file name.

We could have use ogr2ogr with the PGDump driver, but:

  1. I like showing a diversity of tools so you understand more of your options
  2. You should install the PostgreSQL and PostGIS client tools on your machine if you are going to do application development with this database
  3. The PGDump driver was not installed with my ogr2ogr installation.

On to the database and OpenShift

We are not going to do anything with the GeoJSON files today so you can just put them somewhere safe (you can always take them out on rainy days and play with them). To use the the PostGIS files on OpenShift we need to create an application that has Postgresql. We are also going to make this application with Tomcat, the Java application server, as our main web hosting application. We do this because the next post in this series will be to add GeoServer to our stack and this runs in a Java Application Server.

Now when creating this application we have two options - we can either make it a scalable application which means Postgresql and Tomcat will each go on their own gear (here is an explanation of gear and cartridge and how they are related). You will need to use this option if you want other applications on OpenShift to use your PostGIS database. For example, if you follow these instruction but then also want to create a Python web application to talk to the same PostGIS database, then you should make a scaleable application now. The trade-off is that, by making a scalable application, you will use 2 free gears (one for Tomcat and HAProxy and then a separate one for Postgresql). If you want to get more gears (or other resources) you can use this form to make a request.

The other option is to make an embedded application, where all the cartridges are on one gear. This will only use one gear for your complete application but all the application server and database memory and disk usage will count against your 512 megs of RAM and your 1 gig of disk space. For this series, I advise you to request more resource now because I plan to show GeoDjango and CKAN running on OpenShift. Both of these require Python as the application type. I would also request medium gears (1 gig of memory) - so that we can give more RAM to Postgresql and Tomcat. Again, use this form to request more resources.

Given that background, here are the different ways to create the application using the command line tools. I am also assuming you have signed up for an account, if not please do so pronto. One pre step, application creation clones a git reposity to your local machine so make sure you execute these command where you want the repository to be created. I am going to use the final example - scalable application with medium gears

#embeded with small gears
rhc app create spatialapp Tomcat7 postgresql

#embedded with medium gears
rhc app create spatialapp Tomcat7 postgresql -g medium

#scalable with small gears
rhc app create spatialapp Tomcat7 postgresql -s

#scalable with medium gears
rhc app create spatialapp Tomcat7 postgresql -s -g medium

Once you execute this command you will wait while we spin up all the infrastructure for an app with Tomcat and Postgresql, create a git repository on your gear, start the application with the default template, register and propagate the DNS entry for your app, and then clone the git repository to your local machine.

Once the creation is done there will be a lot of information about your app printed to the console. You are free to store them somewhere, but it is ok if you don't. If you ever want to see all the information again you can just execute the following command:

rhc app show spatialapp

#or if you are inside the spatialapp dir, we can infer the app from the git repo

rhc app show

In the output from this command, at the end you will get the web address for your application (which should look like spatialapp-.rhcloud.com) and an SSH connection string. Go ahead and look at your app in a browser if you want, there won't be anything special there but it is live to the world.

Time to add PostGIS to Postgresql

First let's get the sql files up on our Postgresql gear. We are going to use scp to carry out our task but you could also use SFTP.

We are basically going to copy all the sql files from our local machine to either the temp directory or our persistent data directory. You could also add the data to your local git repository and push that up to the server but it is trickier to remove them once we are done with them and we don't actually need to version them AND it is bad git form to do this as well.

Navigate to the sql file directory ( the location referenced as ../psql/$output_pg from above). Inside that directory you can either do:

scp *.sql 513fff55e0b8cd808600XXX@spatialapp-spdemo.rhcloud.com:app-root/data/

or 

scp *.sql  513fff55e0b8cd808600XXX@spatialapp-spdemo.rhcloud.com:/tmp/

We now have our data up on the server and we can 1) SSH in 2) put PostGIS into our database and then 3) import all our data into PostGIS.

I actually wrote a blog post before about PostGIS so you can go there to read more background about PostGIS and how to set it up on OpenShift. The basic idea of PostGIS is to add first class spatial functionality to Postgresql. I would argue it is actually is the best spatial datastore available today.

Next take the SSH connection string and use that with SSH. In my case the command looks like this:

ssh 513fff55e0b8cd808600XXX@spatialapp-spdemo.rhcloud.com

You will notice you never entered a password to ssh into your application and that is because we use ssh keys. Nobody can get ssh access to your application unless they get your SSH private key or your add their public SSH key to the application.

Setting PostGIS up is easy now that you are SSH'ed into your application. By default OpenShift creates a database with the same name as your application.

Just three commands turns your database into a spatial database:

#add the PL/pgSQL language to our DB - http://www.postgresql.org/docs/8.4/static/plpgsql-overview.html

>psql spatialapp -c "create language plpgsql;"

#next add the PostGIS spatial functionality to our database

> psql -d spatialapp -f /usr/share/pgsql/contrib/postgis-64.sql

#finally add all the information for all the different projects that PostGIS understands

> psql -d spatialapp -f /usr/share/pgsql/contrib/spatial_ref_sys.sql

That's it - the spatialapp database is now ready to handle spatial data.

Let's go ahead and import our data into our database.

While you are SSH'ed in go ahead and change over to the directory where you scp'ed the files

cd /tmp

or 

cd ~/app-root/data 

and now we are going to make another little bash script (unless you want to type the same command about 100 times).

Here is where our blog post faces a dilemma. You can either edit this file directly on your OpenShift gear using your favorite Linux text editor ( I like Nano but go ahead with VI or Emacs). But I realize for a lot of my audience may not feel comfortable with this. In that case, my advice is to edit this file on your local machine using your favorite text editor and then scp it up to the gear. The scp command would be the same as the one above except instead of

scp *.sql …

use 

scp _psqlimporter.sh  …

Right, back to the file. The contents of the file will look very similar to our other file, basically we are going to loop through all the sql files and import them into the spatialapp database.

 #! /bin/bash
 
    ##This script needs to be put on our OpenShift gear where we scp'ed up the SQL files.
    ## When it is done we can delete the SQL files.
 
    #we set this so we don't have to enter the password for the command 
    PGPASSWORD=$OPENSHIFT_POSTGRESQL_DB_PASSWORD
 
    export PGPASSWORD
 
    #grab all the files ending in sql
    for var in *sql
    do
        #this just pipes the input of the file into the psql command line, executes it's content, and exits.
        #we could have also used -f
        psql -h $OPENSHIFT_POSTGRESQL_DB_GEAR_DNS -p $OPENSHIFT_POSTGRESQL_DB_PORT -U admin $OPENSHIFT_APP_NAME < $var   
    done

If you chose to upload the file please SSH back into the application after you are done uploading. Then change to either /tmp or app-root/data (whichever you picked). If you edited right on the machine then please save and exit. Now in the directory where you wrote the file please do the following command line command:

chmod 770 _psqlimporter.sh 

This sets the file permissions to full permissions for the owner and group of the file and we will need execute permissions to make the script run. Now at the command line execute the following command:

./_psqlimporter.sh

And that's it - you now 99 tables and them being spatial is not a problem (my apologies to Jay-Z for mangling his lyrics).

Let's do some spatial queries

Here are some queries to show the power of what you can start to do with a real spatial database

1) Return all the parcels which are in the R-1 zoning areas

-- Zoning codes http://www.sccoplanning.com/LinkClick.aspx?fileticket=J2OTGOkwoJY%3d&tabid=1427
SELECT ap.apn FROM assessorsparcels AS ap, zoning AS z WHERE z.zoning = 'R-1' AND ST_Within(ap.geom, z.geom);

2) Return a center point for each house insuring the point is contained in the house footprint

--We will also turn the binary representation of the geometry into text
select gid, st_asewkt(st_PointOnSurface(geom)) from sccobldgfoot;

3) Select all the parcels whose property line is within 0.005 degrees of a certain parcel's property line

select ap.apn from assessorsparcels as ap where ST_Intersects(ap.geom, (select st_buffer(geom, 0.005) from assessorsparcels as ap  where ap.apn = '064-351-27' ));

To do more of this spatial wizardry I highly reccomend the PostGIS tutorial site, the PostGIS in Action book, and The Boston GIS site. Remember, we are using PostGIS 1.5 with Postgresql 8.4.

Conclusion

So with today's post you got to learn:

1) A little bit of bash scripting

2) how to convert data from shapefiles into other formats

3) How to bulk load spatial data into your OpenShift PostGIS instance

4) Some awesome queries for higher end spatial analysis

And now that we have a comprehensive spatial database with a lot of data in it, our next posts can start to focus on how you work with this data. For the next post we will bring up GeoServer, a full featured spatial application server. If you want to prep for that article go ahead and install Google Earth and QGIS (if you haven't already from the second blog article).

Remember, if you don't yet have an OpenShift account and the command line tools installed you should do that now. All the rest of the blog posts in this series will be using OpenShift (and this dataset) extensively.

As always, please leave feedback and questions and have fun playing with your spatial data.

What's Next?