Using Open Source GIS tools for spatial data - QGIS, GDAL and Python

This is the second in my new spatial application and data blog series. Our goal in this series is to show all the different ways we can create applications and run standard servers with geospatial data. I chose an area around Boulder Creek, CA in Santa Cruz County to obtain data to be used in this series.

In the first blog in this series I looked at how to get spatial data, what format it would probably come in, and how to deal with it's projection. Now that we have the data we are going to visualize it a bit and then do some programmatic analysis and reprojecting.

Desktop GIS and visualizing your data

Now that we have all this great data let's talk about how to look at and process it. The typical GIS analyst will use a desktop application to get started with their data, do analysis, and make PDF or printed maps.This is the second in my new spatial application and data blog series. Our goal in this series is to show all the different ways we can create applications and run standard servers with geospatial data. I chose an area around Boulder Creek, CA in Santa Cruz County to obtain data to be used in this series.

In the first blog in this series I looked at how to get spatial data, what format it would probably come in, and how to deal with it's projection. Now that we have the data we are going to visualize it a bit and then do some programmatic analysis and reprojecting.

Desktop GIS and visualizing your data

Now that we have all this great data let's talk about how to look at and process it. The typical GIS analyst will use a desktop application to get started with their data, do analysis, and make PDF or printed maps. The best and most active cross-platform FOSS GIS desktop software is QGIS. With QGIS you can look at all the shapefiles we downloaded and do processing on them. We will use QGIS throughout this blog series.

Today I am only going to show the bare minimum needed to do our work. If you want to learn more there are plenty of good tutorials and documentation.

Once you start QGIS up, you can add shapefiles by clicking on the green map with a + symbol on top:

For example, here is what it looks like when I add the zipcodes shapefile for Santa Cruz County from the Administrative directory:

Zip codes in Santa Cruz County

How I picked the area of interest

I used QGIS to extract and save the three zip codes of interest - basically the zip codes around Boulder Creek. I did this by setting a definition on the Zip Codes shapefile. You bring up this dialog by right clicking on the Zip Code Layer, selecting properties, and then go the "General Properties" tab. In the subset field you can set the criteria. Here we choose the 3 zip codes that are in the mountains and cover Big Basin State park and the state route 9 corridor.

QGIS properties box

Finally we right click on the zip code again and chose "save as". This allows us to save the layer as a shapefile with only the features we have retained in the new shapefile.

I called the shapefile zipcodes again and saved it in a directory called "_clip". In our case to make this quicker, since all the polygons touched we unioned them into 1 large polygon and used that for the tests below.

You can merge them into one shape by chooing vector -> geoprocessing -> dissolve and dissolve on the field titled "shape"". The output shapefile in my case was called the_aoi (short for Area Of Interest).

We are now going to use this processed shapefile file to select only those features in the other 115 shapefiles that intersect with our AOI. Intersect in GIS means two features touch at some boundary or one feature is wholely contained within another. We will end up with just the data we want and none of the rest fo the County data.

spatial imagery of Santa Cruzy county

Reading shapefiles and doing spatial processing with Python.

What to install:

The binaries you need on your machine are:

  1. OGR/GDAL - This library reads and write between many different spatial formats, including Shapefiles
  2. Geos - This library allows us to carry out a wide range of spatial operations on geometries
  3. PROJ.4 - This library does all the mathematical work to convert between datums and projections

Then using PIP you need to install:

  1. Fiona - wraps the OGR libraries in a Pythonic way - pip install Fiona
  2. Shapely - provides Pythonic style bindings to GEOS - pip install Shapely
  3. PyProj - as you might guess this is a Python wrapper for PROJ.4 - pip install pyproj

Once all that is installed you can then checkout my code from github:

https://github.com/thesteve0/shapefileconverter

The basic flow for the code listed above is:

  1. There is a directory where we put the shapefile that has the geometries we want to use to select features in the other shapefiles.
  2. There is another directory where we have all the files that we want to select out where they intersect with our Area of Interest (AOI)
  3. There is a third directory where we will write out all the shapefiles which have been filtered and projected.
  4. The code loads the geometries to use for selecting
  5. Iterate through each of the other files and save only the features that touch or are contained within our zipcodes (intersect).
  6. Then project that feature to EPSG:4326 and write it to a shapeile it with the same name as the source file except it has a _proj on the end of the file name.

I am not going to walk through the code line by line here but I did put quite a few comments in the source code that should help guide you as to what the program is doing.

When we are done running this script we will have 115 new shapefiles which contain only those features that overlapped our zip codes of interest. The geometry for these new features will all have their coordinate in lattitude and longitude with a datum of WGS 84. By doing this process we were able to cut our total disk space of shapefiles from 324 Megs down to 33 Megs, which is a manageable size to work with for our blog series. We also have a good sampling of different feature layers, with everything from points for road drainages to building footprints for each structure in our area.

When I ran the idea of this blog post and code by some of my geospatially savy friends they also had suggestions for other ways to accomplish the same task. In the end I didn't use them because I wanted to spend more time programming in Python and I wanted to expose you to some of the programming libraries used to work with spatial data.

Other techniques would have been to use OGR at the command line with a BASH script, to clip using the two different shapefiles. Then in that same Bash script use PROJ.4 to reproject each of the files. Here is some discussion of using OGR at the command line and some dicussion of PROJ.4 at the command line.

We also could have loaded up all the shapefiles into QGIS and then used Python scripting in QGIS to carry out the same tasks. Here are some docs dealing with Python in QGIS.

A third option would have been to loaded all the data into PostGIS the most powerful FOSS GIS database available. Once the shapefiles were loaded we could have used SQL to do the intersections and projections and saved the new features into new tables. From there we could have exported shapefiles from PostGIS. I didn't use that technique here because I plan to do a blog post on loading our data into PostGIS and other spatial databases.

The main goal for this second blog post in the series was to show you some ways to visualize and programmatically manipulate spatial data. I have really just touched the tip of the iceberg here, leaving out Java and C libraries that can also do a lot of this work. As a matter of fact there is quite a robust Java community that deals with geospatial data, for example JTS, Geotools, and Sextante provide a very robust toolkit for working with spatial data.

In the next blog post we will start loading this data into spatial databases: PostGIS and MongoDB. With data projected, selected, and ready to go you can take some time and play around in QGIS or maybe do some more work with Shapely and the Python script I provided. We have only just begun to to use spatial data, the fun part is still coming!

Hey readers: I put an issue in github to figure out where to put all the data to host for this blog series. https://github.com/thesteve0/shapefileconverter/issues/1

You can also put a comment here if you don't want to put your comments in github. Thanks Steve

I'm trying to do Spatial Analysis using Hadoop on OpenStack and I also noticed recently a research team at University of Minnesota developed a module to support spatial data and index in Hadoop's. But I found it's still in the very beginning stage.

I have done something on loading shapefile into PostGIS and I'm curious how you would develop an demo application with PostGIS.

As you also mentioned MogoDB, I don't take a deep look into it and I'm not sure whether MogoDB's spatial implementation follows the standards from OGC and how many features it has implemented.

How can we handle large data with OpenShift?

Anyway, you start a very, very interesting topic and I like it. (I want to work on this very much, too.)

Thanks.

iJab Zhan

Hey iJab: My next blog post in this series will load our data into PostGIS. In the meantime, here are some sites to get you going with PostGIS

I highly reccomend the PostGIS tutorial site, the PostGIS in Action book, and The Boston GIS site. Remember, we are using PostGIS 1.5 with Postgresql 8.4.

I also have another blog post on how to get started with Postgis on OpenShift

https://www.openshift.com/blogs/time-for-a-spatial-power-up-openshift-postgis

Hi, Steven,

Thanks so much for your reply.

I have built a routing service with PostGIS and pgRouting, and I think it would be a good idea to demo a web-based GIS service after loading data into PostGIS. I'll try this on OpenShift.

Appreciate your excellent work.

iJab