Getting and preparing data for your spatial application
Greetings shifters, I am going to start another blog series on working with geospatial applications. When I talk to users who want to get started in geospatial development one of the first questions I get is
"Where do I get some data?"
Today's blog post has nothing to do with OpenShift and everything to do with:
- Explaining some basic geospatial data file formats
- Explaining some concepts you need to understand with spatial data
This will be a long post since there is a lot of ground to cover, but it is required reading for the rest of the series. If you don't feel like reading it all, here are the concepts you will need to learn:
- Projections and Datums including EPSG codes
- GDAL and OGR
- And if you like Python then Fiona (an OGR binding) and Shapely
With that said, let's get to it!
Let's go get some data
Where to get data
For this series we are going to focus on downloading or creating geospatial data and then hosting it back out to the world in several different ways. If all you want to do is display some data you should probably look to consume a public API, such as Google Maps or FourSquare and use that instead.
Our goal is to see all the different ways we display some data for Boulder Creek, California. We want to do everything from a portal to custom applications.
There are a myriad of sites on the internet where you can get geospatial data. In the United States, because of our laws about government data sharing, you can usually find information for states, counties, and many cities. Outside the U.S., for local data, the situation is trickier. The magic phrase you want to put into google is " download gis data" or " download shapefiles". This should give you some idea if data is available for your area.
National Level data at the U.S. can be found at government data sites but can be tricky to find. You are going to have to use the search phrase from above but instead of using place of interest put the data you are interested in obtaining. There is a geospatial portal for the U.S. Federal government, though I find it a bit difficult to find what I want.
There are also several other places you can look for international and local data. One very famous one, and which I am a contributor to, is OpenStreetMap - think of it as a Wikipedia of map data. It started in Europe, where the government doesn't traditionally share data, as a way to create a publicly reusable roads database. It has grown in to much more than that with the storage of varied information and over a million registered contributors.
Some other places include:
For this blog series we are going to work with an area in Santa Cruz County around a town called Boulder Creek. This area is near my house but is also small enough that it won't get much attention from the County to create a portal for just this area, the way the City of Santa Cruz has it's own site.
I went to the Santa Cruz County government site to the download page for their geospatial data:
I ended up downloading 115 different shapefiles (coverages), with all sorts of different information. As you will typically find, most of these shapefiles do not come with metadata to explain what is in them, such as when they were created, or who the contact person is for the data. In this case we will have to pretend we know what each of the shapefiles represent.
What is this shapefile of which you speak - understanding shapefiles
The most widely used format to transfer spatial data between computers (and also to do analysis as well) is called a Shapefile. A "shapefile" actually consists of a minimum of at least 3 files. All the files will have the same name except for the postfix. At a minimum you need the *.shp (the geometries), *.dbf (the tabular data), and the *.shx (links the two files together) for a shapefile to be whole. There can by many other extensions included, with the most important being a *.prj which contains the projection information (covered below).
The shapefile format is an open standard created by ESRI and released in approximately 1996 with Arc View 2.0. It was designed to be cross-platform (even on the shiny new Windows platform) and to be fast for reading, writing, and rendering. It was quickly adopted as a geospatial format because of it's easy of transfer AND it's open specification. Almost all geospatial (GIS) software can read and write shapefiles.
It was a great file format when it was released almost 20 years ago, but there are some serious limitations.
- Since the tabular file is DBase format you don't have Unicode support or column names with more than 8 characters
- It explicitly does not keep track of topology so it is possible to store invalid geometries in the file
- And despite the spec saying there should only be one geometry type per file, it is common to come across files with two different geometry types for spatial features.
When is a Polygon not a Polygon
Point 3 above may leave you a bit confused, so I will try to explain a bit more. In a shapefile, and GIS in general, you have 3 basic vector shapes that you learned about in primary school geometry: point, line, and polygon. It turns out though that you can also have Multipoint, Multiline, and Multipolygon. The easiest example to think about is a country with some islands on a map of the world. The islands and the mainland are all part of the country and they would share the same country level attributes like country name, year established, or leader. But to accurately represent them you need to use multiple polygons, hence the multipolygon.
The problem arises when software doesn't follow the specification. The OGR library, which we are going to use today, follows the specification and prevents you from writing polygons and multi-polygons to the same shapefile.
Unfortunately not all software is so well behaved and you may get data that has polygons and multipolygons in the same shapefile even though the shapefile says it is of type polygon. If you try to read this data using OGR or other packages you will have no problems, but you can't use metadata from the shapefile to determine the geometry output file. In our code we are just going to coerce all polygons and lines to multipolygons and multilines.
When we get to the blog post that talks about spatial databases I will cover some newer, and better, alternatives to the shapefile. Unfortunately at this time it is hard to find sites that use these newer formats.
Understanding Projections and Datums
Another concept you need to understand when working with spatial data come from an area of study called Geodesy - we need to dive into projections and datums. I am going to give a VERY brief introduction to these topics and leave it up to you to dig in deeper if and when you need it.
Long story short - if you go to plot your data and it is not showing up in the right locations on the map, it is probably a projection/datum problem
In slightly less short but still brief introduction:
Projections deal with how your take a 3D object (the Earth) and make it 2D (a map). When you do this process some truths on the globe have to be sacrificed to make a map, you will have to distort one or more of area, shape, directionality, or distance. The projection you pick to make your map depends on what you are trying to display on the map.
Datums handle how you take a rough irregularly shaped ellipsoid (the Earth) and create a mathematical formula that best approximates that ellipsoid for the whole planet or just for a particular area you are interested mapping. Depending on which model you use, you can actually end up plotting your coordinates in very different locations on a map.
The following diagram shows a local and global datum. The actual shape of the earth is represented by the blue line, notice how it is not smooth because there are trenches and mountains. The red line is a datum that tries to do a best fit for the entire globe. It can't do the ups and down of the mountains and trenches but it minimizes error across the entire earth. The green line represents a datum that is modeled to be locally accurate but poor for the entire earth. Notice how little error there is in the bottom right corner between the blue and green line, but how terrible it is for the top left of the globe. So the datums are smooth mathematical models that approximate the surface of the earth, either locally or globally.
image is borrowed from Cross and Higgins
Here is an example where, using lat and long (no projection) but just changing the datum will make for very different locations for the Texas state capitol dome. Notice how poorly the Japanese datum does in Texas.
These two factors, along with your units of distance, define your coordinate system. You need to understand coordinate systems because the data you download may be in a coordinate system that is not right for your mapping purposes. In that case you will need to reproject your data to a new coordinate system that works better for your display and analysis.
The European Petroleum Standards Group (EPSG) has come up with numbers for almost any projection you will deal with in normal everyday mapping. Given an EPSG number you know precisely which coordinate system you are dealing with and which one you want to go to. Fortunately, some great members of the geospatial community created a site SpatialReference.orgthat lists almost all the EPSG numbers along with other ways of representing those same coordinate systems (such as in well know text). Bookmark this site if you plan to work with geospatial data a lot.
If you write an application that gets GPS coordinates from your phone, browser that supports geolocation, or the defaults from a handheld GPS unit you will be getting your data as Lat, Long, and WGS-84 datum. The EPSG for the 2D version of this is EPSG:4326 and 3D is EPSG:4979. The Google and Bing Maps projection is EPSG:3857 and actually uses a spherical datum rather than an ellipsoid.
The data we downloaded all came in California State Plane ZONE III projection with a North American Datum 1983 datum and units of feet.
This will be very accurate for preserving area, distance, and direction in Bay Area of California. Notice that because California is so large there are actually 6 different zones needed to get accurate projections.
In the end we want our data to display on Google earth and in web mapping programming so we will project it all now to EPSG:4326 discussed above. If this was for precise mapping uses, such as a "call before you dig" application, we should leave it in CA State Plane. Almost all the software we will use in this series has the capability to reproject data on request.
I left out a whole class of data that is also very important in spatial applications, raster data. An example of raster data is satellite images or aerial photographs. I will cover this kind of data in a later blog post, deeper into the series.
That was a lot of information to cover but is what I consider the bare minimum you need to be able to work with spatial data that you will download from different sources. You have learned about the most common spatial data format and some of the ways that geographers have dealth with taking a 3D spheroid and converted it into 2D coordinates.
In our next blog post we will show how to do some desktop visualization and analyis of the Santa Cruz County data. We will also show some python and C libraries that can be used to manipulate spatial data. I promise, by the third series in this post we will actually be putting data on servers and using OpenShift.