Geocoding Using Google Refine and Data Science Toolkit

029th Aug 2011Matt's Thoughts, Code Snippets, , , , , , ,

In this tutorial we are going to demonstrate how you can use Google Refine and the Data Science Toolkit street2coordinates API to geocode a CSV (Excel) file.

If you are not already familiar with Google Refine, it is an open source data clean-up tool released by Google. The app runs a tiny web server on your local machine that then allows you to load a CSV file and manipulate the results. It’s really great at cleaning up data and working with JSON API services. Watch the tutorial videos for a good overview on it’s features.

In today’s demo we are assuming that you have a list of addresses in a CSV file that you want to get the latitude and longitude for; perhaps for a mapping project.

First install Google Refine on your local machine.

Hopefully you have a CSV file that contains the full address in a single column. If you don’t use can use the following excel formula to create one:


= FIELD_ADDRESS & " " & FIELD_CITY & ", " & FIELD_STATE & " " & FIELD_ZIP

Just replace the FIELD_NAME with your corresponding excel columns

Next you will want to run Refine and load your CSV file. This will create a new project and show you the first few rows of your data.

Find the column with your full address and select the drop down arrow in the header

Select Edit Column > Add Column by Fetching URL

To geocode the file I used the Data Science Toolkit, which is fast, open source and doesn’t have any API limits. I had to create a helper script that standardizes the returned function. You can download this script here and run it on your own web server.

In the new window Google Refine opened give the new column a name… this is a temporary data store so the name doesn’t matter. Change the rate limit to 0 instead of 5000. In the query box enter the following:


'http://www.yourserver.com/index.php?address=' + escape(value,'url')

When you click ok the system will then run through all of the records in your CSV file and temporarily store the JSON feed result from the DSTK service. This may take a while depending on the size of your list.

Next to extract the latitude and longitude find the new temporary column and select the arrow next to the header.

Select Edit Column > Add Column Based on this Column

Call the column longitude and enter the following in the formula field:


value.parseJson().result.longitude

This parses the JSON feed and creates a new column with the longitude value.

Repeat this last process again to get latitude.

This can be used in many other ways. Take a look at the sample JSON return to see what else you can merge.


{
-1600 pennslyvania Ave washington dc: {
country_name: "United States"
longitude: -77.037528
fips_county: "11001"
region: "DC"
locality: "Washington"
confidence: 0.736
street_address: "1600 Pennsylvania Ave NW"
country_code: "US"
street_number: "1600"
country_code3: "USA"
latitude: 38.898746
street_name: "Pennsylvania Ave NW"
}
}

Related posts:

  1. Using our APIs is Absurdly Easy
  2. Mining patterns in search data with Google Correlate
  3. Data Geeks for Obama?

No Comments Comments Feed

Add a Comment