Tag: open data

A map of maps

The map of maps.

Over on my website Chicago Cityscape I’ve assembled a map of maps: There are 20,432 maps in 36 layers. You might say there are 36 maps, and each of those maps has an arbitrary number of boundaries within. I say there are 20,000+ maps because there’s a unique webpage for each of them that can tell you even more information about that map.

This post is to throw out some analysis of these maps, in addition to the simple counts above.

The data comes from the City of Chicago, Cook County, and the U.S. Census Bureau. Some layers have come from bespoke sources, including the entrances of CTA and Metra stations drawn by Yonah Freemark and me for Transit Explorer. The sections of the Chicago River were divided and sliced by the Metropolitan Planning Council. The neighborhood and business organizations layers were drawn by me, by interpreting textual descriptions of the organizations’ boundaries, or by visually copying an organization’s own map.

There are 6,879 unique words longer than 2 characters in the metadata of this map of maps. The most common word is “annexation”, which makes sense, given that the layer with the most maps shows the 10,668 Cook County annexation actions since 1830, when the first known plat was incorporated in the City of Chicago.

The GeoJSON file (GeoJSON is an open, human-readable GIS format) comes out to 30 MB, and it may break your browser when you try to display this layer.

The next group of words are also generic, like “planned” and “development”, related to the Planned Development kind of zoning process in Chicago – called Planned Unit Development in other jurisdictions.

After that, some names of municipalities that traded back and forth between unincorporated Cook County and incorporated municipalities are on the list.

Working down the list, however, it gets really boring and I’m going to stop. I bet if you’re a smarter data science person you can find more interesting patterns in the words, but I’ve also increased the number of generic words (like planned development) by adding these as keywords to each map’s “full text search” index, to ensure that they would respond to a variety of search phrases from users.
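A minimal sketch of this kind of word count, using only Python's standard library. The `metadata` strings are made up for illustration, and the tokenizing rule (lowercase runs of letters, longer than 2 characters) is an assumption about how the words could be split.

```python
import re
from collections import Counter

def count_words(texts, min_length=3):
    """Count lowercase words that have at least `min_length` characters."""
    counts = Counter()
    for text in texts:
        # Treat any run of letters as a word; digits and punctuation are skipped.
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return counts

# Hypothetical metadata strings, one per map.
metadata = [
    "Annexation of land to the Village of Oak Park",
    "Annexation to the City of Chicago",
    "Planned Development 123, Chicago",
]

counts = count_words(metadata)
print(len(counts))            # number of unique words
print(counts.most_common(3))  # most frequent words first
```

Adding generic keywords to each map's search index, as described above, would inflate exactly these counts, so a stop-word list is a natural next step.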

How to extract highways and subway lines from OpenStreetMap as a shapefile

It’s possible to use Overpass Turbo to extract any object from the OpenStreetMap “planet” and convert it from a GeoJSON or KML file to a shapefile for manipulation and analysis in GIS.

Say you want the subway lines for Mexico City, and you can’t find a GTFS file that you could convert to shapefile, and you can’t find the right files on Sistema de Transporte Colectivo’s website (I didn’t look for it).

Here’s how to extract the subway lines that are shown in OpenStreetMap and save them as a GIS shapefile.

This is my second tutorial to describe using Overpass Turbo. The first extracted places of worship in Cook County. I’ve also used Overpass Turbo to extract a map of campgrounds.

Extract free and open source data from OpenStreetMap

  1. Open the Overpass Turbo website and, on the map, search for the city from which you want to extract data. (The Overpass query will be generated in such a way that it’ll only search for data in the current map view.)
  2. Click the “Wizard” button in the top toolbar. (Alternatively you can copy the code below and paste it into the text area on the website and click the “Run” button.)
  3. In the Wizard dialog box, type in “railway=subway” in order to find metro, subway, or rapid transit lines. (If you want to download interstate highways, or what they call motorways in the UK, use “highway=motorway”.) Then click the “build and run query” button.
  4. In a few seconds you’ll see lines and dots (representing the metro or subway stations) on the map, and a new query in the text area. Notice that the query has looked for three kinds of objects: node (points/stations), way (the subway tracks), relation (the subway routes).
  5. If you don’t want a particular kind of object, then delete its line from the query and click the “Run” button. (You probably don’t want relation if you’re just needing GIS data for mapping purposes, and because routes are not always well-defined by OpenStreetMap contributors.)
  6. Download the data by clicking the “Export” button. Choose one of the first three options (GeoJSON, GPX, KML). If you’re going to use desktop GIS software, or place this data in a web map (like Leaflet), then choose GeoJSON. What happens after you click on GeoJSON depends on your browser. If you’re using Chrome then clicking it will download a file. If you’re using Safari then clicking it will open a new tab and put the GeoJSON text in there. Copy and paste this text into TextEdit and save the file as “mexico_city_subway.geojson”.

Screenshot 1: After searching for the city for which you want to extract data (Mexico City in this case), click the “Wizard” button and type “railway=subway” and click run.


Screenshot 2: After building and running the query from the Wizard you’ll see subway lines and stations.


Screenshot 3: Click the Export button and click GeoJSON. In Chrome, a file will download. In Safari, a new tab with the GeoJSON text will open (copy and paste this into TextEdit and save it as “mexico_city_subway.geojson”).

Convert the free and open source data into a shapefile

  1. After you’ve downloaded (via Chrome) or re-saved (Safari) a GeoJSON file of subway data from OpenStreetMap, open QGIS, the free and open source GIS desktop application for Linux, Windows, and Mac.
  2. In QGIS, add the GeoJSON file to the table of contents by either dragging the file in from the Finder (Mac) or Explorer (Windows), or by clicking File>Open and browsing and selecting the file.
  3. Convert it to a shapefile by right-clicking on the layer in the table of contents and clicking “Save As…”
  4. In the “Save As…” dialog box choose “ESRI Shapefile” from the dropdown menu. Then click “Browse” to find a place to save this file, check “Add saved file to map”, and click the “OK” button.
  5. A new layer will appear in your table of contents. In the map this new layer will be layered directly above your GeoJSON data.

Screenshot 4: The GeoJSON file exported from Overpass Turbo has now been loaded into the QGIS table of contents.


Screenshot 5: In QGIS, right-click the layer, select “Save As…” and set the dialog box to have these settings before clicking OK.
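A shapefile stores one geometry type per file, so the stations (points) and tracks (lines) in the export can't share a single shapefile. If you want to separate them before converting, a sketch like this one, using only Python's standard library, groups the features of a GeoJSON FeatureCollection by geometry type. The sample features and coordinates are made up for illustration.

```python
from collections import defaultdict

def split_by_geometry_type(collection):
    """Group the features of a GeoJSON FeatureCollection by geometry type."""
    groups = defaultdict(list)
    for feature in collection.get("features", []):
        # A feature may have a null geometry; file those under "None".
        geometry = feature.get("geometry") or {}
        groups[geometry.get("type", "None")].append(feature)
    return {
        geometry_type: {"type": "FeatureCollection", "features": features}
        for geometry_type, features in groups.items()
    }

# A tiny stand-in for the file exported from Overpass Turbo.
export = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"railway": "station"},
         "geometry": {"type": "Point", "coordinates": [-99.14, 19.43]}},
        {"type": "Feature", "properties": {"railway": "subway"},
         "geometry": {"type": "LineString",
                      "coordinates": [[-99.14, 19.43], [-99.13, 19.44]]}},
    ],
}

parts = split_by_geometry_type(export)
for geometry_type, part in parts.items():
    print(geometry_type, len(part["features"]))
```

Each part can then be written out with `json.dump` and saved as its own shapefile in QGIS.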

Query for finding subways in your current Overpass Turbo map view

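/*
This has been generated by the overpass-turbo wizard.
The original search was: “railway=subway”
*/
[out:json][timeout:25];
// gather results
(
  // query part for: “railway=subway”
  node["railway"="subway"]({{bbox}});
  way["railway"="subway"]({{bbox}});
  /*relation is for "routes", which are not always
  well-defined, so I would ignore it*/
  relation["railway"="subway"]({{bbox}});
);
// print results
out body;
>;
out skel qt;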
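
Overpass Turbo fills in the current map view for you. Outside of it there is no map view, so the bounding box has to be written into the query as (south, west, north, east). Here's a minimal sketch in Python that builds that kind of query as a string; the function name and the Mexico City coordinates are assumptions for illustration, and the sketch doesn't send the query anywhere.

```python
def build_overpass_query(key, value, bbox, kinds=("node", "way", "relation")):
    """Return an Overpass QL query for key=value inside a bounding box.

    bbox is (south, west, north, east) in decimal degrees.
    """
    south, west, north, east = bbox
    area = f"({south},{west},{north},{east})"
    # One statement per kind of object, e.g. node["railway"="subway"](...);
    statements = "\n".join(
        f'  {kind}["{key}"="{value}"]{area};' for kind in kinds
    )
    return (
        "[out:json][timeout:25];\n"
        "(\n"
        f"{statements}\n"
        ");\n"
        "out body;\n"
        ">;\n"
        "out skel qt;"
    )

# Rough bounding box around Mexico City (approximate, for illustration only).
mexico_city = (19.2, -99.4, 19.6, -98.9)

# Leave out "relation" to skip routes, as suggested in step 5.
query = build_overpass_query("railway", "subway", mexico_city, kinds=("node", "way"))
print(query)
```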

How to use Chicago Cityscape’s upgraded names search tool

Search for names of people who do business in Chicago.

I created a combined dataset of over 2 million names, including contractors, architects, business names, and business owners and their shareholders, from Chicago’s open data portal, and property owners/managers from the property tax database. It’s one of three new features published in the last couple of weeks.

Type a person or company name in the search bar and press “search”. In less than 1 second you’ll get results and a hint as to what kind of records we have.

What should you search?

Take any news article about a Chicago kinda situation, like this recent Chicago Sun-Times article about the city using $8 million in taxpayer-provided TIF district money to move the Harriet Rees house one block. The move made way for a taxpayer-funded property acquisition on which the DePaul/McCormick Place stadium will be built.

The CST is making the point that something about the house’s sale and movement is sketchy (although I don’t know if they showed that anything illegal happened).

There’re a lot of names in the article, but here are some of the ones we can find info about in Chicago Cityscape.

Salvatore Martorina – an architect & building permit expeditor, although this name is connected to a lot of other names on the business licenses section of Cityscape

Oscar Tatosian – rug company owner, who owned the vacant lot to which the Rees house was moved

Bulley & Andrews – construction company which moved the house

There were no records for the one attorney and two law firms mentioned.

Who are the top property owners in Cook County?

235 West Van Buren Street

There are several hundred condo units in the building at 235 W Van Buren Street, and each unit is associated with multiple Property Index Numbers (PIN). Photo by Jeff Zoline.

Several people have used Chicago Cityscape to try to find out who owns a property. Since I’ve got property tax data for 2,013,563 individually billed pieces of property in Cook County, I can help them research that answer.

The problem, though, is that the data, from the Cook County combined property tax website, only shows who receives the property tax bills – the recipient – who isn’t always the property’s owner.

The combined website is a great tool. Property value info comes from the Assessor’s office. Sales data comes from the Recorder of Deeds, which is another, separately elected, Cook County government agency. Finally, the Treasurer’s office, a third agency, also with a separately elected leader, sends the bills and collects the tax.

The following is a list of the top 100 (or so) “property tax bill recipients” in Cook County for the tax years 2010 to 2014, ranked by the number of associated Property Index Numbers.

Many PINs have changed recipients after being sold or divided, and the data only lists the recipient at its final tax year. A tax bill for Unit 1401 at 235 W Van Buren St was at one time sent to “235 VAN BUREN, CORP” (along with 934 other bills), but in 2011 the PIN was divided after the condo unit was sold.

Of the 100 names, DataMade’s new “probablepeople” name parsing Python script identified 13 as persons. It mistakenly identified eight names as “Person”, leaving five people in the top 100.

The actual number of distinct recipients in the list is closer to 90, arrived at by combining 5 names that seem to be the same (using OpenRefine’s clustering function) and removing 5 “to the current taxpayer” and empty names. You’ll notice “Altus” listed four times (they’re based in Phoenix) and Chicago Title Land Trust, which can help property owners remain private, listed twice (associated with 643 PINs).

[table id=2 /]
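
A ranking like this can be sketched in a few lines of Python. This is a simplified stand-in, not the process described above: the `normalize` function only uppercases names and strips punctuation, which is much cruder than OpenRefine's clustering, and the rows and PIN labels are made up.

```python
from collections import Counter

def normalize(name):
    """Uppercase a name, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.upper() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def rank_recipients(rows, top=100):
    """Rank bill recipients by how many distinct PINs they are associated with."""
    pins_by_recipient = {}
    for pin, recipient in rows:
        pins_by_recipient.setdefault(normalize(recipient), set()).add(pin)
    counts = Counter({name: len(pins) for name, pins in pins_by_recipient.items()})
    return counts.most_common(top)

# Hypothetical (PIN, recipient) rows; the PIN labels are placeholders.
rows = [
    ("PIN-0001", "235 VAN BUREN, CORP"),
    ("PIN-0002", "235 Van Buren Corp"),
    ("PIN-0003", "235 VAN BUREN CORP."),
    ("PIN-0004", "CHICAGO TITLE LAND TRUST"),
]

ranked = rank_recipients(rows)
print(ranked)
```

Counting distinct PINs in a set, rather than counting rows, keeps a PIN from being counted twice if the same bill shows up in more than one tax year.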

Working with ZIP code data (and alternatives to using sketchy ZIP code data)

1711 North Kimball Avenue, built 1890

This building at 1711 N Kimball no longer receives mail and the local mail carrier would mark it as vacant. After a minimum length of time the address will appear in the United States Postal Service’s vacancy dataset, provided by the federal Department of Housing and Urban Development. Photo: Gabriel X. Michael.

Working with accurate ZIP code data in your geographic publication (website or report) or demographic analysis can be problematic. The most accurate dataset – perhaps the only one that could be called reliably accurate – is one that you purchase from one of the United States Postal Service’s (USPS) authorized resellers. If you want to skip the introduction on what ZIP codes really represent, jump to “ZIP-code related datasets”.

Understanding what ZIP codes are

In other words, the post office’s ZIP code data, which it uses to deliver mail and not, as your publication or analysis would, to locate people, is not free. It is also, unbeknownst to many, a dataset that lists mail carrier routes. It’s not a boundary or polygon, although many of the authorized resellers transform it into a boundary so buyers can geocode the location of their customers (retail companies might use this for customer tracking and profiling, and petition-creating websites for determining your elected officials).

The Census Bureau has its own issues using ZIP code data. For one, the ZIP code data changes as routes change and as delivery points change. Census boundaries need to stay somewhat constant to be able to compare geographies over time, and Census tracts stay the same for a period of 10 years (between the decennial surveys).

Understanding that ZIP codes are well known (everybody has one and everybody knows theirs) and that it would be useful to present data on that level, the Bureau created “ZIP Code Tabulation Areas” (ZCTA) for the 2000 Census. Each one is a collection of Census blocks that resembles a ZIP code’s area (they also often share the same 5-digit identifiers). The ZCTA and an area representing a ZIP code have a lot of overlap and can share much of the same space. ZCTA data is freely downloadable from the Census Bureau’s TIGER shapefiles website.

There’s a good discussion about what ZIP codes are and aren’t on the GIS StackExchange.

Chicago example of the problem

Here’s a real-world example of the kinds of problems that ZIP code data availability and comprehension cause. Those working on the Chicago Health Atlas ran into this problem when they were using two different datasets: ZCTA from the Census Bureau and ZIP codes as prepared by the City of Chicago and published on its open data portal. Their solution, which is really a stopgap measure and needs further review not just by those involved in the app but by a diverse group of data experts, was to add a disclaimer that they use ZCTAs instead of the USPS’s ZIP code data.

ZIP-code related datasets

Fast forward to why I’m telling you all of this: The U.S. Department of Housing and Urban Development (HUD) has two ZIP-code based datasets that may prove useful to mappers and researchers.

1. ZIP code crosswalk files

This is a collection of eight datasets that link a level of Census geography to ZIP codes (and the reverse). The most useful to me is ZIP to Census tract. This dataset tells you in which ZIP code a Census tract lies (including if it spans multiple ZIP codes). HUD is using data from the USPS to create this.

The dataset is documented well on their website and updated quarterly, going back to 2010. The most recent file comes as a 12 MB Excel spreadsheet.
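
One use for a ZIP to tract crosswalk is spreading ZIP-level counts across Census tracts. The sketch below assumes each crosswalk row carries a ratio giving the share of a ZIP code's residential addresses that fall in a tract. The field names (`zip`, `tract`, `res_ratio`), tract labels, and numbers are all made up for illustration, so check HUD's documentation for the real column names before adapting it.

```python
from collections import defaultdict

def zip_counts_to_tracts(zip_counts, crosswalk):
    """Spread ZIP-level counts across tracts using each row's ratio."""
    tract_totals = defaultdict(float)
    for row in crosswalk:
        # ZIP codes missing from zip_counts contribute nothing.
        count = zip_counts.get(row["zip"], 0)
        tract_totals[row["tract"]] += count * row["res_ratio"]
    return dict(tract_totals)

# Hypothetical crosswalk rows. A ZIP code can span several tracts,
# and a tract can receive shares from several ZIP codes.
crosswalk = [
    {"zip": "60647", "tract": "TRACT_A", "res_ratio": 0.75},
    {"zip": "60647", "tract": "TRACT_B", "res_ratio": 0.25},
    {"zip": "60622", "tract": "TRACT_B", "res_ratio": 1.0},
]

# Hypothetical counts per ZIP code, e.g. customers or survey responses.
zip_counts = {"60647": 200, "60622": 40}

tract_counts = zip_counts_to_tracts(zip_counts, crosswalk)
print(tract_counts)
```

The result is an estimate: it assumes whatever you're counting is distributed the same way as residential addresses.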

2. Vacant addresses

The USPS employs thousands of mail carriers to deliver things to the millions of households across the country, and they keep track of when the mail carrier cannot deliver something because no one lives in the apartment or house anymore. The address vacancy data tells you the following characteristics at the Census tract level:

  • total number of addresses the USPS knows about
  • number of addresses on urban routes to which the mail carrier hasn’t been able to deliver for 90 days and longer
  • “no-stat” addresses: undeliverable rural addresses, places under construction, urban addresses unlikely to be active
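
With those counts, a tract's vacancy rate is the number of vacant addresses divided by the total. Here is a minimal sketch; the field names and numbers are made up, not HUD's, and the no-stat count is deliberately left out of the calculation.

```python
def vacancy_rate(total_addresses, vacant_addresses):
    """Return vacant addresses as a share of all known addresses."""
    if total_addresses == 0:
        return None  # no known addresses, so no meaningful rate
    return vacant_addresses / total_addresses

# Hypothetical tract rows; the field names are assumptions for illustration.
tracts = [
    {"tract": "TRACT_A", "total": 1200, "vacant": 60, "no_stat": 15},
    {"tract": "TRACT_B", "total": 0, "vacant": 0, "no_stat": 0},
]

for row in tracts:
    rate = vacancy_rate(row["total"], row["vacant"])
    print(row["tract"], "n/a" if rate is None else f"{rate:.1%}")
```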

You must register to download the vacant addresses data and be a governmental entity or non-profit organization*, per the agreement** HUD has with USPS. Learn more and download the vacancy data, which HUD updates quarterly.

Tina Fassett Smith is a researcher at DePaul University’s Institute of Housing Studies and reviewed part of this blog post. She urges readers to ignore the “no-stat” addresses in the USPS’s vacancy dataset. She said that research by her and her colleagues at the IHS concluded this section of the data is unreliable. Tina also said that the methodology mail carriers use to identify vacant addresses and places under change (construction or demolition) isn’t made public and that mail carriers have an incentive to collect the data instead of being compensated normally. Tina further explained the issues with no-stat:

We have seen instances of a relationship between the number of P.O. boxes (i.e., the presence of a post office) and the number of no-stats in an area. This is one reason we took it off of the IHS Data Portal. We have not found it to be a useful data set for better understanding neighborhoods or housing markets.

The Institute of Housing Studies provides vacancy data on their portal for those who don’t want to bother with the HUD sign-up process to obtain it.

* It appears that HUD doesn’t verify your eligibility.

** This agreement also states that one can only use the vacancy data for the “stated purpose”: “measuring and forecasting neighborhood changes, assessing neighborhood needs, and measuring/assessing the various HUD programs in which Users are involved”.