Tag: PostGIS

How I used ST_ClusterDBSCAN to locate clusters of multiple, similar parcels

Alternative headline: A practical example of how to use ST_ClusterDBSCAN to find similar real estate properties.

Oftentimes a developer wants to acquire several adjacent lots for a single redevelopment. Each standard sized lot in Chicago is about 3,125 square feet (25 feet wide and 125 feet deep). Because of downzoning in 2004, and since, the zoning rules for many lots allow only about 3-4 dwelling units each. Multiple lots are required to develop buildings with 6-9 dwelling units, which is a sweet spot in Chicago for design and avoiding having to get an upzone.

Chicago Cityscape has long had Property Finder, a tool to locate parcels that meet exacting specifications given existing lot size, current zoning district, distance to transit, and other criteria.

Now, Chicago Cityscape can locate parcels that are adjacent or near each other that all meet the user’s specified criteria (what the website calls “filters”). This is possible because of the PostGIS function ST_ClusterDBSCAN.

ST_ClusterDBSCAN considers all of the geospatial features in your result set (whatever matches the WHERE clause) and assigns each one a cluster ID according to two inputs: the minimum number of features a cluster must contain, and the maximum distance a feature can be from another feature for the two to be considered part of the same cluster.

The function can also assign a feature a cluster ID of NULL, indicating that the feature did not meet the clustering criteria and stands alone.
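
As a minimal sketch (against a hypothetical points table called "pts"), each row comes back with a cluster ID or NULL:

SELECT id,
       ST_ClusterDBSCAN(geom, eps := 25, minpoints := 3) OVER () AS cid  -- NULL when not in any cluster
FROM pts;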

Show me what that looks like

Chicago Cityscape gives the user three clustering options: small, compact clusters with at least 3 properties each; small, compact clusters with at least 5 properties each; and large, loose clusters with at least 10 properties each.

Additionally, Chicago Cityscape lets the user choose between showing and hiding the parcels that weren’t found in a cluster. The reason to show them is to visualize, on the same map, where there are and aren’t clusters of parcels.

A map of Chicago’s Near West Side community area is shown with clusters of vacant lots. The “show all properties” mode is used, which shows clusters with a thick, black outline. Properties that were not in a cluster are still shown but without the thick black outline (enlarge the photo to see the difference).

Sample query

This query looks at all of the vacant lots within 1 mile of the intersection of Washington Boulevard and Karlov Avenue in the West Garfield Park community area of Chicago. The query looks for clusters of at least 3 features (“minpoints”) that are no more than 25 feet apart (“eps”). (The data are projected in Illinois StatePlane East (feet) rather than a meter-based projection because feet are easier for me to work with.)

I posted another sample query below that’s used to exclude all of the features that were not assigned to a cluster.

SELECT pin14,
       ST_ClusterDBSCAN(geom, eps := 25, minpoints := 3) OVER () AS cid,
       geom
FROM parcels
WHERE property_class = '1-00'
  AND ST_DWithin(geom,
                 ST_Transform(ST_GeomFromText('POINT(-87.7278 41.8819)', 4326), 3435),
                 5280);

The screenshot below shows clusters of vacant lots that resulted from the query above. The parcels symbolized in a gray gradient were not assigned to a cluster. Notice how clusters will form across the alleys but not across streets; this is because the streets are wider than 25 feet but most alleys are only 16 feet wide.

The map shows various groups (clusters) of vacant properties in West Garfield Park. Each cluster is symbolized in QGIS using a different color. Properties that are not in a cluster are symbolized by a gray gradient.

Exclusion sample query

This query is the same as above, except that a Common Table Expression (CTE, the part that begins with the “WITH” keyword) is used as a subquery. The “WITH” subquery is the one that clusters the parcels, and the following query (“SELECT *”) throws out any features returned by the subquery that don’t have a cluster ID (the “cid” field). Note that I’ve named the CTE “clustered_parcels” so it doesn’t shadow the “parcels” table it reads from.

WITH clustered_parcels AS (
    SELECT pin14,
           ST_ClusterDBSCAN(geom, eps := 25, minpoints := 3) OVER () AS cid,
           geom
    FROM parcels
    WHERE property_class = '1-00'
      AND ST_DWithin(geom,
                     ST_Transform(ST_GeomFromText('POINT(-87.7278 41.8819)', 4326), 3435),
                     5280)
)
SELECT *
FROM clustered_parcels
WHERE cid IS NOT NULL;

I would also recommend Dan Baston’s blog post from six years ago, which has more commentary and explanation, plus additional examples of how to use the function.

How to visualize the density of point data in a grid

A common way to show the distribution of places (like grocery stores) is to use a heat map. The map will appear “hotter” where there are many grocery stores and “colder” where there are few grocery stores. This kind of map can be useful to show gaps in distribution or a neighborhood that has a lot of grocery stores.

One issue with that kind of heat map is that the coverage areas change their shape and color as you zoom in, since the algorithm that clusters or determines what’s “nearby” or dense has fewer locations to analyze.

I prefer to use grids of square tiles, since Chicago is a grid-oriented city and the vast majority of our streets and routes run along east-west and north-south lines. The map above shows the locations of subjects and topics of news articles in the Chicago Cityscape database.

I use PostGIS to set up most of my spatial data before visualizing it in QGIS.

This tutorial shows the two steps to using PostGIS to (1) create a grid based on an existing area (polygon), and (2) assign each point location to a tile in that grid and count the number of locations in each tile.

If you work with spatial data a lot and don’t have PostGIS installed, you should install it. It is much, much faster at performing most of the tasks you would use QGIS or ArcGIS to perform, and both QGIS and ArcGIS can read and edit data stored in PostGIS.

Additionally, QGIS has a function that can create grids, and another that can do comparisons by location and count/summarize, so all of this can also be done without PostGIS.

For this tutorial, you will need a single polygon with which to create the grid. I used the boundary of the City of Chicago limits.

1. Create a grid based on an existing area

1.a. Add a new function to PostGIS

To create a grid, you need a function that draws the tiles based on the polygon. I got this from The Spatial Database Advisor.

-- Create required type
DROP   TYPE IF EXISTS T_Grid CASCADE;
CREATE TYPE T_Grid AS (
  gcol  int4,
  grow  int4,
  geom geometry
);
-- Drop the function if it exists
DROP FUNCTION IF EXISTS ST_RegularGrid(geometry, NUMERIC, NUMERIC, BOOLEAN);
-- Now create the function
CREATE OR REPLACE FUNCTION ST_RegularGrid(p_geometry   geometry,
                                          p_TileSizeX  NUMERIC,
                                          p_TileSizeY  NUMERIC,
                                          p_point      BOOLEAN DEFAULT TRUE)
  RETURNS SETOF T_Grid AS
$BODY$
DECLARE
   v_mbr   geometry;
   v_srid  int4;
   v_halfX NUMERIC := p_TileSizeX / 2.0;
   v_halfY NUMERIC := p_TileSizeY / 2.0;
   v_loCol int4;
   v_hiCol int4;
   v_loRow int4;
   v_hiRow int4;
   v_grid  T_Grid;
BEGIN
   IF ( p_geometry IS NULL ) THEN
      RETURN;
   END IF;
   v_srid  := ST_SRID(p_geometry);
   v_mbr   := ST_Envelope(p_geometry);
   v_loCol := trunc((ST_XMIN(v_mbr) / p_TileSizeX)::NUMERIC );
   v_hiCol := CEIL( (ST_XMAX(v_mbr) / p_TileSizeX)::NUMERIC ) - 1;
   v_loRow := trunc((ST_YMIN(v_mbr) / p_TileSizeY)::NUMERIC );
   v_hiRow := CEIL( (ST_YMAX(v_mbr) / p_TileSizeY)::NUMERIC ) - 1;
   FOR v_col IN v_loCol..v_hiCol LOOP
     FOR v_row IN v_loRow..v_hiRow LOOP
         v_grid.gcol := v_col;
         v_grid.grow := v_row;
         IF ( p_point ) THEN
           v_grid.geom := ST_SetSRID(
                             ST_MakePoint((v_col * p_TileSizeX) + v_halfX,
                                          (v_row * p_TileSizeY) + v_halfY),
                             v_srid);
         ELSE
           v_grid.geom := ST_SetSRID(
                             ST_MakeEnvelope((v_col * p_TileSizeX),
                                             (v_row * p_TileSizeY),
                                             (v_col * p_TileSizeX) + p_TileSizeX,
                                             (v_row * p_TileSizeY) + p_TileSizeY),
                             v_srid);
         END IF;
         RETURN NEXT v_grid;
     END LOOP;
   END LOOP;
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE
  COST 100
  ROWS 1000;

The ST_RegularGrid function works in the same projection as your source data.

1.b. Create a layer that has all of the tiles for just the Chicago boundary

-- This creates a grid of tiles 1,320 feet square (a 2x2-block size in Chicago)
SELECT gcol, grow, geom
INTO b_chicagoboundary_grid_1320squared
FROM ST_RegularGrid((SELECT geom FROM chicagoboundary WHERE gid = 1), 1320, 1320, FALSE);

In that query, “1320” is a distance in feet along both the X and Y axes, as the “chicagoboundary” geometry is projected in Illinois StatePlane FIPS East (Feet) (EPSG/SRID 3435).
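
If your boundary is instead stored in a geographic projection like EPSG:4326, you can reproject it on the fly before gridding; a sketch using the same tables as above:

-- Reproject a 4326 boundary into 3435 before creating the grid
SELECT gcol, grow, geom
INTO b_chicagoboundary_grid_1320squared
FROM ST_RegularGrid(
       (SELECT ST_Transform(geom, 3435) FROM chicagoboundary WHERE gid = 1),
       1320, 1320, FALSE);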

2. Assign each point location to a tile in that grid and count the number of locations in each tile

Now you’ll need a table that has POINT-type geometries in it. For the map in this tutorial, I used a layer of location-based news articles that are used in Chicago Cityscape to highlight local developments.

-- SELECT ... INTO doesn't create a primary key, so add one to the grid table first:
-- ALTER TABLE b_chicagoboundary_grid_1320squared ADD COLUMN id serial PRIMARY KEY;
SELECT grid.id, count(*) AS count, grid.geom
INTO news_grid
FROM news_articles, b_chicagoboundary_grid_1320squared AS grid
WHERE ST_Intersects(news_articles.geom, grid.geom)
GROUP BY grid.id, grid.geom;

This query will result in a table with three columns:

  1. The ID of the tile, which is a primary key field.
  2. The number of news articles in that tile.
  3. The POLYGON geometry of that tile.
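
A quick way to sanity-check the result is to list the busiest tiles:

-- The ten tiles with the most news articles
SELECT id, count
FROM news_grid
ORDER BY count DESC
LIMIT 10;
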
Look at these two maps (the one above and the one below). The first map shows the whole city. The tiles are colored according to the number of news articles within the area of each tile: the darker the blue, the more news articles within that tile.

This map is zoomed in to the Woodlawn area. As you change scale (zoom in or zoom out), the size of the “heat” area (the size of each tile) doesn’t change: the tiles are still 1,320 feet by 1,320 feet. The colors don’t change either. The typical heat map doesn’t have these advantages.

The genius of using ST_Subdivide to speed up PostGIS intersection comparisons

You should use ST_Subdivide to break up large shapes into smaller ones

This infographic visually compares the difference between running a PostGIS comparison query like ST_Intersects on a large shape versus a subdivided version of that large shape. Click to embiggen.

Hundreds of GIS intersection comparisons are completed every hour on Chicago Cityscape.*

People are looking at, say, a map of the South Shore community area. That “Place” page then grabs all of the building permits, building violations, business licenses, and other “feature layers” that are stored as points.

A classic “point in polygon” comparison is made using the ST_Intersects(place_geometry, permits_geometry) function.

This has worked very well for several years.

The problem

But as Chicago Cityscape handles larger shapes, whether drawn by users or built in, like the downtown Chicago area, this query doesn’t cut it.

Setting indexes on the geometry columns is imperative, but it’s not the end of the story when it comes to optimizing performance. That’s because the index on the geometry is a rectangular bounding box (also called an “envelope” in GIS) that contains the entire shape of the South Shore community area.
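
For reference, those indexes are GiST indexes on the geometry columns; something like this (the table and index names here are placeholders):

-- GiST indexes on the geometry columns
CREATE INDEX places_geom_gix  ON places  USING GIST (geom);
CREATE INDEX permits_geom_gix ON permits USING GIST (geom);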

The downtown Chicago area, however, is not even the largest shape I have. That title belongs to the new Place, “Neighborhood Opportunity Fund investment zones” (NOF). Combined, the zones cover 75 square miles of Chicago; downtown is only 7.7 square miles.

After I added the NOF map and tested its Place page, it “crashed” my server, metaphorically speaking. The query just to count the number of building permits in the area took about five minutes.

There had to be a better way. In the meantime, I divided the NOF map into West and South sections, which hardly improved the counting time.

The solution

Thankfully, today I saw a tweet from Paul Ramsey linking to his blog post, which in turn linked to his slides from a recent presentation about using PostgreSQL to store and manipulate GIS data.

In it he explained how the ST_Subdivide function works. I’m going to demonstrate it using graphics from my own maps.

A normal intersection comparison, using ST_Intersects(place_geometry, permits_geometry) in a query, first compares the bounding box (envelope) around each geometry to quickly determine whether the two envelopes overlap. If they do, it then checks whether the actual geometries overlap; if those do, that data is returned as a response to your query.
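
In query form, the naive comparison looks something like this (the table names and slug value are placeholders):

-- Count permits inside one large place, one big geometry at a time
SELECT count(*)
FROM permits, places
WHERE places.slug = 'some-large-place'  -- hypothetical slug
  AND ST_Intersects(places.geom, permits.geom);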

When your two datasets are massive, like the NOF zones, which collectively cover a third of Chicago, and the building permits, which are found across the entire city…well, that’s what led to the five-minute counting time.

Enter ST_Subdivide. To use it properly, you run it against your existing geometry and store the much smaller shapes derived from the big shape in a new table. I applied the function to all 22,203 maps that Chicago Cityscape has and stored their unique IDs and subdivided geometries in a new “lookup” table.

Now, any time I want to compare the building permits against the NOF zones, the permits are instead compared to the small shapes that were subdivided.
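
In query form, the comparison now targets the lookup table; a sketch using the same placeholder names as above, with DISTINCT guarding against a permit that falls exactly on a shared edge between two sub-shapes being counted twice:

-- Count permits inside the same place, using the subdivided lookup table
SELECT count(DISTINCT permits.gid)
FROM permits
JOIN view_places_subdivided AS s
  ON ST_Intersects(s.geom, permits.geom)
WHERE s.slug = 'some-large-place';  -- hypothetical slug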

The query

Chicago Cityscape uses a single table (created as a materialized view) to combine all 22,203 maps. Each map is stored in a source table (for example, there’s a table to hold the 77 community areas) and the materialized view runs once a day to combine all of the maps in the source tables. This ensures our data is managed well: different source tables can hold different information, and the single table holds only the name, type, and geometry of the source tables, for faster comparison. Each entry in the single table also has a “slug”, its unique identifier.

Thus, the materialized view of the subdivided maps is created from the aforementioned single table, using this query:

CREATE MATERIALIZED VIEW view_places_subdivided AS
SELECT gid || '_' || random() AS gid, slug, ST_Subdivide(geom) AS geom
FROM view_places;

The “gid” expression is designed to create a new unique ID field, as the slug will be repeated for every subdivided piece of each map. A unique ID field is necessary if you want to refresh the materialized view concurrently (which allows other queries to access the materialized view while it’s being refreshed).
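
With that in place, a concurrent refresh looks like this (the index name is my own):

-- A unique index is what makes REFRESH ... CONCURRENTLY possible
CREATE UNIQUE INDEX view_places_subdivided_gid_idx
  ON view_places_subdivided (gid);
REFRESH MATERIALIZED VIEW CONCURRENTLY view_places_subdivided;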

* The results are cached for a few hours, because the feature layers change 1-2 times per day and at different times each day; the limited-duration cache accommodates that. Ideally I would code a way to invalidate the cache when the feature layer data is updated.

Update 12/31/19: ST_Subdivide will fail if your geometries have certain geometry errors (I don’t know whether any kind of error makes it fail, or only certain kinds). Chicago Cityscape has over 37,000 features that ST_Subdivide attempts to process, and there is a lot of room for error when managing that many features from dozens of sources.
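
One possible mitigation (an assumption on my part; I haven’t verified it against every failure case) is to repair geometries with ST_MakeValid before subdividing:

-- Repair invalid geometries before subdividing (ST_MakeValid as a
-- possible fix; not verified against every kind of geometry error)
CREATE MATERIALIZED VIEW view_places_subdivided AS
SELECT gid || '_' || random() AS gid, slug,
       ST_Subdivide(ST_MakeValid(geom)) AS geom
FROM view_places;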

Crashes by bike or by foot at different intersections

While working on a private web application I call Chicago Crash Browser, I added some code to show the share of pedestrian and pedalcyclist crashes. The site offers users (sorry, I don’t have a web server that can make it public) a list of the “Top 10” intersections in terms of bike crash frequency (that’s bike+auto crashes). You can click on an intersection to populate a list of all the pedestrian and pedalcyclist crashes there, sorted by date. At the bottom of the list is a simple sentence giving the percentage of pedestrian and pedalcyclist crashes at that intersection.
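
That bottom-of-the-list percentage is a simple split; here’s a sketch of the kind of query behind it (the “crashes” table, its columns, and the example point are all assumptions, not the app’s actual schema):

-- Share of pedestrian vs. pedalcyclist crashes within 150 feet of an
-- intersection's center (table, columns, and point are hypothetical)
SELECT crash_type,
       count(*) AS crashes,
       round(100.0 * count(*) / sum(count(*)) OVER (), 2) AS pct
FROM crashes
WHERE crash_type IN ('PEDESTRIAN', 'PEDALCYCLIST')
  AND ST_DWithin(geom,
                 ST_Transform(ST_GeomFromText('POINT(-87.66 41.90)', 4326), 3435),
                 150)
GROUP BY crash_type;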

I’m still developing ideas on how this information may be useful, and what it’s saying about the intersection or the people using it.

Let me tell you about a few (in the lists below, ** marks the more common crash type at each intersection):

Milwaukee Avenue and Ogden Avenue

I mentioned in my article Initial intersection crash analysis for Milwaukee Avenue that this intersection is the most bike crash-frequent.

23 crashes within 150 feet of the center, 2005-2010

82.61% bike crashes **

17.39% ped crashes

Ashland Avenue and Division Street

28 crashes within 150 feet of the center, 2005-2010

46.43% bike crashes

53.57% ped crashes **

Milwaukee, North and Damen Avenues

46 crashes within 150 feet of the center, 2005-2010

39.13% bike crashes

60.87% ped crashes **

Halsted Street, Lincoln and Fullerton Avenues

38 crashes within 150 feet of the center, 2005-2010

42.11% bike crashes

57.89% ped crashes **

Montrose Avenue and Marine Drive (Lake Shore Drive ramps)

11 crashes within 150 feet of the center, 2005-2010

90.91% bike crashes **

9.09% ped crashes

Why do you think some intersections have more of one kind of crash than the other?

People walking at Milwaukee-North-Damen.

The Chicago Crash Browser could be made public if I had a host that offers a PostgreSQL database. Do you have one to offer?