Category: Data

Avoid “VARCHAR” columns in PostgreSQL [PGSQL Phriday #006]

February 28, 2023 / Steven Vance / 1 Comment

This is one thing I wish I knew while learning PostgreSQL: Don’t ever use the VARCHAR column type. I have never encountered a situation where that would have been superior to using TEXT.

The problem with VARCHAR column type is that it can be a fixed-width column that doesn’t allow text longer than a certain number of characters. This maximum character length can be introduced by the person who creates the table and adds the column.

The maximum character length can also be created by a GIS program (specifically, QGIS and the ogr2ogr command line tool) as well as some online tools that can convert spreadsheets to SQL and even write the table definition and “CREATE TABLE” statement. In both cases, the program tried to guess the column type and seemed to default to VARCHAR(y character length) instead of TEXT.

The maximum character length is a problem because trying to insert more text than allowed into that field will encounter an error and the INSERT or UPDATE statement will fail. I think it’s also a bad idea to create these columns because it makes it expensive to change your mind later about what length of text will go into that field.

Other things I wish I knew earlier

Using transactions to test your query and its affect on your data so that you can easily undo any changes.
How and when to use indexes beyond the “btree” index type (for example, “gin” for array and JSONB columns).
Add “time_created_at” and “time_updated_at” columns to basically every dataset. This makes it easy to sort things later; sometimes when creating a new dataset it’s not always obvious how you might want to display or manipulate the data until days, weeks, or months later. Sorting data by “when was the information changed last” is a common use case for me.

map of vacant properties in Near West Side community area

How I used ST_ClusterDBSCAN to locate clusters of multiple, similar parcels

November 18, 2022 / Steven Vance / 0 Comments

Alternative headline: A practical example of how to use ST_ClusterDBSCAN to find similar real estate properties.

Oftentimes a developer wants to acquire several adjacent lots for a single redevelopment. Each standard sized lot in Chicago is about 3,125 square feet (25 feet wide and 125 feet deep). Because of downzoning in 2004, and since, the zoning rules for many lots allow only about 3-4 dwelling units each. Multiple lots are required to develop buildings with 6-9 dwelling units, which is a sweet spot in Chicago for design and avoiding having to get an upzone.

Chicago Cityscape has long had Property Finder, a tool to locate parcels that meet exacting specifications given existing lot size, current zoning district, distance to transit, and other criteria.

Now, Chicago Cityscape can locate parcels that are adjacent or near each other that all meet the user’s specified criteria (what the website calls “filters”). This is possible because of the PostGIS function ST_ClusterDBSCAN.

ST_ClusterDBSCAN considers all geospatial features in your result set (whatever matches the WHERE clause) and assigns them to a cluster ID according to two inputs: minimum cluster size, and maximum distance each feature can be from any other feature in order to be considered in the same cluster as that other feature.

The function can also assign a feature with a cluster ID of NULL, indicating that the feature did not meet the clustering criteria and is alone.

Show me what that looks like

Chicago Cityscape gives the user three options to cluster: Small, compact clusters with at least 3 properties each; small, compact clusters with at least 5 properties each; large, loose clusters with at least 10 properties each.

Additionally, Chicago Cityscape lets the user choose between showing parcels that weren’t found in a cluster, or hiding parcels that weren’t found in a cluster. The reason to show parcels that weren’t found in a cluster is to visualize where there are and aren’t clusters of parcels in the same map.

A map of Chicago’s Near West Side community area is shown with clusters of vacant lots. The “show all properties” mode is used, which shows clusters with a thick, black outline. Properties that were not in a cluster are still shown but without the thick black outline (enlarge the photo to see the difference).

Sample query

This query looks at all of the vacant lots within 1 mile of the intersection of Washington Boulevard and Karlov Avenue in the West Garfield Park community area of Chicago. The query looks for clusters of at least 3 features (“minpoints”) that are no more than 25 feet apart (“eps”). (The data are projected in Illinois StatePlane East Feet, rather than a projection that’s in meters because it’s easier for me to work with feet.)

I posted another sample query below that’s used to exclude all of the features that were not assigned to a cluster.

SELECT pin14, ST_ClusterDBSCAN(geom, eps := 25, minpoints := 3) over () AS cid, geom
FROM parcels
WHERE property_class = '1-00'
	AND ST_DWithin(geom,
        ST_Transform(
            ST_GeomFromText('POINT(-87.7278 41.8819)', 4326), 3435),
           5280)

The screenshot below shows clusters of vacant lots that resulted from the query above. The parcels symbolized in a gray gradient were not assigned to a cluster. Notice how clusters will form across the alleys but not across streets; this is because the streets are wider than 25 feet but most alleys are only 16 feet wide.

The map shows various groups (clusters) of vacant properties in West Garfield Park. Each cluster is symbolized in QGIS using a different color. Properties that are not in a cluster are symbolized by a gray gradient.

Exclusion sample query

This query is the same as above except that a Common Table Expression (CTE) is used (CTEs have the “WITH” keyword at the beginning) to create a subquery. The “WITH” subquery is the one that clusters the parcels and the following query (“SELECT *”) throws out any features returned by the subquery that don’t have a cluster ID (the “cid” field).

with parcels as (
SELECT pin14, ST_ClusterDBSCAN(geom, eps := 25, minpoints := 3) over () AS cid, geom
FROM parcels
WHERE property_class = '1-00'
	AND ST_DWithin(geom,
        ST_Transform(
            ST_GeomFromText('POINT(-87.7278 41.8819)', 4326), 3435),
           5280)
) select * 
from parcels where cid is not null;

I would also recommend Dan Baston’s blog post from six years ago which has more commentary and explanation, and additional examples of how to use the function.

Chicago Crash Browser updates with stats, filters, and news article links

August 9, 2022 / Steven Vance / 0 Comments

Today I’m adding a bunch of new features to the Chicago Crash Browser, which lives on Chicago Cityscape.

But first…special access is no longer required. Anyone can create a free Cityscape account and access the map. However, only those with special access or a Cityscape Real Estate Pro account will be able to download the data.

Five new features include:

Statistics that update weekly to summarize what happened in the past week: the number of crashes, the number of people killed in crashes, and the number of people with the two worst tiers of injuries. The statistics are viewable to everyone, including those without access to the crash browser.

screenshot of the new weekly statistics — The statistics will update every Sunday. The numbers may change throughout the week as Chicago police officers upload crash reports.

For data users, the crash record ID is viewable. The crash record ID links details about the same crash across the Chicago data portal’s three tables: Crashes, Vehicles, and People. My Chicago Crash Browser is currently only using the Crashes table. Click on the “More details” arrow in the first table column.

screenshot of the data table showing the crash record ID revealed. — The crash record ID is hidden by default but can be exposed. Use this ID to locate details in the data portal’s Vehicles and People tables.

Filter crashes by location. There are currently two location filters: (1) on a “Pedestrian Street” (a zoning designation to, over time, reduce the prevalence of car-oriented land uses and improve building design to make them more appealing to walk next to); (2) within one block of a CTA or Metra station, important places where people commonly walk to. Select a filter’s radio button and then click “Apply filters”.
Filter crashes by availability of a news article or a note. I intend to attach news articles to every crash where a pedestrian or bicyclist was killed (the majority of these will be to Streetsblog Chicago articles, where I am still “editor at large”. Notes will include explanations about data changes [1] (the “map editor” mentioned in some of the notes is me) and victims’ names.

screenshot of the two types of filters — After choosing a filter’s radio button click “Apply filters” and the map and data table will update.

Filter by hit and run status. If the officer filling out the crash report marked it as a hit and run crash, you can filter by choosing “Yes” in the options list. “No” is another option, as is “not recorded”, which means the officer didn’t select yes or no.
Search by address. Use the search bar inside the map view to center the map and show crashes that occurred within one block (660 feet) of that point. The default is one block and users can increase that amount using the dropdown menu in the filter.

screenshot of the map after the search by address function has been used — Use the search bar within the map view to show crashes near a specific address in Chicago.

Footnotes

[1] The most common data change as of this writing is when a crash’s “most severe injury” is upgraded from non-fatal to fatal, but the crash report in the city’s data portal does not receive that update. This data pipeline/publishing issue is described in the browser’s “Crash data notes” section.

The “map editor” (me) will change a crash’s “most severe injury” to fatal to ensure it appears when someone filters for fatal crashes. This change to the data will be noted.

Chicagoland’s massive parking footprint – as measured on September 16, 2018

September 16, 2018 / Steven Vance / 1 Comment

Using the footprints of parking lots and garages drawn into OpenStreetMap as a data source, the area of land in Chicagoland occupied by parking lots and garages is 247,539,968 square feet. (The data was exported using HOT Export Tool; you can replicate my export.)

That converts to:

5,682.71 acres
8.88 mi^2 (square miles)
22.99 km^2 (square kilometers)
≈ 0.26 × area of Manhattan (≈ 87 km^2 )
3.9% area of Chicago is parking (Chicago is ~589.56 km^2 )

(I forgot to measure the portion of this within Chicago, and now the data snapshot is gone. I fixed this in the 2019 report.)

Why Jefferson Park residents should allow more housing

August 20, 2018 / Steven Vance / 0 Comments

Short answer: To provide more shoppers for the local businesses. Read on for the longer answer.

Over on Chicago Cityscape I added a new feature called “market analysis” which measures the number of people who live within specific walking areas (measured by time) and driving areas (measured by distance).

I am in favor of removing apartment & condo bans in Chicago, especially in areas where they were previously allowed and near train stations.

Jefferson Park is centered around two co-located train stations, serviced by CTA and Metra respectively. There have been multiple proposals for multi-family housing near the stations (collectively called the Jefferson Park Transit Center) and some have been approved.

Always, however, there are residents who resist these proposals and the number of originally proposed apartments or condos gets reduced in the final version (classic NIMBYism).

There’re four reasons – at least – why more housing should be allowed near the Jefferson Park Transit Center:

Locally owned businesses require a significant amount of shoppers who live nearby and walk up traffic
More people should have the opportunity to live near low-cost transportation
It will include more affordable housing, through Chicago’s inclusionary zoning rules (the Affordable Requirements Ordinance, ARO)
There will be less driving, and therefore lower household transportation costs and less neighborhood pollution

To support the first reason, I used the “market analysis” tool to see just how many people live in a walkable area centered around Veterans Square, a mixed-use office and retail development adjacent to the train stations.

Only 9,368 people live within a 10 minute walk to Veterans Square (get the Address Snapshot).

Comparatively, 19,707 people live within a 10 minute walk to The Crotch, or the center of Wicker Park, at the intersection of Milwaukee/North/Damen (get the Address Snapshot). The Blue Line station is about 75 feet south of the center point.

I would grant the low Veterans Square number a small discount based on the proximity to the Kennedy Expressway, which severely truncates walking areas up and down the northwest side. Still, even with that discount, ending up with less than half the amount as the one in Wicker Park, is disturbing. Wicker Park is hardly characterized by high-density housing. In fact, all of the new high-rises are just outside the 10 minute walk shed!

Selected series of posts

Personal Links

Receive new posts via email

Recent Posts

Category: Data

Avoid “VARCHAR” columns in PostgreSQL [PGSQL Phriday #006]

How I used ST_ClusterDBSCAN to locate clusters of multiple, similar parcels

Chicago Crash Browser updates with stats, filters, and news article links

Chicagoland’s massive parking footprint – as measured on September 16, 2018

Why Jefferson Park residents should allow more housing

Share this:

Share this:

Share this:

Share this:

Share this: