Category: Data

Developing a method to score Divvy station connectivity

A Divvy station at Halsted/Roscoe in Boystown, covered in snow after the system was shut down for the first time to protect workers and members. Photo by Adam Herstein.

While researching a new Streetsblog Chicago article I’m writing about Divvy, Chicago’s bike-share system, I wanted to know which stations (really, neighborhoods) had the best connectivity. Stations are nodes in a network, and the bike-share network’s quality depends on how quickly (a measure of time) and in how many ways one can move from node to node.

The Institute for Transportation and Development Policy’s (ITDP) report “The Bike-Share Planning Guide” [PDF] says that one station every 300 meters (984 feet) “should be the basis to ensure mostly uniform coverage”. It also says there should be 10 to 16 stations per square kilometer of the coverage area, which has a more qualitative definition. It’s really up to the system designer, but the report says “the coverage area must be large enough to contain a significant set of users’ origins and destinations”. If you make it too small it won’t meaningfully connect places and “the system will have a lower chance of success because its convenience will be compromised”. (I was inspired to research this after reading coverage of the report in Next City by Nancy Scola.)

Since I don’t yet know the coverage area – I lack the city’s planning guide and geodata – I’ll use two datasets to see if Chicago meets the 300 meters/984 feet standard.

Dataset 1

The first dataset I created was a distance matrix in QGIS that measured the straight-line distance between each station and its eight nearest stations. This means I would cover a station in all directions, N, S, E, W, and NW, NE, SE, and SW. Download first dataset, distance matrix.

Each dataset offers multiple ways to gauge connectivity. The first dataset, using a straight-line distance method, gives me the mean, standard deviation, maximum value, and minimum value of each station’s distances to its eight nearest stations. I sorted the dataset by mean. The station with the lowest mean has the tightest cluster of neighbors; in other words, its eight nearest stations are, on average, closer to it than is the case for any station further down the list.

Sorting the first dataset by lowest mean gives these top five best-connected stations:

  1. Canal St & Monroe St, a block north of Union Station (191), mean of 903.96 feet among nearest 8 stations
  2. Clinton St & Madison St, outside Presidential Towers and across from Northwestern Train Station (77), 964.19 feet
  3. Canal St & Madison St, outside Northwestern Train Station (174), 972.40
  4. Canal St & Adams St, north side of Union Station’s Great Hall (192), 982.02
  5. State St & Randolph St, outside Walgreens and across from Block 37 (44), 1,04.19

The least-connected stations are:

  1. Prairie Ave & Garfield Blvd (204), where the nearest station is 4,521 feet away (straight-line distance), or 8.8x greater than the best-connected station, and the mean of the nearest 8 stations is 6,366.82 feet (straight-line distance)
  2. California Ave & 21st St (348), 6,255.32
  3. Kedzie Ave & Milwaukee Ave (260), 5,575.30
  4. Ellis Ave & 58th St (328), 5,198.72
  5. Shore Drive & 55th St (247), 5,168.26
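
For anyone who would rather script this than use QGIS, here is a minimal Python sketch of the same idea: compute the straight-line distance from each station to every other station, keep the eight nearest, and rank by the mean. It is not the QGIS workflow I used. The function names, the input format (a dict of station name to latitude/longitude), and the haversine formula are all assumptions made for illustration, and it needs at least two stations to work.

```python
import math
import statistics

EARTH_RADIUS_FT = 6_371_008.8 / 0.3048  # mean Earth radius, meters converted to feet


def straight_line_ft(a, b):
    """Great-circle (haversine) distance in feet between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_FT * math.asin(math.sqrt(h))


def nearest_k_stats(stations, k=8):
    """Map each station name to stats over its k nearest straight-line distances."""
    stats = {}
    for name, pos in stations.items():
        # Distances to every other station, shortest first, trimmed to the k nearest.
        dists = sorted(
            straight_line_ft(pos, other)
            for other_name, other in stations.items()
            if other_name != name
        )[:k]
        stats[name] = {
            "mean": statistics.mean(dists),
            "stdev": statistics.pstdev(dists),
            "min": dists[0],
            "max": dists[-1],
        }
    return stats


def rank_by_mean(stations, k=8):
    """Station names ordered from lowest to highest mean distance (best connected first)."""
    stats = nearest_k_stats(stations, k)
    return sorted(stats, key=lambda name: stats[name]["mean"])
```

Calling `rank_by_mean(stations)` with the real station coordinates should give an ordering comparable to the lists above, though small differences from the QGIS output are possible because the distance formula may not match.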

Dataset 2

The second dataset I manipulated is based on Alex Soble’s DivvyBrags Chrome extension, which uses a distance matrix created by Nick Bennett (here’s the file) that estimates the bicycle route distance between each station and every other station. This means 88,341 rows! Download second dataset, distance by bike – I loaded it into MySQL to use its math functions, but you could probably use Python or R.

The two datasets had some overlap (in bold), but only when finding the stations with the lowest connectivity. In the second dataset, using the estimated bicycle route distance, I ranked stations by the number of other stations within 2.5 miles, a distance that fits comfortably within the 30-minute fee-free period (about 12.5 minutes of riding at a 12 MPH average). The following are the top five best-connected stations:

  1. Ogden Ave & Chicago Ave, 133 stations within 2.5 miles
  2. Green St & Milwaukee Ave, 131
  3. Desplaines St & Kinzie St, 129
  4. (tied) Larrabee St & Kingsbury St and Carpenter St & Huron St, 128
  5. (tied) Clinton St & Lake St and Green St & Randolph St, 125

Notice that none of these stations overlap with the best-connected stations from the first dataset, and none are downtown. And the least-connected stations (these stations have the fewest nearby stations) are:

  1. Shore Drive & 55th St, 11 stations within 2.5 miles
  2. (tied) Ellis Ave & 58th St and Lake Park Ave & 56th St, 12
  3. (tied) Kimbark Ave & 53rd St and Blackstone Ave & Hyde Park Blvd and Woodlawn Ave & 55th St, 13
  4. Prairie Ave & Garfield Blvd, 14
  5. Cottage Grove Ave & 51st St, 15
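
The counting step behind these two lists can be sketched in a few lines of Python. I did this in MySQL, so treat the following as an illustration only: it assumes each row of the distance matrix is an (origin, destination, route distance in meters) tuple, and `count_within` is a made-up name.

```python
from collections import Counter

METERS_PER_MILE = 1609.344


def count_within(rows, limit_miles=2.5):
    """Count, for each origin station, how many other stations are within limit_miles by bike."""
    limit_m = limit_miles * METERS_PER_MILE
    counts = Counter()
    for origin, destination, meters in rows:
        counts[origin] += 0  # make sure stations with no neighbors in range still appear
        if origin != destination and meters <= limit_m:
            counts[origin] += 1
    return counts
```

`count_within(rows).most_common(5)` would then list the five stations with the most neighbors in range.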

The second dataset gives you a lot more options for devising a complex or weighted scoring system. For example, you could weight certain factors slightly higher than the number of stations accessible within 2.5 miles. Or you could multiply or divide some factors to obtain a different score.

I tried another method on the second dataset – ranking by average instead of nearby station quantity – and came up with a completely different “highest connectivity” list. Stations that appeared in the least-connected stations list showed up as having the lowest average distance from that station to every other station that was 2.5 miles or closer. Here’s that list:

  1. Kimbark Ave & 53rd St – 13 stations within 2.5 miles, 1,961.46 meters average distance to those 13 stations
    Blackstone Ave & Hyde Park Blvd – 13 stations, 2,009.31 meters average
    Woodlawn Ave & 55th St – 13 stations, 2,027.54 meters average
  2. Cottage Grove Ave & 51st St – 15 stations, 2,087.73 meters average
  3. State St & Kinzie St – 101 stations, 2,181.64 meters average
  4. Clark St & Randolph St – 111 stations, 2,195.10 meters average
  5. State St & Wacker Dr – 97 stations, 2,207.10 meters average
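
Here is a rough Python sketch of this second ranking, again assuming rows of (origin, destination, route distance in meters); the function name is hypothetical. It shows why a station with only a handful of neighbors can top the list: the average is taken only over the stations that fall inside the 2.5-mile limit, however few they are.

```python
from collections import defaultdict
from statistics import mean

LIMIT_M = 2.5 * 1609.344  # 2.5 miles expressed in meters


def rank_by_average(rows, limit_m=LIMIT_M):
    """Return (station, neighbors in range, average meters to them), lowest average first."""
    in_range = defaultdict(list)
    for origin, destination, meters in rows:
        if origin != destination and meters <= limit_m:
            in_range[origin].append(meters)
    ranked = [(name, len(ds), mean(ds)) for name, ds in in_range.items()]
    return sorted(ranked, key=lambda item: item[2])
```

Stations with no neighbors in range are left out entirely, since an average over zero stations is undefined.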

Back to 300 meters

The original question was to see if there’s a Divvy station every 300 meters (or 500 meters in outlying areas and areas of lower demand). Nope. Only 34 of 300 stations, 11.3%, have a nearby station no more than 300 meters away. 183 stations have a nearby station no further than 500 meters – 61.0%. (You can duplicate these findings by looking at the second dataset.)
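
The percentages are simple shares of the 300 stations. As a small sketch (the function name and the list-of-distances input are assumptions), given each station’s distance in meters to its nearest neighbor:

```python
def share_within(nearest_m, threshold_m):
    """Return (count, percent) of stations whose nearest neighbor is within threshold_m meters."""
    hits = sum(1 for d in nearest_m if d <= threshold_m)
    return hits, 100 * hits / len(nearest_m)


# The figures quoted above, recomputed from the counts in the text.
pct_300 = round(100 * 34 / 300, 1)   # 11.3
pct_500 = round(100 * 183 / 300, 1)  # 61.0
```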

Concluding thoughts

ITDP’s bike-share planning guide says that “residential population density is often used as a proxy to identify those places where there will be greater demand”. Job density and the cluster of amenities should also be used, but for the purposes of my analysis, residential density is an easy datum to grab.

It appears that stations in Woodlawn, Washington Park, and Hyde Park west of the Metra Electric line fare the worst in station connectivity. The 60637 ZIP code (representing those neighborhoods) contains half of the least-connected stations and has a residential density of 10,468.9 people per square mile while 60642, containing 3 of the 7 best-connected stations, has a residential density of 11,025.3 people per square mile. There’s a small difference in density but an enormous difference in station connectivity.

However, I haven’t looked at the number of stations per square mile (again, I don’t know the originally planned coverage area), nor the rise or drop in residential density in adjacent ZIP codes.

There are myriad other factors to consider, as well, including – according to ITDP’s report – current bike mode share, transit and bikeway networks, and major attractions. It recommends using these to create a “demand profile”.

Station density is important for user convenience, “to ensure users can bike and park anywhere” in the coverage area, and to increase market penetration (the number of people who will use the bike-share system). When Divvy and the Chicago Department of Transportation add 175 stations this year – some for infill and others to expand the coverage area – they should explore the areas around and between the stations that were ranked with the lowest connectivity to decrease the average distance to their nearby stations and to increase the number of stations within 2.5 miles (a distance that fits comfortably within the 30-minute fee-free period at a 12 MPH average).

N.B. I was going to make a map, but I didn’t feel like spending more time combining the datasets (I needed to get the geographic data from one dataset to the other in order to create a symbolized map). 

Getting a little closer to understanding Chicago’s pothole-filling performance status

Tom Kompare updated his web application, which tracks the progress of potholes based on information in the city’s data portal, in response to my query about how many potholes the city fills within 72 hours, which is the Chicago Department of Transportation’s performance measure.

He wrote to me via the Open Government Chicago group:

Without completely rewriting http://potholes.311services.org, I added a count of the number of open (not yet addressed) pothole repair tickets (requests) that exceed 3 days old. As of today, the data from the City of Chicago’s Data Portal shows 1,334 of the 1,404 open tickets in the 311 system are older than three days.

Full disclosure: The web app actually looks for greater than 4 days old. The Data Portal’s pothole data are only updated once a day, so these data are always a day old. 4 – 1 = 3.

Keep in mind that this web app only shows how many are yet to be addressed, and does not count how many have been patched within CDOT’s 3-day goal during some arbitrary time period. That is a much more intense calculation than this pure client-side JavaScript web application can handle due to bandwidth restrictions on mobile (3/4G). This web app already pushes the mobile envelope with the amount of data downloaded. I can fix that, but, again, not without a rewrite.
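
Tom’s “4 – 1 = 3” cutoff can be sketched like this in Python. This is not his JavaScript; the function and parameter names are made up, and it assumes you already have the creation date of every open ticket.

```python
from datetime import date


def overdue_open(created_dates, today, goal_days=3, lag_days=1):
    """Count open tickets older than the goal, allowing for data that is lag_days old."""
    cutoff = goal_days + lag_days  # the 3-day goal plus the portal's 1-day delay
    return sum(1 for created in created_dates if (today - created).days > cutoff)
```

A ticket created exactly four days before `today` is not counted, because the test is “greater than 4 days old”.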

Still, 1,334 open repair requests (12/16/2013 Data Portal data) is quite different from the number of open repair requests reported by CDOT (560 in alleys, 193 on streets) on 12/16/2013. I’m not sure what accounts for the difference.

This reminds me of a third issue with the way CDOT is presenting pothole performance data online (the first being that it’s PDF, the second that it doesn’t work in Safari). The six PDF files are overwritten for every new day of data. If you want information from two days ago, well you better have downloaded the PDF from two days ago!

CTA fare breakdown for Ventra and fares it replaces

This CTA graphic shows all the fare media Ventra replaces. 

The Chicago Transit Authority expanded its pilot contactless card fare payment technology systemwide in 2002, and introduced Chicago Card Plus, which added the benefit of linking to a credit/debit card, in 2004. After 11 years, the two cards were hardly “popular” as Jon Hilkevitch called them today. In the context of his article I believe he meant “liked” or “admired” and not widespread, as Ventra does not have the same admiration because of all of the issues people are experiencing.

While Chicago Card/Plus users likely preferred this fare payment method over magnetic stripe cards for its convenience and speed, only a minority of passengers used it.

Data from CTA for January to July 2013, representing 1.6 million average weekday rides.

Magnetic Stripe: 75%
CCP & CC: 19% (17% & 2% respectively)
Bus Cash: 6%

Ventra? 69% this week.

CDOT misses the lesson on open data transparency

Publishing the wrong measurement as a PDF isn’t transparency.

The Chicago Department of Transportation released the first progress report to its Chicago Forward Action Agenda in October, two and a half years after the plan – the first of its kind – was published. I’ve spent an inordinate amount of time reading it and putting off a review. Why? It’s been difficult to compare the original and update documents. The update is extremely light on specifics and details for the many goals in the Action Agenda, which should have organizational (like record keeping and efficiency improvements) and public impacts (like figuring out which intersections have the most crashes). I’ll publish my in-depth review this week.

Aside from missing specifics and details, the update presents information differently and is missing status updates for the three to five “performance measures” in each chapter. It was difficult to understand CDOT’s reported progress without holding the original and update side-by-side. I think listing the original action item, the progress symbol, and then a status update would have been an easier way to read the document.

The update measures some action items differently than originally called for, and the way it presented pothole repair, a problem for people bicycling and driving, caught my analytical eye.

CDOT’s pothole-filling performance measure is the percentage of potholes “patched or fixed within 72 hours of being reported”, which it wants to increase, but the average, according to the website Chicago Potholes, which tracks the city’s open data, is 101 days*. The update doesn’t explain the gap, writing only that “the 72 hour goal for filling potholes is not always feasible due to asphalt plant schedules” and nothing about the performance measure itself.

As originally written, the only way to note the performance would be to list the percentage of potholes filled within the goal time, at the beginning and in the update. This performance measure has a complementary action item – an online dashboard – which could have provided the answer, but didn’t.
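
To make that concrete, here is a minimal Python sketch of the calculation the dashboard would need. It is an illustration only: the function name is invented, and it assumes each request is an (opened, completed) pair of datetimes, with `None` for requests that are still open.

```python
from datetime import datetime, timedelta
from statistics import mean, median


def pothole_performance(requests, goal=timedelta(hours=72)):
    """Summarize closed requests: share closed within the goal, plus mean/median days to close."""
    durations = [done - opened for opened, done in requests if done is not None]
    if not durations:
        return None  # nothing has been closed yet, so there is no rate to report
    days = [d.total_seconds() / 86400 for d in durations]
    within = sum(1 for d in durations if d <= goal)
    return {
        "closed": len(durations),
        "pct_within_goal": 100 * within / len(durations),
        "mean_days": mean(days),
        "median_days": median(days),
    }
```

Open requests are ignored here, so the percentage describes only potholes that were eventually patched.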

CDOT published that dashboard this summer as a series of six PDF files that update daily and you can hardly call it useful.

Publishing PDF files in the day and age of open government data – popular with President Obama and Mayor Rahm Emanuel – is unacceptable. Even if they are accessible – meaning you can copy/paste the text – they are poor outlets for data given the nationally-renowned civic innovation changes that Emanuel has succeeded in establishing.

There’s another problem: the dashboard file for pothole tracking doesn’t track the time it takes to close a pothole request, nor the number of pothole requests that are patched within 72 hours. It simply tells the number completed yesterday, the year to date, and the number of unpatched requests. (I’ve posted the pothole-tracking file to Scribd because the dashboard [PDF] doesn’t work in Safari; I also notified city staff of this problem, which they acknowledged over three weeks ago.)

The “Chicago Works For You” website reports a different metric, that of the number of requests made each day, distributed by ward.

I discussed the proposed dashboard with former commissioner Gabe Klein over two years ago. He said he wanted to create a dashboard of projects “we’re working on that’s updated once a week.” Given Klein’s high professional accessibility to me, John Greenfield, and other reporters, I’ll give him and CDOT a pass for not doing this. But Klein also said, “I’m really big on transparency and good communication. When I left [Washington,] D.C. our [Freedom of Information Act Requests] were dramatically lowered.”

I’ll consider the pothole performance measure and action item “in need of major progress.”

* For stats geeks, the median is 86 and standard deviation is ±84.

Why do speeding crashes in Chicago lead to worse injuries?

Don’t git behind me. Photo by Richard Masoner. 

A discussion about Chicagoans’ proclivity for tailgating (on a post about speed cameras) prompted me to look at the prevalence of this in causing crashes. I looked at the three-year period of 2010-2012 first, mainly so the numbers wouldn’t be so large, and left this information in a comment. But considering the prerequisites* for a crash to be reported in this dataset, and my desire to compare two multi-year periods, I switched my analysis to the single four-year period 2009-2012.

2009-2012

Total crashes: 318,193. Total fatalities: 554 people.

Tailgating crashes

62,080 crashes, 19.53% of all crash types

Tailgating crashes, injuries breakdown:

  • Killed: .0012 (this represents the number of deaths per crash). 75 people died in these crashes, representing 13.54% of all deaths.
  • Incapacitating injuries: 8.53% (the average distribution of people’s injuries in all tailgating crashes)
  • Non-Incapacitating: 46.32%
  • Possible injury: 45.15%

The share of all crash types that are tailgating has increased steadily from 18.11% in 2009 to 20.79% in 2012.

Speeding crashes

10,339 crashes, 3.24% of all crash types

Speeding injuries:

  • Killed: .0118 (this represents the number of deaths per crash). 122 people died in these crashes, representing 22.02% of all deaths.
  • Incapacitating injuries: 15.55% (the average distribution of people’s injuries in all speeding crashes)
  • Non-Incapacitating: 51.95%
  • Possible injury: 32.50%

The share of all crash types that are speeding has decreased slightly from 3.72% in 2009 to 3.02% in 2012. While speeding leads to fewer crashes, it leads to a greater incidence of death and serious injury. The probability of a speeding crash leading to at least one death seems to stay steady through the period while the probability of seeing a person with an incapacitating injury versus a different kind of injury varies more, but not so much in a range that overlaps the rates for tailgating crashes.

A future comparison of injuries should look at the top crash causes for death and serious injury.

N/A and Unable to determine crashes

237,729 crashes, 74.71% of all crash types

N/A and unable to determine injuries:

  • Killed: .0013 (this represents the number of deaths per crash). 305 people died in these crashes, representing 55.05% of all deaths.
  • Incapacitating injuries: 9.38% (the average distribution of people’s injuries in all N/A crashes)
  • Non-Incapacitating: 48.26%
  • Possible injury: 42.35%

Notes

Updated December 4, 2013

I updated the wording on how to interpret these numbers. For example, previously for “killed” there was a percentage saying this number represented the share of crashes that had at least one death. This wasn’t accurate: the same number represents a rate of deaths per crash of that type. Injury percentages represent the distribution of injury types experienced by all the people injured in crashes of that type.
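
As a worked example of that interpretation, here is a small Python sketch. The tailgating crash and death totals come from the figures above; the function name is invented, and the A/B/C injury counts in the usage note are hypothetical round numbers, chosen only to reproduce the published percentages, because this post doesn’t list the raw injury counts. It assumes A, B, and C stand for incapacitating, non-incapacitating, and possible injuries.

```python
def summarize(crashes, killed, injuries, all_killed):
    """Deaths per crash, share of all deaths, and the distribution of injury types."""
    injured = sum(injuries.values())
    return {
        "deaths_per_crash": killed / crashes,
        "pct_of_all_deaths": 100 * killed / all_killed,
        # Each injury type as a percentage of all injured people, not of crashes.
        "injury_pct": {kind: 100 * n / injured for kind, n in injuries.items()},
    }
```

For tailgating, `summarize(62080, 75, {"A": 853, "B": 4632, "C": 4515}, 554)` gives about .0012 deaths per crash and 13.54% of all deaths; the injury percentages simply mirror the made-up counts.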

Reliability

Analyzing crash causes is not very reliable as 45.60% of the reported crashes in 2012 had “N/A” or “unable to determine” listed as the primary cause! The third and fourth most frequently ascribed causes were the two tailgating codes (described below). There are some crashes that had one of these two causes in the secondary cause field but I haven’t calculated that.

Cause code descriptions

Each crash has two cause codes. For tailgating crashes I searched for reports where “failing to reduce speed to avoid crash” or “following too closely” appeared in either the primary or secondary cause field (it’s possible that a report had both of these causes ascribed). For speeding crashes I searched for “speed excessive for conditions” or “exceeding speed limit” in either the primary or secondary cause fields.

Prerequisites

This data excludes crashes that caused no injury and no property damage greater than $500 (2005 to 2008) or $1,500 (2009 to 2012). You cannot compare the two datasets when you want to see a share of all crashes because the number of “all crashes” will be underreported in the second dataset.

Queries

These are some of the MySQL queries I used to get the data out of my own crash database (I’m figuring out ways to make it public, using a shared login). “Cause 1 code” indicates the primary cause of the crash according to the police officer’s judgement. “Cause 2 code” indicates the secondary cause of the crash according to the police officer’s judgement.

1. Crash cause reliability: SELECT count(casenumber), sum(`Total killed`), `Cause2`, `Cause 2 code` FROM `CrashExtract_Chicago` WHERE year = 12 GROUP BY `Cause 2 code` ORDER BY cast(`Cause 2 code` as signed)

2. Speeding crashes: SELECT count(casenumber), sum(`Total killed`), sum(`totalInjuries`), sum(`A injuries`), sum(`B injuries`), sum(`C injuries`) FROM `CrashExtract_Chicago` WHERE (`Cause 1 code` = 1 OR `Cause 1 code` = 27 OR `Cause 2 code` = 1 OR `Cause 2 code` = 27) AND year > 8

3. Tailgating crashes: SELECT count(casenumber), sum(`Total killed`), sum(`totalInjuries`), sum(`A injuries`), sum(`B injuries`), sum(`C injuries`) FROM `CrashExtract_Chicago` WHERE (`Cause 1 code` = 3 OR `Cause 1 code` = 28 OR `Cause 2 code` = 3 OR `Cause 2 code` = 28) AND year > 8

4. N/A and Unable to determine crashes: SELECT count(casenumber), sum(`Total killed`), sum(`totalInjuries`), sum(`A injuries`), sum(`B injuries`), sum(`C injuries`) FROM `CrashExtract_Chicago` WHERE (`Cause 1 code` = 18 OR `Cause 1 code` = 99) AND year > 8
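
If you want to try the same kind of query without a MySQL server, the sketch below runs a parameterized version of queries 2 and 3 against an in-memory SQLite table from Python. It is a stand-in, not my database: the table is rebuilt with only the columns the query needs, the four sample rows are invented, and `cause_totals` and `demo_connection` are hypothetical helper names.

```python
import sqlite3

SPEEDING = (1, 27)    # cause codes from the speeding query above
TAILGATING = (3, 28)  # cause codes from the tailgating query above


def cause_totals(conn, codes, min_year=9):
    """Crash count, deaths, and A/B/C injuries where either cause field matches codes."""
    marks = ", ".join("?" for _ in codes)
    sql = f"""
        SELECT COUNT(casenumber), SUM("Total killed"),
               SUM("A injuries"), SUM("B injuries"), SUM("C injuries")
        FROM CrashExtract_Chicago
        WHERE ("Cause 1 code" IN ({marks}) OR "Cause 2 code" IN ({marks}))
          AND year >= ?
    """
    return conn.execute(sql, [*codes, *codes, min_year]).fetchone()


def demo_connection():
    """Build a tiny in-memory table with invented rows, for trying the query out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE CrashExtract_Chicago (
            casenumber INTEGER, year INTEGER,
            "Cause 1 code" INTEGER, "Cause 2 code" INTEGER,
            "Total killed" INTEGER, "A injuries" INTEGER,
            "B injuries" INTEGER, "C injuries" INTEGER)
    """)
    conn.executemany(
        "INSERT INTO CrashExtract_Chicago VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (1, 9, 1, 99, 1, 0, 1, 0),   # speeding as primary cause
            (2, 10, 3, 27, 0, 1, 0, 0),  # tailgating primary, speeding secondary
            (3, 8, 1, 18, 2, 0, 0, 0),   # year 8, excluded by the year filter
            (4, 12, 28, 3, 0, 0, 2, 1),  # tailgating in both fields
        ],
    )
    return conn
```

A crash that lists a speeding code in one field and a tailgating code in the other is counted in both totals, which matches how the original queries behave.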