database performance – Steven Can Plan

/* Show the last time at which every table in the public schema was vacuumed and analyzed */
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_all_tables
WHERE schemaname = 'public';

/* See which queries are most popular (requires the pg_stat_statements extension);
   replace 1000 with any integer, depending on how heavily your database is queried */
SELECT * FROM pg_stat_statements WHERE calls >= 1000;

/* Reset the statistics collected by pg_stat_statements */
SELECT pg_stat_statements_reset();

/* Kill a specific query; replace INTEGER with the process ID (pid) of that query */
SELECT pg_terminate_backend(INTEGER);

/* Get the size of the database; replace DATABASE_NAME with the name of your database */
SELECT pg_size_pretty(pg_database_size('DATABASE_NAME'));

/* Kill queries that are taking too long; replace DATABASE_NAME with the name of your database
   and change 15 to any integer that you think means a query is taking too long */
SELECT query, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'DATABASE_NAME'
    AND pid <> pg_backend_pid()
    AND (state = 'active' OR state = 'idle')
    AND state_change < current_timestamp - INTERVAL '15' MINUTE;

/* See which queries are running */
SELECT * FROM pg_stat_activity;

/* See which client queries are currently waiting on something */
SELECT *
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
    AND backend_type = 'client backend';

/* Find very large indexes and indexes that aren't used much */
WITH table_scans AS (
    SELECT relid,
        tables.idx_scan + tables.seq_scan AS all_scans,
        (tables.n_tup_ins + tables.n_tup_upd + tables.n_tup_del) AS writes,
        pg_relation_size(relid) AS table_size
    FROM pg_stat_user_tables AS tables
),
all_writes AS (
    SELECT sum(writes) AS total_writes
    FROM table_scans
),
indexes AS (
    SELECT idx_stat.relid, idx_stat.indexrelid,
        idx_stat.schemaname, idx_stat.relname AS tablename,
        idx_stat.indexrelname AS indexname,
        idx_stat.idx_scan,
        pg_relation_size(idx_stat.indexrelid) AS index_bytes,
        indexdef ~* 'USING btree' AS idx_is_btree
    FROM pg_stat_user_indexes AS idx_stat
        JOIN pg_index USING (indexrelid)
        JOIN pg_indexes AS indexes
            ON idx_stat.schemaname = indexes.schemaname
            AND idx_stat.relname = indexes.tablename
            AND idx_stat.indexrelname = indexes.indexname
    WHERE pg_index.indisunique = FALSE
),
index_ratios AS (
    SELECT schemaname, tablename, indexname,
        idx_scan, all_scans,
        round((CASE WHEN all_scans = 0 THEN 0.0::NUMERIC
            ELSE idx_scan::NUMERIC / all_scans * 100 END), 2) AS index_scan_pct,
        writes,
        round((CASE WHEN writes = 0 THEN idx_scan::NUMERIC
            ELSE idx_scan::NUMERIC / writes END), 2) AS scans_per_write,
        pg_size_pretty(index_bytes) AS index_size,
        pg_size_pretty(table_size) AS table_size,
        idx_is_btree, index_bytes
    FROM indexes
        JOIN table_scans USING (relid)
),
index_groups AS (
    SELECT 'Never Used Indexes' AS reason, *, 1 AS grp
    FROM index_ratios
    WHERE idx_scan = 0
        AND idx_is_btree
    UNION ALL
    SELECT 'Low Scans, High Writes' AS reason, *, 2 AS grp
    FROM index_ratios
    WHERE scans_per_write <= 1
        AND index_scan_pct < 10
        AND idx_scan > 0
        AND writes > 100
        AND idx_is_btree
    UNION ALL
    SELECT 'Seldom Used Large Indexes' AS reason, *, 3 AS grp
    FROM index_ratios
    WHERE index_scan_pct < 5
        AND scans_per_write > 1
        AND idx_scan > 0
        AND idx_is_btree
        AND index_bytes > 100000000
    UNION ALL
    SELECT 'High-Write Large Non-Btree' AS reason, index_ratios.*, 4 AS grp
    FROM index_ratios, all_writes
    WHERE (writes::NUMERIC / (total_writes + 1)) > 0.02
        AND NOT idx_is_btree
        AND index_bytes > 100000000
    ORDER BY grp, index_bytes DESC
)
SELECT reason, schemaname, tablename, indexname,
    index_scan_pct, scans_per_write, index_size, table_size
FROM index_groups;

/* Modify the autovacuum and autoanalyze factors for a single table
   https://klotzandrew.com/blog/posgres-per-table-autovacuum-management */
ALTER TABLE cache SET (autovacuum_analyze_scale_factor = 0, autovacuum_analyze_threshold = 100);

/* Show each table's dead tuples and modifications since the last analyze,
   next to its effective autovacuum and autoanalyze thresholds */
WITH raw_data AS (
    SELECT pg_namespace.nspname,
        pg_class.relname,
        pg_class.oid AS relid,
        pg_class.reltuples,
        pg_stat_all_tables.n_dead_tup,
        pg_stat_all_tables.n_mod_since_analyze,
        (SELECT split_part(x, '=', 2) FROM unnest(pg_class.reloptions) q (x)
            WHERE x ~ '^autovacuum_analyze_scale_factor=') AS c_analyze_factor,
        (SELECT split_part(x, '=', 2) FROM unnest(pg_class.reloptions) q (x)
            WHERE x ~ '^autovacuum_analyze_threshold=') AS c_analyze_threshold,
        (SELECT split_part(x, '=', 2) FROM unnest(pg_class.reloptions) q (x)
            WHERE x ~ '^autovacuum_vacuum_scale_factor=') AS c_vacuum_factor,
        (SELECT split_part(x, '=', 2) FROM unnest(pg_class.reloptions) q (x)
            WHERE x ~ '^autovacuum_vacuum_threshold=') AS c_vacuum_threshold,
        to_char(pg_stat_all_tables.last_vacuum, 'YYYY-MM-DD HH24:MI:SS') AS last_vacuum,
        to_char(pg_stat_all_tables.last_autovacuum, 'YYYY-MM-DD HH24:MI:SS') AS last_autovacuum
    FROM pg_class
        JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
        LEFT OUTER JOIN pg_stat_all_tables ON pg_class.oid = pg_stat_all_tables.relid
    WHERE n_dead_tup IS NOT NULL
        AND nspname NOT IN ('information_schema', 'pg_catalog')
        AND nspname NOT LIKE 'pg_toast%'
        AND pg_class.relkind = 'r'
),
data AS (
    SELECT *,
        COALESCE(raw_data.c_analyze_factor, current_setting('autovacuum_analyze_scale_factor'))::float8 AS analyze_factor,
        COALESCE(raw_data.c_analyze_threshold, current_setting('autovacuum_analyze_threshold'))::float8 AS analyze_threshold,
        COALESCE(raw_data.c_vacuum_factor, current_setting('autovacuum_vacuum_scale_factor'))::float8 AS vacuum_factor,
        COALESCE(raw_data.c_vacuum_threshold, current_setting('autovacuum_vacuum_threshold'))::float8 AS vacuum_threshold
    FROM raw_data
)
SELECT relid, nspname, relname, reltuples,
    n_dead_tup,
    ROUND(reltuples * vacuum_factor + vacuum_threshold) AS v_threshold,
    n_mod_since_analyze,
    ROUND(reltuples * analyze_factor + analyze_threshold) AS a_threshold,
    c_analyze_factor AS caf, c_analyze_threshold AS cat,
    c_vacuum_factor AS cvf, c_vacuum_threshold AS cvt,
    analyze_factor AS af, analyze_threshold AS at,
    vacuum_factor AS vf, vacuum_threshold AS vt,
    last_vacuum, last_autovacuum
FROM data
ORDER BY n_dead_tup DESC;

You should use ST_Subdivide to break up large shapes into smaller ones

This infographic compares running a PostGIS comparison query like ST_Intersects on a large shape with running it on a subdivided version of that shape.

Hundreds of GIS intersection comparisons are completed every hour on Chicago Cityscape.*

People are looking at, say, a map of the South Shore community area. That “Place” page then grabs all of the building permits, building violations, business licenses, and other “feature layers” that are stored as points.

A classic “point in polygon” comparison is made using the ST_Intersects(place_geometry, permits_geometry) function.
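As a rough sketch, a count like that could look like the query below. The permits table, its geom column, and the 'south-shore' slug value are hypothetical names used only for illustration; view_places and slug are the table and column described later in this post.

```sql
-- Count the building permits that fall inside one Place.
-- "permits", "permits.geom", and the slug value are hypothetical.
SELECT count(*) AS permit_count
FROM view_places AS place
JOIN permits AS permit
    ON ST_Intersects(place.geom, permit.geom)
WHERE place.slug = 'south-shore';
```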

This has worked very well for several years.

The problem

But as Chicago Cityscape handles larger shapes, both the ones that users draw themselves and existing ones like the downtown Chicago area, this query doesn’t cut it.

Setting indexes on the geometry is imperative, but it’s not the end of the work to optimize performance. That’s because the spatial index stores a rectangular bounding box (also called an “envelope” in GIS) around each geometry; for the South Shore community area, that box contains the entire shape.
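To make the envelope idea concrete, here is a minimal sketch. It assumes a hypothetical permits table with a geom column, and the index name is made up; view_places is the table described later in this post. The second query shows how much of each envelope the actual shape fills: the lower the share, the more "empty" space the index treats as a possible match.

```sql
-- Create a spatial (GiST) index on the hypothetical permits table.
CREATE INDEX permits_geom_idx ON permits USING GIST (geom);

-- Compare each shape's area with the area of its envelope.
-- NULLIF avoids a division by zero for degenerate envelopes.
SELECT slug,
    ST_Area(geom) / NULLIF(ST_Area(ST_Envelope(geom)), 0) AS share_of_envelope_filled
FROM view_places
ORDER BY share_of_envelope_filled;
```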

The downtown Chicago area, however, is not even the largest shape I have. That belongs to the new Place, “Neighborhood Opportunity Fund investment zones” (NOF). Combined, they cover 75 square miles of Chicago. Downtown is only 7.7 square miles.

After I added the NOF map and tested its Place page, it “crashed” my server, metaphorically speaking. The query to just count the number of building permits in the area would take about five minutes.

There had to be a better way; in the meantime, however, I divided the NOF map into the West and South sections. This hardly improved the counting time.

The solution

Thankfully, today, I saw a tweet from Paul Ramsey linking to his blog that linked to his slides from a recent presentation about the use of PostgreSQL to store and manipulate GIS data.

In it he explained how the ST_Subdivide function works. I’m going to demonstrate it using graphics from my own maps.

A normal intersection comparison, using ST_Intersects(place_geometry, permits_geometry) in a query creates a bounding box (envelope) around each geometry and quickly determines whether the two envelopes overlap. If they do, then it checks again to see if the actual geometries overlap. If they do, that data is returned as a response to your query.
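The two stages can be observed separately with a sketch like the one below. The && operator is PostGIS's bounding-box overlap test; the permits table and the slug value are hypothetical. A large gap between the two counts means many rows pass the envelope check only to be rejected by the exact check.

```sql
-- Stage 1: rows whose envelopes overlap (the join condition).
-- Stage 2: rows whose actual geometries intersect (the FILTER).
SELECT
    count(*) AS envelope_matches,
    count(*) FILTER (WHERE ST_Intersects(place.geom, permit.geom)) AS exact_matches
FROM view_places AS place
JOIN permits AS permit
    ON place.geom && permit.geom
WHERE place.slug = 'south-shore';
```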

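When your two datasets are massive, like the NOF zones, which collectively cover one-third of Chicago, and the building permits, which are found across the entire city…well, many of the permits pass the envelope check and go on to the slower second check against a very large shape, and that led to the five-minute counting time.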

Enter ST_Subdivide. To use it properly you would run it against your existing geometry and store the much smaller shapes, derived from the big shape, in a new table. I applied the function to all the 22,203 maps that Chicago Cityscape has and stored their unique IDs and subdivided geometries in a new “lookup” table.

Now, any time I want to compare the building permits against the NOF, the building permits are instead compared to the small shapes that were subdivided.
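A count against the subdivided shapes might be sketched like this. The permits table, its id and geom columns, and the slug value are hypothetical; view_places_subdivided is the lookup table described in this post. DISTINCT is used because a point that sits exactly on the edge shared by two pieces could otherwise be counted twice.

```sql
-- Count permits inside a Place by comparing them to its subdivided pieces.
-- "permits", "permits.id", "permits.geom", and the slug value are hypothetical.
SELECT count(DISTINCT permit.id) AS permit_count
FROM view_places_subdivided AS piece
JOIN permits AS permit
    ON ST_Intersects(piece.geom, permit.geom)
WHERE piece.slug = 'nof-investment-zones';
```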

The query

Chicago Cityscape uses a single table (created as a materialized view) to combine all 22,203 maps. Each map is stored in a source table (for example, there’s a table to hold the 77 community areas) and the materialized view runs once a day to combine all of the maps in the source tables. This ensures our data is managed well: different source tables can hold different information, and the single table holds only the name, type, and geometry of the source tables, for faster comparison. Each entry in the single table also has a “slug”, its unique identifier.

Thus, the materialized view of the subdivided maps is created from the aforementioned single table, using this query:

create materialized view view_places_subdivided as
select gid || '_' || random() as gid, slug, st_subdivide(geom) as geom
from view_places;

The “gid” is designed to create a new unique ID field, as the slug field will be repeated for every subdivided piece of each map. A unique ID field, with a unique index on it, is necessary if you want to refresh the materialized view concurrently (to allow other queries to access the materialized view while it’s being refreshed).
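A sketch of the supporting statements follows; the index names are made up. PostgreSQL requires at least one unique index on a materialized view before it will refresh it concurrently, and the spatial index is what lets intersection queries use the small envelopes of the subdivided pieces.

```sql
-- Unique index required by REFRESH ... CONCURRENTLY (index name is hypothetical).
CREATE UNIQUE INDEX view_places_subdivided_gid_idx
    ON view_places_subdivided (gid);

-- Spatial index on the subdivided pieces (index name is hypothetical).
CREATE INDEX view_places_subdivided_geom_idx
    ON view_places_subdivided USING GIST (geom);

-- Rebuild the view without blocking queries that read from it.
REFRESH MATERIALIZED VIEW CONCURRENTLY view_places_subdivided;
```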

* The results are cached for a few hours, because the feature layers change 1-2 times per day and at different times each day, so the limited duration cache accommodates that. Ideally I would code a way to invalidate the cache when the feature layer data is updated.

Update 12/31/19: ST_Subdivide will fail if your geometries have geometry errors (I don’t know whether any kind of error makes the function fail, or only certain kinds). Chicago Cityscape has over 37,000 features that ST_Subdivide is attempting to process, and there is a lot of room for error in managing that many features from dozens of sources.
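One way to look for such features, as a sketch, is to list the geometries that PostGIS considers invalid along with the reason. PostGIS also has an ST_MakeValid function that attempts to repair invalid geometries; whether repairing them is enough for ST_Subdivide to succeed on this data is an assumption, not something confirmed here.

```sql
-- List features with invalid geometries and why they are invalid.
SELECT slug, ST_IsValidReason(geom) AS problem
FROM view_places
WHERE NOT ST_IsValid(geom);
```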

Tag: database performance

This is my process to improve the performance of my PostgreSQL database and queries

The genius of using ST_Subdivide to speed up PostGIS intersection comparisons
