Tag: PostgreSQL

Creating a PostgreSQL PostGIS function to get around a DataTables Editor limitation

DataTables is a fantastic piece of software that turns any HTML table into an interactive and editable one.

Chicago Cityscape uses DataTables on pretty much every page. DataTables also provides server-side processing for PHP to grab the right data from a database and show it to the user, 10 records at a time (the user can change how many records are shown at once).

Screenshot showing my new function, using the function, and the results.

Problem

One of the problems I’ve had to work around is that the DataTables Editor script recognizes only one function per field in a SELECT statement. If a second function is manipulating a field in the SELECT statement, the alias won’t work and the JavaScript will not be able to read and show the data properly.

I almost always use the two functions ST_AsGeoJSON and ST_Transform on my geographic data to (1) transform it from SRID 3435 to 4326 for display on web Mercator maps, and (2) convert the GIS data into GeoJSON strings for easy transfer to the Leaflet JavaScript library.
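
Here’s a minimal sketch of the kind of query that trips it up (the “places” table is hypothetical):

SELECT id, name, ST_AsGeoJSON(ST_Transform(geom, 4326), 5) AS geom
FROM places;

The two nested functions behind the single “geom” alias are what the Editor libraries can’t parse.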

This is a known issue – see this thread on the DataTables forums – that Allan, the DataTables creator, has acknowledged and provided an alternative solution for.

Solution

It turns out that it’s easy to write a function that combines both functions.

-- Requires PostgreSQL 14 or newer for the SQL-standard function body syntax
CREATE OR REPLACE FUNCTION ST_TAGJP (field geometry, srid int = 4326, simplify int = 5)
  RETURNS text
  LANGUAGE SQL
RETURN ST_AsGeoJSON(ST_Transform(field, srid), simplify);

The code above defines a function I call ST_TAGJP that combines the two functions I already described. It’s flexible: the arguments let you specify the geometry field, the SRID to transform (reproject) to, and the “simplify” value passed to ST_AsGeoJSON, which tells it how many decimal places to use in coordinates.
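
With the new function, each field in the Editor’s SELECT statement needs only one function call, so the alias works (same hypothetical “places” table as above):

SELECT id, name, ST_TAGJP(geom) AS geom
FROM places;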

Avoid “VARCHAR” columns in PostgreSQL [PGSQL Phriday #006]

This is a response to Grant.

This is one thing I wish I had known while learning PostgreSQL: don’t ever use the VARCHAR column type. I have never encountered a situation where it would have been superior to using TEXT.

The problem with the VARCHAR column type is that it can carry a maximum length that doesn’t allow text longer than a certain number of characters. This maximum character length can be introduced by the person who creates the table and adds the column.

The maximum character length can also be introduced by a GIS program (specifically, QGIS and the ogr2ogr command line tool), as well as by some online tools that convert spreadsheets to SQL and even write the table definition and “CREATE TABLE” statement. In each case, the program tries to guess the column type and seems to default to VARCHAR(n) instead of TEXT.

The maximum character length is a problem because trying to insert more text than allowed into that field will cause an error, and the INSERT or UPDATE statement will fail. I also think it’s a bad idea to create these columns because it makes it expensive to change your mind later about what length of text will go into that field.
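
Here’s a minimal sketch of the failure and the fix (the “stores” table is hypothetical):

-- A length-limited column, as a GIS tool might create it
CREATE TABLE stores (name varchar(10));

-- Fails with: ERROR: value too long for type character varying(10)
INSERT INTO stores (name) VALUES ('A name that is too long');

-- Changing the column to text removes the limit
ALTER TABLE stores ALTER COLUMN name TYPE text;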

Other things I wish I knew earlier

  • Using transactions to test your query and its effect on your data, so that you can easily undo any changes (see the sketch after this list).
  • How and when to use indexes beyond the “btree” index type (for example, “gin” for array and JSONB columns).
  • Add “time_created_at” and “time_updated_at” columns to basically every dataset. This makes it easy to sort things later; when creating a new dataset it’s not always obvious how you might want to display or manipulate the data until days, weeks, or months later. Sorting data by “when was the information changed last” is a common use case for me.
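
Here’s a minimal sketch of the first and third items (the “listings” table is hypothetical):

-- Timestamp columns on a new table
CREATE TABLE listings (
  id serial PRIMARY KEY,
  status text,
  time_created_at timestamptz DEFAULT now(),
  time_updated_at timestamptz DEFAULT now()
);

-- Test a destructive change inside a transaction so it's easy to undo
BEGIN;
UPDATE listings SET status = 'archived' WHERE time_created_at < '2020-01-01';
-- inspect the results here; ROLLBACK discards the change, COMMIT keeps it
ROLLBACK;

Keeping “time_updated_at” current on every UPDATE takes a trigger, which I’ve left out of this sketch.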

How to visualize the density of point data in a grid

A common way to show the distribution of places (like grocery stores) is to use a heat map. The map will appear “hotter” where there are many grocery stores and “colder” where there are few grocery stores. This kind of map can be useful to show gaps in distribution or a neighborhood that has a lot of grocery stores.

One issue with that kind of heat map is that the coverage areas change their shape and color if you zoom in, since the algorithm that clusters or determines what’s “nearby” or dense has fewer locations to analyze.

I prefer to use grids in the shape of square tiles, since Chicago is a grid-oriented city and the vast majority of our streets and routes run along east-west and north-south lines. The map above shows the locations of subjects and topics of news articles in the Chicago Cityscape database.

I use PostGIS to set up most of my spatial data before visualizing it in QGIS.

This tutorial shows the two steps to using PostGIS to (1) create a grid based on an existing area (polygon), and (2) assign each point location to a tile in that grid and count the number of locations in each tile.

If you don’t have PostGIS installed, you should install it if you work with spatial data a lot. It is much, much faster at performing most of the tasks you use QGIS or ArcGIS to perform. Both QGIS and ArcGIS can read and edit data stored in PostGIS.

Additionally, QGIS has a function that can create grids, and another that can do comparisons by location and count or summarize, so all of this can be done without PostGIS.

For this tutorial, you will need a single polygon with which to create the grid. I used the boundary of the City of Chicago limits.

1. Create a grid based on an existing area

1.a. Add a new function to PostGIS

To create a grid, you need a function that draws the tiles based on the polygon. I got this from The Spatial Database Advisor.

-- Create required type
DROP   TYPE IF EXISTS T_Grid CASCADE;
CREATE TYPE T_Grid AS (
  gcol  int4,
  grow  int4,
  geom geometry
);
-- Drop the function if it exists
DROP FUNCTION IF EXISTS ST_RegularGrid(geometry, NUMERIC, NUMERIC, BOOLEAN);
-- Now create the function
CREATE OR REPLACE FUNCTION ST_RegularGrid(p_geometry   geometry,
                                          p_TileSizeX  NUMERIC,
                                          p_TileSizeY  NUMERIC,
                                          p_point      BOOLEAN DEFAULT TRUE)
  RETURNS SETOF T_Grid AS
$BODY$
DECLARE
   v_mbr   geometry;
   v_srid  int4;
   v_halfX NUMERIC := p_TileSizeX / 2.0;
   v_halfY NUMERIC := p_TileSizeY / 2.0;
   v_loCol int4;
   v_hiCol int4;
   v_loRow int4;
   v_hiRow int4;
   v_grid  T_Grid;
BEGIN
   IF ( p_geometry IS NULL ) THEN
      RETURN;
   END IF;
   v_srid  := ST_SRID(p_geometry);
   v_mbr   := ST_Envelope(p_geometry);
   v_loCol := trunc((ST_XMIN(v_mbr) / p_TileSizeX)::NUMERIC );
   v_hiCol := CEIL( (ST_XMAX(v_mbr) / p_TileSizeX)::NUMERIC ) - 1;
   v_loRow := trunc((ST_YMIN(v_mbr) / p_TileSizeY)::NUMERIC );
   v_hiRow := CEIL( (ST_YMAX(v_mbr) / p_TileSizeY)::NUMERIC ) - 1;
   FOR v_col IN v_loCol..v_hiCol LOOP
     FOR v_row IN v_loRow..v_hiRow LOOP
         v_grid.gcol := v_col;
         v_grid.grow := v_row;
         IF ( p_point ) THEN
           v_grid.geom := ST_SetSRID(
                             ST_MakePoint((v_col * p_TileSizeX) + v_halfX,
                                           (v_row * p_TileSizeY) + v_halfY),
                             v_srid);
         ELSE
           v_grid.geom := ST_SetSRID(
                             ST_MakeEnvelope((v_col * p_TileSizeX),
                                             (v_row * p_TileSizeY),
                                             (v_col * p_TileSizeX) + p_TileSizeX,
                                             (v_row * p_TileSizeY) + p_TileSizeY),
                             v_srid);
         END IF;
         RETURN NEXT v_grid;
     END LOOP;
   END LOOP;
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE
  COST 100
  ROWS 1000;

The ST_RegularGrid function works in the same projection as your source data.

1.b. Create a layer that has all of the tiles for just the Chicago boundary

-- This creates square tiles of 1,320 feet per side (a 2x2 block size in Chicago)
SELECT gcol, grow, geom
INTO b_chicagoboundary_grid_1320squared
FROM ST_RegularGrid((SELECT geom FROM chicagoboundary WHERE gid = 1), 1320, 1320, FALSE);

In that query, “1320” is a distance in feet along both the X and Y axes, since the “chicagoboundary” geometry is projected in Illinois StatePlane FIPS East (Feet) (EPSG/SRID 3435).
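
If your boundary were stored in a geographic SRID like 4326 (degrees) instead, you could reproject it on the fly and still specify the tile size in feet (a sketch using the same tables as above):

SELECT gcol, grow, geom
INTO b_chicagoboundary_grid_1320squared
FROM ST_RegularGrid((SELECT ST_Transform(geom, 3435) FROM chicagoboundary WHERE gid = 1), 1320, 1320, FALSE);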

2. Assign each point location to a tile in that grid and count the number of locations in each tile

Now you’ll need a table that has POINT-type geometries in it. For the map in this tutorial, I used a layer of location-based news articles that are used in Chicago Cityscape to highlight local developments.

-- The grid table from step 1.b has no "id" column yet, so add a primary key first
ALTER TABLE b_chicagoboundary_grid_1320squared ADD COLUMN id serial PRIMARY KEY;

SELECT grid.id, count(*) AS count, grid.geom
INTO news_grid
FROM news_articles, b_chicagoboundary_grid_1320squared AS grid
WHERE ST_Intersects(news_articles.geom, grid.geom)
GROUP BY grid.id;

This query will result in a table with three columns:

  1. The ID of the tile, which is a primary key field.
  2. The number of news articles in that tile.
  3. The POLYGON geometry of that tile.
Look at these two maps (the one above and the one below). The first map shows the whole city. The tiles are colored according to the number of news articles within the area of each tile: the darker the blue, the more news articles in that tile.

This map is zoomed in to the Woodlawn area. As you change scale (zoom in or out), the size of the “heat” area (the size of each tile) doesn’t change – the tiles are still 1,320 feet by 1,320 feet. The colors don’t change either. The typical heat map doesn’t have these advantages.

This is my process to improve the performance of my PostgreSQL database and queries

Every now and then I’ll run the queries below to help me find ways to optimize the queries my application uses, and to figure out whether certain tables and views (relations) need attention.

Updated January 18, 2020, to add a query to alter the autovacuum and autoanalyze factors on a per-table basis, because I have a “cache” table that is updated quite frequently and generates many dead tuples every minute. The table would balloon from 250 MB to 15 GB in a few hours. I hope that reducing these factors to a simple threshold of 100 dead tuples will prevent the inflation. I also need to monitor it to make sure it doesn’t make my RDS database work too hard.

/* Show the last time at which every table and view was vacuumed and analyzed */
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze  
FROM pg_stat_all_tables
WHERE schemaname = 'public';  

/* See which queries are most popular; requires the pg_stat_statements extension */
SELECT * FROM pg_stat_statements WHERE calls >= 1000; /* replace 1000 with any integer, depending on how actively your database is queried */
SELECT pg_stat_statements_reset(); /* reset the statistics to start fresh */

/* Kill a specific query */
select pg_terminate_backend(INTEGER); /* replace INTEGER with the process ID of a specific query */

/* Get the size of the database */
select pg_size_pretty(pg_database_size('DATABASE_NAME')); /* replace DATABASE_NAME with the name of your database */

/* Kill queries that are taking too long; replace DATABASE_NAME with the name of your database */
SELECT query, pg_terminate_backend(pid) 
    FROM pg_stat_activity 
    WHERE datname = 'DATABASE_NAME' 
      AND pid <> pg_backend_pid() 
      AND (state = 'active' 
      OR state = 'idle') 
      AND state_change < current_timestamp - INTERVAL '15' MINUTE; /* change 15 to any integer that you think means a query is taking too long */

/* See which queries are running */
select * from pg_stat_activity;
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL AND backend_type = 'client backend';

/* Find very large indexes and indexes that aren't used much */
WITH table_scans as ( 
    SELECT relid, 
        tables.idx_scan + tables.seq_scan as all_scans, 
        ( tables.n_tup_ins + tables.n_tup_upd + tables.n_tup_del ) as writes, 
                pg_relation_size(relid) as table_size 
        FROM pg_stat_user_tables as tables 
),
all_writes as ( 
    SELECT sum(writes) as total_writes 
    FROM table_scans 
),
indexes as ( 
    SELECT idx_stat.relid, idx_stat.indexrelid, 
        idx_stat.schemaname, idx_stat.relname as tablename, 
        idx_stat.indexrelname as indexname, 
        idx_stat.idx_scan,
        pg_relation_size(idx_stat.indexrelid) as index_bytes, 
        indexdef ~* 'USING btree' AS idx_is_btree 
    FROM pg_stat_user_indexes as idx_stat 
        JOIN pg_index 
            USING (indexrelid) 
        JOIN pg_indexes as indexes 
            ON idx_stat.schemaname = indexes.schemaname 
                AND idx_stat.relname = indexes.tablename 
                AND idx_stat.indexrelname = indexes.indexname 
    WHERE pg_index.indisunique = FALSE 
),
index_ratios AS ( 
SELECT schemaname, tablename, indexname, 
    idx_scan, all_scans,
    round(( CASE WHEN all_scans = 0 THEN 0.0::NUMERIC 
        ELSE idx_scan::NUMERIC/all_scans * 100 END),2) as index_scan_pct, 
    writes,
    round((CASE WHEN writes = 0 THEN idx_scan::NUMERIC ELSE idx_scan::NUMERIC/writes END),2) 
        as scans_per_write, 
    pg_size_pretty(index_bytes) as index_size, 
    pg_size_pretty(table_size) as table_size, 
    idx_is_btree, index_bytes
    FROM indexes 
    JOIN table_scans 
    USING (relid) 
),
index_groups AS ( 
SELECT 'Never Used Indexes' as reason, *, 1 as grp 
FROM index_ratios 
WHERE
    idx_scan = 0
    and idx_is_btree 
UNION ALL 
SELECT 'Low Scans, High Writes' as reason, *, 2 as grp 
FROM index_ratios 
WHERE
    scans_per_write <= 1
    and index_scan_pct < 10 
    and idx_scan > 0 
    and writes > 100 
    and idx_is_btree 
UNION ALL 
SELECT 'Seldom Used Large Indexes' as reason, *, 3 as grp 
FROM index_ratios 
WHERE
    index_scan_pct < 5
    and scans_per_write > 1 
    and idx_scan > 0 
    and idx_is_btree 
    and index_bytes > 100000000 
UNION ALL 
SELECT 'High-Write Large Non-Btree' as reason, index_ratios.*, 4 as grp 
FROM index_ratios, all_writes 
WHERE
    ( writes::NUMERIC / ( total_writes + 1 ) ) > 0.02 
    AND NOT idx_is_btree 
    AND index_bytes > 100000000 
ORDER BY grp, index_bytes DESC ) 
SELECT reason, schemaname, tablename, indexname, 
    index_scan_pct, scans_per_write, index_size, table_size
FROM index_groups; 

/* Modify the autovacuum and autoanalyze factors for a single table 
https://klotzandrew.com/blog/posgres-per-table-autovacuum-management */
ALTER TABLE cache SET (autovacuum_analyze_scale_factor = 0, autovacuum_analyze_threshold = 100);

/* Show every table's dead tuples alongside its effective autovacuum and autoanalyze thresholds */
WITH raw_data AS (
  SELECT
    pg_namespace.nspname,
    pg_class.relname,
    pg_class.oid AS relid,
    pg_class.reltuples,
    pg_stat_all_tables.n_dead_tup,
    pg_stat_all_tables.n_mod_since_analyze,
    (SELECT split_part(x, '=', 2) FROM unnest(pg_class.reloptions) q (x) WHERE x ~ '^autovacuum_analyze_scale_factor=' ) as c_analyze_factor,
    (SELECT split_part(x, '=', 2) FROM unnest(pg_class.reloptions) q (x) WHERE x ~ '^autovacuum_analyze_threshold=' ) as c_analyze_threshold,
    (SELECT split_part(x, '=', 2) FROM unnest(pg_class.reloptions) q (x) WHERE x ~ '^autovacuum_vacuum_scale_factor=' ) as c_vacuum_factor,
    (SELECT split_part(x, '=', 2) FROM unnest(pg_class.reloptions) q (x) WHERE x ~ '^autovacuum_vacuum_threshold=' ) as c_vacuum_threshold,
    to_char(pg_stat_all_tables.last_vacuum, 'YYYY-MM-DD HH24:MI:SS') as last_vacuum,
    to_char(pg_stat_all_tables.last_autovacuum, 'YYYY-MM-DD HH24:MI:SS') as last_autovacuum
  FROM
    pg_class
  JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
    LEFT OUTER JOIN pg_stat_all_tables ON pg_class.oid = pg_stat_all_tables.relid
  WHERE
    n_dead_tup IS NOT NULL
    AND nspname NOT IN ('information_schema', 'pg_catalog')
    AND nspname NOT LIKE 'pg_toast%'
    AND pg_class.relkind = 'r'
), data AS (
  SELECT
    *,
    COALESCE(raw_data.c_analyze_factor, current_setting('autovacuum_analyze_scale_factor'))::float8 AS analyze_factor,
    COALESCE(raw_data.c_analyze_threshold, current_setting('autovacuum_analyze_threshold'))::float8 AS analyze_threshold,
    COALESCE(raw_data.c_vacuum_factor, current_setting('autovacuum_vacuum_scale_factor'))::float8 AS vacuum_factor,
    COALESCE(raw_data.c_vacuum_threshold, current_setting('autovacuum_vacuum_threshold'))::float8 AS vacuum_threshold
  FROM raw_data
)
SELECT
  relid,
  nspname,
  relname,
  reltuples,
  n_dead_tup,
  ROUND(reltuples * vacuum_factor + vacuum_threshold) AS v_threshold,
  n_mod_since_analyze,
  ROUND(reltuples * analyze_factor + analyze_threshold) AS a_threshold,
  c_analyze_factor as caf,
  c_analyze_threshold as cat,
  c_vacuum_factor as cvf,
  c_vacuum_threshold as cvt,
  analyze_factor as af,
  analyze_threshold as at,
  vacuum_factor as vf,
  vacuum_threshold as vt,
  last_vacuum,
  last_autovacuum
FROM
  data
ORDER BY n_dead_tup DESC;

Pick the lowest and highest numbers in an array of numbers in PostgreSQL

Solving this problem took longer than it should have. I thought there would be a built-in function in PostgreSQL to pick the lowest (smallest or minimum) and highest (largest or maximum) numbers in an ARRAY of numbers.

LEAST and GREATEST didn’t work, since those work on expressions, not arrays.

MIN and MAX don’t work either, because they are aggregate functions: they operate across rows, not across the elements of an array within a single row.
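
As an aside, the aggregate route can be made to work by unnesting the array into rows inside a scalar subquery. A minimal sketch:

SELECT (SELECT min(x) FROM unnest('{45.04,124.90}'::numeric[]) AS t(x));

That returns 45.04, but it’s a one-off expression rather than a reusable function.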

Of course I found the solution on StackOverflow, but only after a lot of searching and trying other potential solutions.

Here it is!

Given an array of numbers, pick the lowest and highest ones using two custom functions.

CREATE OR REPLACE FUNCTION small(anyarray, int)
  RETURNS anyelement AS $$
    SELECT (ARRAY(SELECT unnest($1) ORDER BY 1 asc))[$2]
  $$ LANGUAGE sql;

The second argument in this function extracts the Nth smallest number. In my case I want the smallest number, so I set “1” as the second argument.

Example array in PostgreSQL:

{45.04,124.90,45.04,124.90}

Example query:

SELECT small('{45.04,124.90,45.04,124.90}'::numeric[], 1)

Output: 45.04

You can create a companion function, which I call large, to pick the Nth largest number by changing the “ORDER BY 1 asc” to “ORDER BY 1 desc” (reversing the order of the array’s unnesting).

CREATE OR REPLACE FUNCTION large(anyarray, int)
  RETURNS anyelement AS $$
    SELECT (ARRAY(SELECT unnest($1) ORDER BY 1 desc))[$2]
  $$ LANGUAGE sql;

Example array in PostgreSQL:

{45.04,124.90,45.04,124.90}

Example query:

SELECT large('{45.04,124.90,45.04,124.90}'::numeric[], 1)

Output: 124.9