Postgres Storage Optimization

Rishabh Jain · Published in Dev Genius · Oct 8, 2022


How can we optimize the storage of a Postgres database? Doing so ultimately helps optimize queries and contributes to overall database performance.

Most of the time, we simply keep adding storage to our Postgres heap when it seems full, because adding storage is cheap.

But is this the right way to handle increasing storage demand, or can we handle it gracefully without adding storage, decreasing cost and improving performance at the same time?

First, let's understand why storage keeps increasing. Two notable culprits:

  1. Unused Indexes
  2. Bloat

Let’s discuss both in detail.

Unused Indexes:

  1. Indexes were created during table creation and never got used.
  2. Indexes were created with the wrong index type (B-tree, hash, GIN), and we query those columns with operators the type does not support.
  3. We created an index and used it initially, but after a certain time we stopped using it, and it became a candidate for removal as an unused index.
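Case 2 can be illustrated with a hypothetical sketch (the `events` table and its column are made up for illustration): a B-tree index on a `jsonb` column is never chosen for the containment operator `@>`, which needs a GIN index.

```sql
-- hypothetical table, for illustration only
create table events (payload jsonb);

-- wrong index type: B-tree does not support the jsonb containment operator @>
create index events_payload_btree on events (payload);

-- this query cannot use the B-tree index above; a GIN index would serve it:
--   create index events_payload_gin on events using gin (payload);
select * from events where payload @> '{"type": "click"}';
```

The B-tree index here only supports equality and ordering comparisons on the whole `jsonb` value, so it sits unused while still consuming storage.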

Query to find unused indexes:

select
    relname,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelname::regclass)) as size
from pg_stat_all_indexes
where schemaname = 'public'
    and indexrelname not like 'pg_toast_%'
    and idx_scan = 0
    and idx_tup_read = 0
    and idx_tup_fetch = 0
    and indexrelname != 'SequelizeData_pkey'
order by pg_relation_size(indexrelname::regclass) desc;

idx_scan = number of index scans

size = total storage taken by index

idx_tup_read = number of index entries returned by index scan

idx_tup_fetch = number of live table rows returned by simple index scan

We can't blindly remove all the unused indexes: they might be used later, or Postgres might be showing stale statistics.

Note: It is advisable to reset the index stats regularly to get a better view of which indexes are actually used. To reset, we first need to find the object identifier (oid) of the table and then reset its counters.

SELECT oid FROM pg_class c WHERE relname = 'test';        -- test is the table name
SELECT pg_stat_reset_single_table_counters(oid_fetched);  -- reset to get fresh stats

Bloat:

Postgres uses MVCC (Multiversion Concurrency Control), which creates a new row version for every update. This enables concurrent transactions (older transactions keep access to the older version of the data until they complete), but at the same time it adds a lot of stale storage to the system. At any moment only one version of a row is live; all the others remain as dead (stale) rows.
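A minimal sketch of this behaviour (the table name is made up): updating a row creates a new physical row version, which we can observe through the hidden `ctid` column that records a tuple's physical location.

```sql
create table mvcc_demo (id int);
insert into mvcc_demo values (1);
select ctid from mvcc_demo;   -- e.g. (0,1): first tuple on page 0

update mvcc_demo set id = 2;
select ctid from mvcc_demo;   -- a different ctid: a new tuple was written,
                              -- and the old one is now a dead row
```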

Bloat also exists in indexes. Suppose you update a tuple (row) in a table that has 3 indexed columns. When a new tuple is created for the update (and the update touches all three indexed columns), a corresponding new entry is created in all three indexes, and the earlier entries are marked stale.

Note: Postgres has a HOT (Heap Only Tuple) optimization: when an update changes no indexed column and the new tuple fits on the same page as the old one, no new index entries are created at all. The existing index entries keep pointing at the old tuple, which in turn links to the new version, so none of the indexes grow.
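A sketch of how HOT updates can be observed (the `hot_demo` table and its columns are illustrative): `pg_stat_user_tables` counts them in `n_tup_hot_upd`.

```sql
-- id is indexed (primary key); note is not indexed
create table hot_demo (id int primary key, note text);
insert into hot_demo values (1, 'a');

update hot_demo set note = 'b';   -- no indexed column changed: HOT-eligible

select n_tup_upd, n_tup_hot_upd
from pg_stat_user_tables
where relname = 'hot_demo';
```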

Bloat only accumulates in the database when rows (tuples) are created faster than dead tuples are cleaned up. Also, although autovacuum (a background process) removes dead records from the data files, the storage is still not released, and we need defragmentation to reclaim it.

More Bloat results in more I/O which ultimately degrades query performance.

One of the simple ways to view bloat:

CREATE EXTENSION pgstattuple;
select * from pgstattuple('sales');   -- sales is the table name

Now I will update/delete some rows and look at the result again. We can see the dead tuple count and the corresponding dead tuple length.
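For example (assuming a hypothetical integer column `amount` on `sales`):

```sql
-- amount is a hypothetical column, for illustration
update sales set amount = amount + 1;

select dead_tuple_count,
       pg_size_pretty(dead_tuple_len) as dead_size
from pgstattuple('sales');
```

After the update, every old row version shows up in `dead_tuple_count`, and `dead_tuple_len` tells us how much space those dead rows occupy.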

From the discussion above, we know that bloat is created on:

  1. Tables
  2. Indexes

Table bloat and index bloat can be estimated easily with queries written by some awesome people. Simply copy and run them, and we get all the data.

Git Link: https://github.com/ioguix/pgsql-bloat-estimation/blob/master/table/table_bloat.sql

Table Bloat Demo:

psql blog;   -- blog is the db name
create table bloat_demo (id int) with (autovacuum_enabled = off);
insert into bloat_demo select * from generate_series(1, 8000000);
select pg_size_pretty(pg_relation_size('bloat_demo'));

Now update all the rows and check the size taken by the table; we will see that it has doubled.

update bloat_demo set id = id + 100;select pg_size_pretty(pg_relation_size('bloat_demo'));

Now run VACUUM on the bloat_demo table to see whether the extra storage gets cleared.
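The step above, spelled out (per the result described next, the reported size stays the same):

```sql
vacuum bloat_demo;
select pg_size_pretty(pg_relation_size('bloat_demo'));   -- still the doubled size
```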

Interestingly, no storage is cleared by the vacuum.

Reason: as discussed above, VACUUM removes the dead (stale) rows from the table, but their storage is not reclaimed, because the data files still contain that space. Instead, VACUUM marks the freed space inside the storage files as available for reuse.

Now let's check whether this theory is correct, i.e. whether the vacuumed free space is really available for reuse. To validate this, run the update command again and see whether additional space is taken by the table.

We can see from the screenshot below that no extra space is taken by the table, as the update command used the space inside the files made available by the vacuum.

Now that we have used up the vacuumed space, let's update again and see whether extra space is taken by the table.

An extra 277 MB is taken by the table, as 8 million (80 lakh) new rows are created and all the earlier rows are marked dead.

Ways to Solve Index Bloat Problem:

  1. Delete the existing index and recreate it. This does the job cleanly, but some queries running in the meantime will be affected.
  2. Reindex the existing index. This removes the bloated space, but the index stays non-operational (hung) until reindexing is entirely done, making the database inefficient for the duration of the rebuild. We can avoid this by reindexing concurrently, which won't block queries.
REINDEX INDEX sales;   --->   REINDEX INDEX CONCURRENTLY sales;
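Option 1 can likewise be done with the concurrent variants so that queries keep working (the index and column names here are illustrative):

```sql
-- drop and recreate without blocking reads/writes (names are illustrative)
drop index concurrently sales_amount_idx;
create index concurrently sales_amount_idx on sales (amount);
```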

The concurrent reindex creates a new index in the database with the suffix _ccnew and keeps track of all table changes while it builds. When the new index is fully built, the old one is dropped and the table starts using the newly created one.

Drawbacks of reindexing concurrently:

If something unexpected comes up and the rebuild process is stopped, or it fails due to some issue, a partially built invalid index is left in the system and eats additional space. These invalid indexes need to be deleted manually.

Query to find invalid indexes:

SELECT
    c.relname as index_name,
    pg_size_pretty(pg_relation_size(c.oid))
FROM pg_index i
JOIN pg_class c ON i.indexrelid = c.oid
WHERE c.relname LIKE '%_ccnew'
    AND NOT indisvalid;
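Each invalid index found this way can then be dropped (the index name below is a hypothetical example):

```sql
-- hypothetical leftover index from an interrupted concurrent reindex
drop index concurrently sales_amount_idx_ccnew;
```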

Ways to solve Table Bloat:

  1. Vacuum Full: This reclaims all the space used by bloat and stale tuples, but it locks the table (it hangs) for the duration, which makes it unsuitable for production use. The only time we can use it is during application downtime taken for maintenance.
VACUUM FULL bloat_demo;

We can see from the table size that 553 MB has been freed by VACUUM FULL.

2. Pg_repack: This is a production-ready solution that helps clear bloat and stale data without any downtime. The extension has to be installed and created first, as it does not ship with Postgres by default.

CREATE EXTENSION pg_repack;

How does pg_repack work?

It creates a new table, as we saw with reindexing, copies all the data from the old table to the new one, and then rebuilds the indexes. After the full copy is done, the new table becomes the main one and the old one is dropped.

pg_repack -k --table sales_info sales
# sales_info = table name, sales = database name

Shortcomings of pg_repack:

  1. During new-table creation it requires double the storage, since the whole table is copied, so we need to check storage availability before starting to repack a table.
  2. If the repacking process is stopped midway, we have to delete the intermediate table manually.

Pg_repack is still the best solution for clearing bloats.



Full Stack Developer — React, Node, Mongo DB, Postgres, RabbitMQ, AWS, Native Performance Engineering, Lambda, Javascript, Kubernetes, Docker.