Oh index, where art thou?

Problem

We recently noticed some curious behavior around indices seemingly popping in and out of existence on our Postgres data warehouse tables that are built by dbt (i.e. any tables in postgres prefixed with dw_r_ ).

Fig 1. pganalyze table statistics for public.dw_r_orders_sold_by_day. Index size fluctuates significantly over time.

Around the same time, we noticed an increasing number of New Slow Query alerts from pganalyze. These alerts seemed to be largely stemming  from performance issues on dbt-generated data warehouse tables where indices were missing. In order to remedy these performance issues, we  needed to understand what was causing this odd indexing behavior.

After digging into the various materialization strategies that dbt deploys, we were able to isolate the underlying query patterns pertaining to each  strategy. The strategies that dbt deploys to build data warehouse tables largely fall into three buckets: Scratch, Incremental, and Full Refresh.

Scratch

dbt will build a table “from scratch” when it sees that the table does not currently exist in the data warehouse. This strategy is slightly different  than the “full refresh” strategy that we will go into detail on later. When dbt builds a table “from scratch”, it uses a CREATE TABLE AS .. statement to build the table.

Once the table is created, we issue dbt post-hooks to create indices on the table that was just built. The thing to remember is that in postgres, an  index is namespaced to the schema of the table it is applied on. This means that within a postgres schema, there can only be a single index  with a given name (this will be important later). So far so good, everything works as expected.

Incremental

If dbt sees that a table already exists in the data warehouse, it will attempt to build the table “incrementally”. In practice, this means that instead of  creating the table from scratch, it will simply insert new data and delete old data.

This materialization strategy helps us keep our dbt builds fast and lean. Additionally, our index creation post-hooks are still triggered after these D ELETE and INSERT statements are run, but the majority of the time they are no-ops since the indices likely already exist on the table. Unless  they don’t…

Full Refresh

Periodically, we will tell dbt to “full refresh” a given table. This is effectively identical to telling dbt to rebuild a table from scratch, regardless if it  already exists in the data warehouse or not. This strategy is useful for instances where the underlying model schema or business logic has  changed and we need to go back and re-aggregate old records. The way dbt performs a Full Refresh under the hood is slightly different from the  way that it created tables from scratch, and leads to rather interesting side effects.

The first thing dbt does during a full refresh is rename the existing table to create a “backup” version of the table. It does this by issuing an ALTER TABLE .. RENAME command and suffixes the table with __dbt_backup . However, an interesting side effect of this is that any existing  indices on the original table travel with the table during a rename.

This is an important point: the indices are now applied to a “backup” table, but as far as postgres is aware, these indices still exist in the  schema. After the backup table creation, dbt will create the table from scratch:

However, when our index creation post-hooks run, postgres sees that the desired indices already exist in the schema, and the operation  becomes a no-op. Postgres is completely unaware that the indices exist, but are now applied to the wrong table. Finally, as the last step of the  full refresh materialization strategy, dbt drop cascades the backup table once it has confirmed that all other operations were successful.

This DROP TABLE .. CASCADE statement on the backup table ends up cascading the indices with it. After this operation is complete, the  indices won’t exist in the schema at all. This means that on subsequent incremental runs, the indices can and will be applied to the correct  table. However, this also implies that whenever a model undergoes a “full refresh” build, there will be a period where indices do not exist  on the table.

Solution (#70)

To remedy this, we made a small change to the way that Full Refresh builds work. We now run DROP INDEX IF EXISTS statements as pre hooks to the table build, but these hooks will only run if the build is Full Refresh. This prevents us from unnecessarily dropping indices  during other types of materialization strategies (e.g. from scratch and incremental). Since we are now dropping any indices on the table prior to it  being renamed by dbt, this means that these indices will no longer travel with the table during the renaming process. Thus, when we go  to create the indices during the post-hook process, the indices will be applied to the correct table as expected

Additionally, we added a few custom schema tests that will help us retain confidence that the expected number of indices are always applied to a  model after it is built by dbt. These tests will be run in dbt cloud immediately after each model builds. If the tests fail, we will immediately be  alerted to the issue.

David Wallace

David Wallace