Faster Postgres migrations using PeerDB, Part 1

PeerDB is an ETL/ELT tool built for PostgreSQL. It enables you to Stream Query Results from a source Postgres database to a target Postgres database. You can use this feature to efficiently migrate data from one Postgres database to another. This blog captures the benefits of using PeerDB for Postgres migrations. It also covers a tutorial to easily test PeerDB for this scenario. The scope of this blog is restricted to efficient offline data migrations. A blog on online migrations is coming soon! :)

Fast, flexible and resilient Postgres migrations

Here are some benefits of using PeerDB for Postgres migrations:

Blazing fast large table migrations - Postgres' native migration utility pg_dump/pg_restore is single-threaded at a single table level. This can slow down migrations of a Postgres database with a single large table and multiple small tables (star schema). PeerDB enables you to multi-thread (parallelize) the migration at a single table level. This can reduce migration times by orders of magnitude.
Transform data during migration - PeerDB enables you to migrate data based on any SELECT query on the source database. As a part of the migration process, you can restrict columns, filter rows, denormalize rows and so on. For example, you can mask a few columns from your production while migrating to the staging database.
Simultaneous reading from source and writing to target - pg_dump/pg_restore allows you to parallelize data migration across tables using -j flag. In this mode, it cannot pg_dump | pg_restore i.e. simultaneously read from the source and write to the target. PeerDB can parallelize the migration across tables, within a single table and still streams rows to the target, as they are read. This can drastically improve resource utilization and improve migration performance significantly.
Fault-tolerant and resilient - PeerDB is a fully-fledged ETL product built for Postgres. It has mechanisms in place for fault tolerance - state management, automatic retries, handling idempotency and consistency and so on. This ensures that the migration experience is highly resilient and guarded against failures. More details on how PeerDB ensures resiliency are captured in this blog.

5-step tutorial to test PeerDB for Postgres migrations

Now let us go through a quick tutorial to migrate data from a source Postgres database to a target Postgres database using PeerDB:

Setup PeerDB in 2 minutes

Setup PeerDB on any machine by following this Quick Start. It would take you less than 2 minutes.

Step 1: Generate Sample Data on Source

You can use pgbench utility to quickly generate data on your source Postgres database. The below command would create and populate 3 tables (pgbench_accounts, pgbench_tellers and pgbench_branches) with around 10GB of data. pgbench_accounts contributes to most of the size and is around ~9.9GB. The rest of the 2 tables are way smaller i.e. few KB.

pgbench -i -s 700 "port=5432 host=localhost password=postgres user=postgres dbname=source"

(You can change the above connection string using the actual credentials of your source database)

Step 2: Schema-only dump and restore on target

You can run the below command to create a schema of pgbench_accounts, pgbench_tellers and pgbench_branches on a target Postgres database

pg_dump --schema-only -t 'pgbench_*' "port=5432 host=localhost password=postgres user=postgres dbname=source" | psql "port=5432 host=localhost password=postgres user=postgres dbname=target"

Step 3: Create Source and Target Postgres Peers

Connect to peerdb and create both the source and target Postgres Peers. More details on the CREATE PEER command can be found here.

psql "port=9900 host=localhost password=peerdb"

CREATE PEER source_peer FROM POSTGRES WITH
(
    host = '172.17.0.1',
    port = '5432',
    user = 'postgres',
    password = 'postgres',
    database = 'source'
);

CREATE PEER target_peer FROM POSTGRES WITH
(
    host = '172.17.0.1',
    port = '5432',
    user = 'postgres',
    password = 'postgres',
    database = 'target'
);

Step 4: Data Migration using CREATE MIRROR command

Let us use PeerDB's CREATE MIRROR command to migrate the smaller tables pgbench_branches and pgbench_tellers This command gives multiple options, here I am setting the initial_copy_only to true. This ensures that the MIRROR stops immediately stops after data from the source table is migrated to the target table.


CREATE MIRROR migrate_pgbench_branches FROM
        source_peer TO target_peer FOR
$$ SELECT * FROM pgbench_branches $$
WITH (
        destination_table_name = 'public.pgbench_branches',
        initial_copy_only=true
);

CREATE MIRROR migrate_pgbench_tellers FROM
        source_peer TO target_peer FOR
$$
  SELECT * FROM pgbench_tellers
$$
WITH (
        destination_table_name = 'public.pgbench_tellers',
        initial_copy_only=true
);

-- in a few minutes the above tables should be migrated.
-- run below commands to quickly validate ther migration
SELECT count(*) FROM target_peer.pgbench_branches;
SELECT count(*) FROM target_peer.pgbench_branches;

Parallelized and blazing-fast migration of a large table

If a large table has an incremental int or a timestamp column (a.k.a. watermark_column), CREATE MIRROR can parallelize the migration of that table from source to target. You can specify the watermark_column, parallelism and num_rows_per_partition as inputs to the CREATE MIRROR command. Under the hood, PeerDB takes up all the heavy lifting to logically partition the table and stream data using multiple threads. This can reduce migration times by orders of magnitude. See the below

 CREATE MIRROR migrate_pgbench_accounts FROM
        source_peer TO target_peer FOR
$$
  SELECT * FROM pgbench_accounts WHERE aid BETWEEN {{.start}} AND {{.end}}
$$
WITH (
        destination_table_name = 'public.pgbench_accounts',
        initial_copy_only=true,
        parallelism = 8,
        watermark_column = 'aid',
        watermark_table_name = 'pgbench_accounts',
        num_rows_per_partition = 300000
);

Benchmarks

Part 2 of this blog will be benchmarking PeerDB at scale for migrating large Postgres databases. As a teaser, you can expect numbers captured in this blog (scroll to the bottom), that I wrote a few years ago. And yes, it was one of my favorite blogs :)

Hope you enjoyed reading the blog. Check out peerdb.io and get started with PeerDB to efficiently migrate your Postgres databases!

Faster Postgres migrations using PeerDB, Part 1

Fast, flexible and resilient Postgres migrations

5-step tutorial to test PeerDB for Postgres migrations

Setup PeerDB in 2 minutes

Step 1: Generate Sample Data on Source

Step 2: Schema-only dump and restore on target

Step 3: Create Source and Target Postgres Peers

Step 4: Data Migration using CREATE MIRROR command

Parallelized and blazing-fast migration of a large table

Benchmarks

Comments

More from this blog

Postgres CDC connector for ClickPipes is now in Private Preview

Postgres to ClickHouse: Data Modeling Tips

Enhancing Postgres to ClickHouse replication using PeerDB

ClickHouse acquires PeerDB for native Postgres CDC integration

PeerDB is now SOC 2 Type 2 Compliant

Command Palette

Fast, flexible and resilient Postgres migrations

5-step tutorial to test PeerDB for Postgres migrations

Setup PeerDB in 2 minutes

Step 1: Generate Sample Data on Source

Step 2: Schema-only dump and restore on target

Step 3: Create Source and Target Postgres Peers

Step 4: Data Migration using CREATE MIRROR command

Parallelized and blazing-fast migration of a large table

Benchmarks

Comments

More from this blog