Optimizing Postgres Replication using Column Exclusion

Optimizing Postgres Replication using Column Exclusion

At PeerDB we are building a fast, simple and feature-rich experience to replicate data from Postgres to Data Warehouses, Queues and Storage. With this goal, it was a given that we enable users to replicate data from a subset of columns of a table and exclude data from the rest.

This blog will cover different use cases that require Column Exclusion while replicating data from Postgres. It will also get into how you can accomplish this with PeerDB.

Why exclude columns while replicating data from Postgres?

Column Exclusion was a common feature request from our customers. Below are a few use cases where they needed this feature:

  1. Data Security and Privacy - Excluding sensitive information like personally identifiable (PII) data or proprietary business information from replication helps maintain data security and confidentiality.

  2. Data Minimization: In many scenarios, not all columns are required for analysis or downstream processing. Through column exclusion, you can tailor replication to suit your exact needs avoiding unnecessary data transfer. This helps with:

    1. Improve Performance: Transmitting only necessary columns streamlines data transfer, leading to improved replication performance. This is particularly significant when dealing with large tables or limited network bandwidth.

    2. Save costs: Sending only needed columns cuts down on network usage, reduces the compute needed for processing rows, and saves storage on the destination. This leads to overall cost savings.

PeerDB for excluding columns during Postgres replication

We recently added the Column Exclusion feature to enable users to pick and choose columns while replicating tables from Postgres. You can do this as a part of the Change Data Capture (CDC) / WAL-based replication.

Supported Targets

Column Exclusion extends to all the supported targets incl. Data Warehouses such as Snowflake, BigQuery and Postgres; Queues such as Azure Event Hubs and Storage such as S3 and Google Cloud Storage.

You can either use PeerDB's UI or the SQL Layer to include and exclude columns while creating the MIRROR (a.k.a. replication).

Column Exclusion through PeerDB UI

PeerDB provides a simple-to-use UI to create and manage MIRRORs. While creating CDC / WAL-based MIRRORs, you can choose the specific columns per table, that you'd want to replicate using checkboxes. The below gif captures how to exclude columns while replicating data from Postgres to Snowflake.

Column Exclusion through SQL Layer

PeerDB provides a Postgres-compatible SQL Layer to easily create and manage replication from Postgres. We improved the CREATE MIRROR command to enable users to exclude columns. The below script shows an example command to exclude PII data such as email, height and weight while replicating user_details table from Postgres to Snowflake

CREATE MIRROR cdc_replicate_users FROM postgres_peer TO snowflake_peer
 WITH TABLE MAPPING (
    {
        from:public.user_details,
        to:public.user_details,
        exclude:[email, height, weight]
    }
  ) WITH (
      do_initial_copy = true,
      snapshot_sync_mode='avro',
      snapshot_num_rows_per_partition = 500000,
      snapshot_max_parallel_workers = 4,
      snapshot_num_tables_in_parallel = 4,
      snapshot_staging_path = ''
  );

Conclusion

If you have a use case that requires excluding columns while replicating data from Postgres and want to give PeerDB a try, these links should prove useful: :)

  1. Quickstart

  2. PeerDB's Github repo

  3. Join PeerDB's Slack community

  4. PeerDB docs