<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[PeerDB Blog]]></title><description><![CDATA[At PeerDB, we are building a fast, simple and the most cost effective way to stream data from Postgres to Data Warehouses, Queues and Storage engines.]]></description><link>https://blog.peerdb.io</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1689020945974/Qh_S844-Q.png</url><title>PeerDB Blog</title><link>https://blog.peerdb.io</link></image><generator>RSS for Node</generator><lastBuildDate>Sun, 12 Apr 2026 05:03:43 GMT</lastBuildDate><atom:link href="https://blog.peerdb.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Postgres CDC connector for ClickPipes is now in Private Preview]]></title><description><![CDATA[Today, we’re excited to announce the private preview of the Postgres Change Data Capture (CDC) connector in ClickPipes! This enables customers to replicate their Postgres databases to ClickHouse Cloud in just a few clicks and leverage ClickHouse for ...]]></description><link>https://blog.peerdb.io/postgres-cdc-connector-for-clickpipes-is-now-in-private-preview</link><guid isPermaLink="true">https://blog.peerdb.io/postgres-cdc-connector-for-clickpipes-is-now-in-private-preview</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[streaming]]></category><category><![CDATA[replication]]></category><category><![CDATA[migration]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Mon, 25 Nov 2024 15:49:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732548180251/b709b7ed-6935-456b-a800-c665124d4785.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we’re excited to announce the private preview of the Postgres Change Data Capture (CDC) connector in ClickPipes! This enables customers to replicate their Postgres databases to ClickHouse Cloud in just a few clicks and leverage ClickHouse for blazing-fast analytics. You can use this connector for both continuous replication and one-time migrations use cases from Postgres.</p>
<p>The experience is natively integrated into ClickHouse Cloud through ClickPipes, the integration engine designed to simplify moving massive volumes of data to ClickHouse. This eliminates the need for external ETL tools, which are often expensive, slow, and don’t scale for Postgres.</p>
<p><strong>👉You can sign up to the private preview by following this</strong> <a target="_blank" href="https://clickhouse.com/cloud/clickpipes/postgres-cdc-connector"><strong>link</strong></a><strong>.</strong></p>
<p>Just a reminder, ClickHouse <a target="_blank" href="https://clickhouse.com/blog/clickhouse-welcomes-peerdb-adding-the-fastest-postgres-cdc-to-the-fastest-olap-database">joined forces</a> with  PeerDB, a leading Change Data Capture (CDC) provider for Postgres, a few months ago. PeerDB already supports multiple enterprise-grade workloads and has helped replicate petabytes of data from Postgres to ClickHouse. Over the past few months, the team has worked hard to natively integrate PeerDB into ClickHouse Cloud. This announcement marks the first release of this integration, enabling users to seamlessly move data from Postgres to ClickHouse.</p>
<p>The Postgres CDC connector was built in close collaboration with several customers and design partners who are already running production-grade workloads. Here are a few customer testimonials:</p>
<p><em>“PeerDB has been a game-changer for us, effortlessly migrating tens of terabytes from our Postgres warehouse into ClickHouse and keeping millions of daily orders synced with just seconds of latency. We're really excited about PeerDB's native integration into ClickHouse Cloud via ClickPipes and all of the opportunities it opens up for us.” -</em> <strong><em>SpotOn</em></strong></p>
<p><em>“We already reduced our Postgres to ClickHouse snapshot times from 10+ hours down to 15 minutes with PeerDB. Combining ClickHouse’s powerful analytics natively with PeerDB’s real-time data capture capabilities will greatly simplify our data processing workflows. This integration will enable us to build analytical applications faster, giving us a competitive edge in the market.”</em> <strong><em>- Vueling</em></strong></p>
<p>Without further ado, here is a demo of the Postgres CDC connector in ClickPipes:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=fHuFSmafYUo">https://www.youtube.com/watch?v=fHuFSmafYUo</a></div>
<p> </p>
<h2 id="heading-postgres-clickhouse-a-powerful-data-stack">Postgres + ClickHouse, a powerful data stack</h2>
<p>Using ClickHouse and PostgreSQL through a seamless CDC integration creates a powerful data stack by combining PostgreSQL's robust transactional capabilities with ClickHouse's high-performance analytics. CDC ensures real-time synchronization, allowing ClickHouse to handle fast queries on massive datasets without burdening PostgreSQL. This integration delivers real-time insights and scalable analytics, making it an ideal solution for modern, data-driven workflows. Below are a few main advantages of this architecture:</p>
<ol>
<li><p><strong>Full workload isolation:</strong> You can continue building your OLTP application on Postgres and your OLAP application on ClickHouse, with complete workload isolation—analytics will not affect your transactional workload.</p>
</li>
<li><p><strong>No compromises on features:</strong> It also allows you to build your applications using the full capabilities and features (e.g., SQL coverage, performance, etc.) of both Postgres and ClickHouse, each optimized for a specific workload.</p>
</li>
</ol>
<p>We believe customers derive the most value in solving real-world data problems by leveraging purpose-built databases like Postgres and ClickHouse as they were designed, with full flexibility, rather than relying on alternatives that retrofit one database engine into another, compromising the full feature set of each. We are observing a clear <a target="_blank" href="https://x.com/kiwicopple/status/1851638636590035054">trend</a> towards the Postgres + ClickHouse architecture among real-world customers.</p>
<h2 id="heading-key-benefits">Key Benefits</h2>
<p>The Postgres CDC connector in ClickPipes is purpose-built for Postgres and ClickHouse, ensuring a fast, simple, and cost-effective replication experience. Here are some key benefits for customers:</p>
<h3 id="heading-blazing-fast-performance">Blazing Fast Performance</h3>
<p>With features like parallel snapshotting, you can achieve 10x faster initial loads, transferring terabytes of data in hours instead of days, and experience replication latency as low as a few seconds for continuous replication (CDC).</p>
<h3 id="heading-super-simple">Super Simple</h3>
<p>You can start replicating your Postgres databases to ClickHouse in just a few clicks and minutes. Simply add your Postgres database as a source, select the specific tables/columns you want to replicate, and you're ready to go.</p>
<h3 id="heading-postgres-and-clickhouse-native-features">Postgres and ClickHouse native features</h3>
<p>This connector supports native Postgres features such as replication of schema changes, partitioned tables, built-in monitoring and alerting for replication slot size, and support for complex data types such as JSONB and ARRAYs, among others.</p>
<p>On the ClickHouse side, it supports features such as selecting specialized table engines, configuring custom order keys, choosing nullable columns, and so on during the replication process.</p>
<h3 id="heading-enterprise-grade-security">Enterprise-grade security</h3>
<p>At ClickHouse, security is a top priority, even before performance and features. We’ve extended the same level of security to the Postgres CDC connector in ClickPipes. It includes features such as SSH tunneling and Private Link to securely connect to your Postgres databases. Data in transit is fully encrypted using SSL.</p>
<h3 id="heading-no-vendor-lock-in">No vendor lock-in</h3>
<p>The Postgres CDC connector is powered by PeerDB, which is fully open source (<a target="_blank" href="https://github.com/PeerDB-io/peerdb/">https://github.com/PeerDB-io/peerdb/</a>). With the exception of the UI, we have ensured that all components are directly extended from the PeerDB open-source project. This underscores our commitment to open source and ensures there is no vendor lock-in for our customers.</p>
<h2 id="heading-how-to-sign-up-for-private-preview">How to sign up for Private Preview?</h2>
<p><a target="_blank" href="https://clickhouse.com/cloud/clickpipes/postgres-cdc-connector"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc5_VSt7G0koZg4z_phXYKqAGDfD3ZXmilHT8FeDX4JT_ifOLC6o4XashmFynYUIAJ92KQy1B5tEQs9Wlmg3ErYpUM3713dkHGPxYLn5KdhFBycHD1m9N0u4nRL-uUC6Lr6oM9w?key=qnroIGmQjh8ZytQ0G3msRaWj" alt /></a></p>
<p>You can sign up for the private preview by filling out the form on <a target="_blank" href="https://clickhouse.com/cloud/clickpipes/postgres-cdc-connector">this page</a>. Our team will reach out to you within a day and closely collaborate with you to provide early access. The Private Preview is completely free of charge. This is a great opportunity for you to get firsthand experience with the native Postgres integration in ClickHouse Cloud and directly influence the roadmap. Looking forward to having you onboard!</p>
]]></content:encoded></item><item><title><![CDATA[Postgres to ClickHouse: Data Modeling Tips]]></title><description><![CDATA[Last month, we acquired PeerDB, a company that specializes in Postgres CDC. PeerDB makes it fast and simple to replicate data from Postgres to ClickHouse. A common question from PeerDB users is how to model their data in ClickHouse after the replicat...]]></description><link>https://blog.peerdb.io/postgres-to-clickhouse-data-modeling-tips</link><guid isPermaLink="true">https://blog.peerdb.io/postgres-to-clickhouse-data-modeling-tips</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[data-modeling]]></category><category><![CDATA[replication]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Wed, 28 Aug 2024 15:32:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724803156847/66f1221b-7730-4fe0-a109-2b14c4fe23fe.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last month, we <a target="_blank" href="https://clickhouse.com/blog/clickhouse-welcomes-peerdb-adding-the-fastest-postgres-cdc-to-the-fastest-olap-database">acquired PeerDB</a>, a company that specializes in Postgres CDC. <a target="_blank" href="https://www.peerdb.io/">PeerDB</a> makes it fast and simple to replicate data from <a target="_blank" href="https://www.postgresql.org/">Postgres</a> to <a target="_blank" href="https://clickhouse.com/">ClickHouse</a>. A common question from PeerDB users is how to model their data in ClickHouse after the replication process to maximize the benefits of ClickHouse.</p>
<p>This question arises because ClickHouse and Postgres differ in data modeling, as each is a <strong>purpose-built database</strong> highly optimized for its specific workload: Postgres is a transactional (OLTP) database, while ClickHouse is an analytical (OLAP) columnar database. This guide walks you through essential data modeling concepts in ClickHouse for users coming from the Postgres world. Note that this is part 1 of a blog series, with more to come in the future.</p>
<h2 id="heading-replacingmergetree-table-engine">ReplacingMergeTree table engine</h2>
<p>PeerDB maps PostgreSQL tables to ClickHouse using the <a target="_blank" href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree">ReplacingMergeTree</a> engine. ClickHouse performs best with append-only workloads and <a target="_blank" href="https://clickhouse.com/docs/en/guides/developer/mutations">does not recommend</a> frequent UPDATEs. This is where ReplacingMergeTree is particularly powerful.</p>
<p><code>ReplacingMergeTree</code> supports workloads that involve both data ingestion and modifications. Each table is append-only, with user updates ingested as versioned INSERTs. The ReplacingMergeTree engine manages deduplication (merging) of rows in the background. This is one of the key factors that enables ClickHouse to deliver exceptional real-time ingestion performance.</p>
<p>In PeerDB, both INSERTs and UPDATEs from Postgres are captured as new rows with different versions (using <code>_peerdb_version</code>) in ClickHouse. The <code>ReplacingMergeTree</code> table engine periodically handles deduplication in the background using the Ordering Key (ORDER BY columns), retaining only the row with the latest <code>_peerdb_version</code>. DELETEs from PostgreSQL are propagated as new rows marked as deleted (using the <code>_peerdb_is_deleted</code> column). The snippet below shows the target table definition for the <code>public_goals</code> table in ClickHouse.</p>
<pre><code class="lang-sql">clickhouse-cloud :) <span class="hljs-keyword">SHOW</span> <span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> public_goals;
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> peerdb.public_goals
(
    <span class="hljs-string">`id`</span> Int64,
    <span class="hljs-string">`owned_user_id`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`goal_title`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`goal_data`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`enabled`</span> <span class="hljs-built_in">Bool</span>,
    <span class="hljs-string">`ts`</span> DateTime64(<span class="hljs-number">6</span>),
    <span class="hljs-string">`_peerdb_synced_at`</span> DateTime64(<span class="hljs-number">9</span>) <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),
    <span class="hljs-string">`_peerdb_is_deleted`</span> <span class="hljs-built_in">Int8</span>,
    <span class="hljs-string">`_peerdb_version`</span> Int64
)
<span class="hljs-keyword">ENGINE</span> = SharedReplacingMergeTree
(<span class="hljs-string">'/clickhouse/tables/{uuid}/{shard}'</span>, <span class="hljs-string">'{replica}'</span>, _peerdb_version)
PRIMARY <span class="hljs-keyword">KEY</span> <span class="hljs-keyword">id</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">id</span>
<span class="hljs-keyword">SETTINGS</span> index_granularity = <span class="hljs-number">8192</span>
</code></pre>
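<p>For illustration, until a background merge happens, a plain query for a single key can return several versions of the same row. The sketch below assumes a hypothetical row with <code>id = 42</code> that was inserted and then updated once; the values are made up:</p>
<pre><code class="lang-sql">SELECT id, goal_title, enabled, _peerdb_version, _peerdb_is_deleted
FROM peerdb.public_goals
WHERE id = 42
ORDER BY _peerdb_version;

-- id | goal_title | enabled | _peerdb_version | _peerdb_is_deleted
-- 42 | Run 5k     | true    |               1 |                  0
-- 42 | Run 10k    | true    |               2 |                  0
</code></pre>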
<h2 id="heading-you-might-still-see-duplicates-for-rowshow-should-you-handle-them">You might still see duplicates for rows—how should you handle them?</h2>
<p>ReplacingMergeTree clears out duplicates asynchronously in the background but doesn't guarantee the absence of duplicates. So, when you query the data, you might still see duplicates for the same row or primary key but with different versions. This is expected. To remove duplicates, you have a couple of approaches:</p>
<h3 id="heading-use-final-in-your-queries">Use FINAL in your queries</h3>
<p>ClickHouse has a unique modifier called <a target="_blank" href="https://clickhouse.com/docs/en/sql-reference/statements/select/from#final-modifier">FINAL</a>, which performs de-duplication (merging of rows) at query time. This de-duplication occurs after filtering (WHERE clause) but before aggregations (GROUP BY).</p>
<p>A historical concern has been that FINAL can slow down query performance. While it does impact query performance to some extent, recent releases of ClickHouse have introduced <a target="_blank" href="https://github.com/ClickHouse/ClickHouse/issues/11722">significant improvements</a> to enhance FINAL query performance. So, don’t hesitate to use the FINAL clause and evaluate how your queries perform. Below is an example of how to use the FINAL clause:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> owner_user_id, <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">FROM</span> goals <span class="hljs-keyword">FINAL</span> 
<span class="hljs-keyword">WHERE</span> enabled = <span class="hljs-literal">true</span> <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> owner_user_id;
</code></pre>
<h3 id="heading-use-argmax-to-deduplicate-rows-at-query-time">Use argMax to deduplicate rows at query time</h3>
<p>In ClickHouse, <a target="_blank" href="https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/argmax">argMax</a> is a powerful function for deduplicating rows dynamically during query execution. This is particularly useful when you need to retain the most recent or relevant record based on a versioning or timestamp column.</p>
<p>For instance, if you're working with a table like <code>peerdb.public_goals</code>, where <code>id</code> is the primary key and <code>_peerdb_version</code> tracks versions, you can use argMax to select the row with the highest <code>_peerdb_version</code> for each <code>id</code>. This approach allows you to efficiently remove duplicates without altering the underlying data. You can then run your aggregations as a subquery over this deduplicated result set for further analysis. The query below is an example of using argMax:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    owned_user_id,
    <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> active_goals_count,
    <span class="hljs-keyword">MAX</span>(ts) <span class="hljs-keyword">AS</span> latest_goal_time
<span class="hljs-keyword">FROM</span>
(
    <span class="hljs-keyword">SELECT</span>
        <span class="hljs-keyword">id</span>,
        argMax(owned_user_id, _peerdb_version) <span class="hljs-keyword">AS</span> owned_user_id,
        argMax(goal_title, _peerdb_version) <span class="hljs-keyword">AS</span> goal_title,
        argMax(goal_data, _peerdb_version) <span class="hljs-keyword">AS</span> goal_data,
        argMax(enabled, _peerdb_version) <span class="hljs-keyword">AS</span> enabled,
        argMax(ts, _peerdb_version) <span class="hljs-keyword">AS</span> ts,
        argMax(_peerdb_synced_at, _peerdb_version) <span class="hljs-keyword">AS</span> _peerdb_synced_at,
        argMax(_peerdb_is_deleted, _peerdb_version) <span class="hljs-keyword">AS</span> _peerdb_is_deleted,
        <span class="hljs-keyword">max</span>(_peerdb_version) <span class="hljs-keyword">AS</span> _peerdb_version
    <span class="hljs-keyword">FROM</span> peerdb.public_goals
    <span class="hljs-keyword">WHERE</span> enabled = <span class="hljs-literal">true</span>
    <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">id</span>
) <span class="hljs-keyword">AS</span> deduplicated_goals
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> owned_user_id;
</code></pre>
<h3 id="heading-use-window-functions">Use WINDOW FUNCTIONS</h3>
<p>You can use ClickHouse's <a target="_blank" href="https://clickhouse.com/docs/en/sql-reference/window-functions">window functions</a> to achieve similar deduplication by selecting the row with the highest <code>_peerdb_version</code> within each id partition. Here's an example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    owned_user_id,
    <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> active_goals_count,
    <span class="hljs-keyword">MAX</span>(ts) <span class="hljs-keyword">AS</span> latest_goal_time
<span class="hljs-keyword">FROM</span>
(
    <span class="hljs-keyword">SELECT</span>
        *,
        ROW_NUMBER() <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">id</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> _peerdb_version <span class="hljs-keyword">DESC</span>) <span class="hljs-keyword">AS</span> rn
    <span class="hljs-keyword">FROM</span> peerdb.public_goals
    <span class="hljs-keyword">WHERE</span> enabled = <span class="hljs-literal">true</span>
) <span class="hljs-keyword">AS</span> ranked_goals
<span class="hljs-keyword">WHERE</span> rn = <span class="hljs-number">1</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> owned_user_id;
</code></pre>
<h3 id="heading-use-views-to-simplify-deduplication">Use Views to simplify deduplication</h3>
<p>Encapsulate deduplication in a <a target="_blank" href="https://clickhouse.com/docs/en/sql-reference/statements/create/view">view</a> to make it simple for BI tools to query the most up-to-date data. For example, use a window function in the view to keep only the latest version of each row:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">VIEW</span> goals <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span>
(
    <span class="hljs-keyword">SELECT</span>
        *,
        ROW_NUMBER() <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">id</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> _peerdb_version <span class="hljs-keyword">DESC</span>) <span class="hljs-keyword">AS</span> rn
    <span class="hljs-keyword">FROM</span> peerdb.public_goals
    <span class="hljs-keyword">WHERE</span> enabled = <span class="hljs-literal">true</span>
) <span class="hljs-keyword">WHERE</span> rn = <span class="hljs-number">1</span>;
</code></pre>
<pre><code class="lang-sql">
<span class="hljs-keyword">SELECT</span>
    owned_user_id,
    <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> active_goals_count,
    <span class="hljs-keyword">MAX</span>(ts) <span class="hljs-keyword">AS</span> latest_goal_time
<span class="hljs-keyword">FROM</span> goals
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> owned_user_id;
</code></pre>
<h2 id="heading-nullable-columns">Nullable Columns</h2>
<p>If you're coming from the Postgres world, one surprising aspect of ClickHouse is that it doesn’t store NULL values for columns unless you explicitly wrap the column types in <a target="_blank" href="https://clickhouse.com/docs/en/sql-reference/data-types/nullable"><code>Nullable</code></a>. For example, instead of storing NULL for dates, ClickHouse stores <code>1970-01-01</code> as the default value, which might be unexpected. This behavior is due to the fact that storing NULLs can <a target="_blank" href="https://clickhouse.com/docs/en/sql-reference/data-types/nullable">impact</a> query performance in ClickHouse, as it’s a columnar database. Hence, ClickHouse requires users to explicitly define <code>Nullable</code> types.</p>
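<p>A minimal sketch of the difference (the table and column names here are hypothetical):</p>
<pre><code class="lang-sql">CREATE TABLE events_plain    (id Int64, closed_at DateTime)           ENGINE = MergeTree ORDER BY id;
CREATE TABLE events_nullable (id Int64, closed_at Nullable(DateTime)) ENGINE = MergeTree ORDER BY id;

-- Omitting closed_at on insert stores the type's default in the plain table,
-- but a real NULL in the Nullable one.
INSERT INTO events_plain    (id) VALUES (1); -- closed_at = 1970-01-01 00:00:00
INSERT INTO events_nullable (id) VALUES (1); -- closed_at = NULL
</code></pre>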
<p>In PeerDB, we’ve introduced a setting called <code>PEERDB_NULLABLE</code>, which, when set to <code>true</code>, automatically detects nullable columns in Postgres and marks them as <code>Nullable</code> in ClickHouse during the replication process. This means you don’t need to manually define <code>Nullable</code> types during replication. You can read more about this feature in the following <a target="_blank" href="https://github.com/PeerDB-io/peerdb/pull/2001">PR</a>.</p>
<h2 id="heading-data-types"><strong>Data Types</strong></h2>
<p>ClickHouse offers a wide variety of data types, ranging from numbers, text, timestamps, dates, and arrays to the recently introduced <a target="_blank" href="https://github.com/ClickHouse/ClickHouse/issues/54864">JSON</a> type. Many of the data types in Postgres can be natively stored in ClickHouse without much modification.</p>
<p>As a reference, here is <a target="_blank" href="https://docs.peerdb.io/datatypes/datatype-matrix">the data type matrix</a> we use at PeerDB when replicating data from Postgres to ClickHouse.</p>
<h2 id="heading-the-ordering-key">The Ordering Key</h2>
<h3 id="heading-what-is-an-ordering-key">What is an Ordering Key?</h3>
<p>Choosing the right ordering key is crucial for query performance in ClickHouse. Defined by the <code>ORDER BY</code> clause when creating a table, the ordering key functions similarly to an index in Postgres but is optimized for analytics. Unlike Postgres, which uses a B-tree index with entries pointing to each row, ClickHouse uses Sparse Indexing:</p>
<ol>
<li><p><strong>Data is sorted based on Ordering Key:</strong> The ordering key ensures that data on disk is sorted according to the specified columns. This allows for better <a target="_blank" href="https://clickhouse.com/docs/en/data-compression/compression-in-clickhouse">compression</a>, as correlated values are stored together.</p>
</li>
<li><p><strong>Ordering Key also creates a sparse index:</strong> The ordering key also creates a sparse index, storing only ranges of columns, with each entry pointing to a group of sorted rows. This keeps the index small, allowing ClickHouse to quickly identify relevant groups of rows using a binary search and execute queries efficiently. You can read more about this <a target="_blank" href="https://clickhouse.com/docs/en/migrations/postgresql/designing-schemas#primary-ordering-keys-in-clickhouse">here</a>.</p>
</li>
</ol>
<p>You can think of ordering keys as similar to <a target="_blank" href="https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BRIN">BRIN</a> indexes in Postgres, but in ClickHouse, the data is automatically sorted based on the ordering key via asynchronous merging of parts, so you don’t need to handle sorting during data ingestion.</p>
<h3 id="heading-choosing-an-appropriate-ordering-key">Choosing an appropriate Ordering Key</h3>
<p>When selecting an ordering key, base your choice on the columns most frequently used in your query filters. <strong>Prioritize columns that are commonly used in WHERE clauses, and order them in ascending sequence of cardinality</strong>—starting with columns that have the fewest distinct values. This approach optimizes data compression and query performance. For a deeper understanding of this topic, refer to the detailed guide <a target="_blank" href="https://clickhouse.com/docs/en/data-modeling/schema-design#choosing-an-ordering-key">here</a>.</p>
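<p>As an illustrative sketch (the table and columns here are hypothetical), a table that is mostly filtered by tenant and date could order its key from lowest to highest cardinality:</p>
<pre><code class="lang-sql">CREATE TABLE page_views
(
    tenant_id  UInt32,
    event_date Date,
    user_id    UInt64,
    url        String
)
ENGINE = MergeTree
-- tenant_id has the fewest distinct values, user_id the most
ORDER BY (tenant_id, event_date, user_id);
</code></pre>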
<h3 id="heading-primary-key-vs-ordering-key"><strong>PRIMARY KEY vs Ordering Key</strong></h3>
<p>If you observe the table definition of <code>public_goals</code>, it has a <code>PRIMARY KEY</code>. You might be wondering how the <code>PRIMARY KEY</code> differs from the Ordering Key. Let us understand how they differ:</p>
<ol>
<li><p><code>PRIMARY KEY</code>, if specified, defines the columns in the sparse index, while the columns in the <code>ORDER BY</code> clause determine how the data is sorted on disk. The <code>ORDER BY</code> columns are also used by <code>ReplacingMergeTree</code> for deduplicating data.</p>
</li>
<li><p>If the <code>PRIMARY KEY</code> isn't specified, the Ordering Key automatically becomes the <code>PRIMARY KEY</code> and defines the columns in the sparse index.</p>
</li>
</ol>
<p><strong>NOTE:</strong> Columns in the <code>PRIMARY KEY</code> must always form a prefix of the Ordering Key. This ensures that the index aligns with the physical data order, maximizing query performance by minimizing unnecessary data scans.</p>
<p><strong>An example where</strong> <code>PRIMARY KEY</code> <strong>could differ from Ordering Key</strong></p>
<p>An example where you might have different <code>PRIMARY KEY</code> and <code>ORDER BY</code> columns is when your queries are primarily filtered on <code>customer_id</code> rather than <code>id</code>. In this case, you can define the <code>PRIMARY KEY</code> on just <code>customer_id</code> and the <code>ORDER BY</code> on <code>customer_id, id</code>. This approach ensures a smaller, more efficient sparse index for querying, while data deduplication occurs on <code>id</code>, ensuring no data is lost.</p>
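<p>A minimal sketch of that layout (the table and columns are hypothetical):</p>
<pre><code class="lang-sql">CREATE TABLE public_orders
(
    id              Int64,
    customer_id     Int64,
    amount          Decimal(18, 2),
    _peerdb_version Int64
)
ENGINE = ReplacingMergeTree(_peerdb_version)
-- sparse index on customer_id only; deduplication still happens on (customer_id, id)
PRIMARY KEY customer_id
ORDER BY (customer_id, id);
</code></pre>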
<p><strong>NOTE:</strong> Unlike in Postgres, where the <code>PRIMARY KEY</code> is a B-tree index that guarantees uniqueness, in ClickHouse, it does not ensure uniqueness. Instead, it defines the columns that should be part of the sparse index.</p>
<h3 id="heading-modifying-the-ordering-key">Modifying the Ordering Key</h3>
<p>Choosing the right <a target="_blank" href="https://clickhouse.com/docs/en/migrations/postgresql/designing-schemas#primary-ordering-keys-in-clickhouse">ordering key</a> is crucial for query performance in ClickHouse, as it acts as an index when querying data. By default, PeerDB uses the PostgreSQL <code>PRIMARY KEY</code> to define the ordering key in ClickHouse tables, but you can change it using the following methods:</p>
<h3 id="heading-use-materialized-views">Use materialized views</h3>
<p>You can use materialized views to create a new table with a different ordering key suitable for your workload. Include the primary key columns at the end of the ordering key to ensure proper deduplication, as ReplacingMergeTree uses the ORDER BY clause for deduplication, and including the primary key ensures that no data is lost.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">MATERIALIZED</span> <span class="hljs-keyword">VIEW</span> goals_mv
<span class="hljs-keyword">ENGINE</span> = ReplacingMergeTree(_peerdb_version)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (enabled, ts, <span class="hljs-keyword">id</span>)  POPULATE <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> peerdb.public_goals;
</code></pre>
<p><strong>NOTE:</strong> After creating the materialized view, be sure to follow the steps described in the previous section on handling duplicates to ensure proper deduplication during query time.</p>
<h3 id="heading-predefine-target-tables-with-the-desired-ordering-key">Predefine target tables with the desired Ordering Key</h3>
<p>To change the ordering key, you can predefine new tables with your desired Ordering Key and then swap them with the existing tables. Here's how you can do it:</p>
<p><strong>1. Create a Dummy Mirror:</strong> Create a dummy mirror in PeerDB to generate the default tables with the correct metadata columns and data types.</p>
<p><strong>2. Create a New Table with the Desired Ordering Key:</strong> Use the table created by PeerDB to define a new table with your desired ordering key. Include the primary key columns at the end of the ordering key to ensure proper deduplication. Here is an example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> public_events_new <span class="hljs-keyword">AS</span> public_events
<span class="hljs-keyword">ENGINE</span> = ReplacingMergeTree(_peerdb_version)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (user_id,<span class="hljs-keyword">id</span>);
</code></pre>
<p><strong>3. Drop the Old Table:</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">DROP</span> <span class="hljs-keyword">TABLE</span> public_events;
</code></pre>
<p><strong>4. Rename the New Table:</strong> Rename the new table to the original table name.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">RENAME</span> <span class="hljs-keyword">TABLE</span> public_events_new <span class="hljs-keyword">TO</span> public_events;
</code></pre>
<p><strong>5. Start MIRROR to Point to the New Table:</strong> Configure the mirror to point to the actual table. PeerDB uses <code>CREATE TABLE IF NOT EXISTS</code> behind the scenes and continues to ingest data into the new table.</p>
<h2 id="heading-handling-deletes">Handling DELETEs</h2>
<p>As mentioned, DELETEs from PostgreSQL are propagated as new rows marked as deleted (using the <code>_peerdb_is_deleted</code> column). To exclude deleted rows from your queries, you can create row-level policies in ClickHouse based on the <code>_peerdb_is_deleted</code> column. Here’s an example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">ROW</span> <span class="hljs-keyword">POLICY</span> policy_name <span class="hljs-keyword">ON</span> table_name
<span class="hljs-keyword">FOR</span> <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">USING</span> _peerdb_is_deleted = <span class="hljs-number">0</span>;
</code></pre>
<p>This policy ensures that only rows where <code>_peerdb_is_deleted</code> is 0 are visible when querying the table.</p>
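<p>Alternatively, the delete filter can be folded directly into a deduplicating view (similar to the one shown earlier), so BI tools only ever see live, latest-version rows. A sketch, with an arbitrary view name:</p>
<pre><code class="lang-sql">CREATE VIEW goals_live AS
SELECT * FROM
(
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY _peerdb_version DESC) AS rn
    FROM peerdb.public_goals
) WHERE rn = 1 AND _peerdb_is_deleted = 0;
</code></pre>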
<h2 id="heading-conclusion">Conclusion</h2>
<p>I hope you enjoyed reading the blog. I aimed to cover the most common data-modeling challenges you might encounter when migrating from PostgreSQL to ClickHouse. In the next blog, I plan to dive into more advanced topics, such as joins, writing efficient SQL queries, and so on. If you want to give PeerDB and ClickHouse a try to start replicating data from Postgres to ClickHouse, please check out the links below or reach out to us directly!</p>
<ol>
<li><p><a target="_blank" href="https://clickhouse.com/docs/en/cloud-quick-start">Try ClickHouse Cloud for Free</a></p>
</li>
<li><p><a target="_blank" href="https://auth.peerdb.cloud/signup">Try PeerDB Cloud for Free</a></p>
</li>
<li><p><a target="_blank" href="https://docs.peerdb.io/mirror/cdc-pg-clickhouse">Docs on Postgres to ClickHouse Replication</a></p>
</li>
<li><p><a target="_blank" href="https://www.peerdb.io/sign-up">Talk to the PeerDB team directly</a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Enhancing Postgres to ClickHouse replication using PeerDB]]></title><description><![CDATA[Providing a fast and simple way to replicate data from Postgres to ClickHouse has been a top priority for us over the past few months. Last month, we acquired PeerDB, a company that specializes in Postgres CDC. We're actively integrating PeerDB into ...]]></description><link>https://blog.peerdb.io/enhancing-postgres-to-clickhouse-replication-using-peerdb</link><guid isPermaLink="true">https://blog.peerdb.io/enhancing-postgres-to-clickhouse-replication-using-peerdb</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[migration]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[ETL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[change data capture]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Wed, 14 Aug 2024 17:03:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723654703773/5eed5a50-6d06-41c8-8b47-443be68be636.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Providing a fast and simple way to replicate data from <a target="_blank" href="https://www.postgresql.org/">Postgres</a> to ClickHouse has been a top priority for us over the past few months. Last month, we <a target="_blank" href="https://clickhouse.com/blog/clickhouse-welcomes-peerdb-adding-the-fastest-postgres-cdc-to-the-fastest-olap-database">acquired</a> <a target="_blank" href="https://www.peerdb.io/">PeerDB</a>, a company that specializes in Postgres CDC. We're actively integrating PeerDB into <a target="_blank" href="https://clickhouse.com/cloud/clickpipes">ClickPipes</a> to add Postgres as a source connector. Meanwhile, <a target="_blank" href="https://www.peerdb.io/">PeerDB</a> is the recommended solution for moving data from Postgres to ClickHouse.</p>
<p>In the past few months, the PeerDB team had the opportunity to work with multiple ClickHouse customers, helping them replicate billions of rows and terabytes of data from Postgres to ClickHouse. In this blog, we will take a deep dive into some of the top features that were released recently to make the replication experience rock-solid. These features focus on enhancing the speed, stability, and security of replication from Postgres to ClickHouse.</p>
<h2 id="heading-efficiently-flush-the-replication-slot">Efficiently flush the replication slot</h2>
<p>PeerDB uses Postgres Logical Replication Slots to implement Change Data Capture (CDC). Logical Replication Slots provide a stream of INSERTs, UPDATEs, and DELETEs occurring in the Postgres database. It is recommended to <a target="_blank" href="https://blog.peerdb.io/overcoming-pitfalls-of-postgres-logical-decoding#heading-always-consume-the-replication-slot">always consume the replication slot</a>. If the replication slot isn't consumed continuously, WAL files can accumulate, posing a risk of crashing the Postgres database.</p>
<p>To ensure that the logical replication slot is always consumed, we implemented a <a target="_blank" href="https://github.com/PeerDB-io/peerdb/pull/1780">feature</a> to always read the replication slot and flush the changes to an internal stage (S3). An asynchronous process then consumes the changes from S3 and applies them to ClickHouse. Flushing the changes to the internal stage also ensures that the replication slot is consumed even when the target (ClickHouse) is down.</p>
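<p>If you want to monitor this yourself on the Postgres side, a query along these lines (using the standard <code>pg_replication_slots</code> catalog view) shows how much WAL each logical slot is currently retaining:</p>
<pre><code class="lang-sql">SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';
</code></pre>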
<h2 id="heading-better-memory-handling-on-clickhouse">Better memory handling on ClickHouse</h2>
<p>While replicating data from Postgres to ClickHouse, customers occasionally ran into memory-related issues on ClickHouse. This was more common when customers were on a free trial of ClickHouse and provisioned an instance with fewer resources (RAM and compute). PeerDB writes rows in batches to ClickHouse via <code>INSERT</code> queries and <code>INSERT SELECT</code> queries. We were seeing 2 types of issues:</p>
<ol>
<li><p>Some queries were failing because they were consuming more memory than allocated on the ClickHouse server.</p>
</li>
<li><p>Some queries would be killed by <a target="_blank" href="https://clickhouse.com/docs/en/operations/settings/memory-overcommit">ClickHouse's overcommit tracker.</a></p>
</li>
</ol>
<p>We attempted to thoroughly understand the various <a target="_blank" href="https://clickhouse.com/docs/en/operations/settings/settings">database settings</a> ClickHouse provides that influence memory utilization. Based on this, we modified the following settings (a sketch of how such settings can be applied to a query follows the list):</p>
<ol>
<li><p><a target="_blank" href="https://clickhouse.com/docs/en/operations/settings/settings#setting-max_block_size"><code>max_block_size</code></a>: This is useful for our <code>INSERT SELECT</code> queries, where this setting determines how many blocks are loaded by the <code>SELECT</code> and inserted. We reduced this with the hope that more blocks would reduce memory spikes when our queries are executed.</p>
</li>
<li><p><a target="_blank" href="https://clickhouse.com/docs/en/operations/settings/settings#max_insert_block_size"><code>max_insert_block_size</code></a>: Similar to <code>max_block_size</code> except this applies to our <code>INSERT</code> queries. We reduced this for the same reason as above.</p>
</li>
<li><p><a target="_blank" href="https://clickhouse.com/docs/en/operations/settings/settings#max_threads"><code>max_threads</code></a>: This setting controls the number of threads used for processing queries on ClickHouse. According to the documentation, the lower this number, the less memory is consumed. Therefore, we reduced this parameter.</p>
</li>
<li><p><a target="_blank" href="https://clickhouse.com/docs/en/operations/settings/memory-overcommit#user-overcommit-tracker"><code>memory_overcommit_ratio_denominator</code></a>: This is related to the overcommit tracker mentioned earlier. We disabled this setting for our queries by setting it to 0.</p>
</li>
<li><p><a target="_blank" href="https://clickhouse.com/docs/en/integrations/go#connection-settings-1"><code>dial_timeout</code></a>: Sometimes queries were taking longer than 1 minute, so we <a target="_blank" href="https://github.com/PeerDB-io/peerdb/pull/1772">increased the <code>dial_timeout</code></a> to a higher value.</p>
</li>
</ol>
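<p>For illustration, settings like these can be applied per session or per query in ClickHouse. The values and the <code>staging_goals</code> table below are placeholders, not the exact numbers PeerDB uses:</p>
<pre><code class="lang-sql">-- Session-level overrides (placeholder values)
SET max_block_size = 16384;
SET max_insert_block_size = 16384;
SET max_threads = 2;
SET memory_overcommit_ratio_denominator = 0;

-- Or scoped to a single INSERT SELECT
INSERT INTO peerdb.public_goals
SELECT * FROM staging_goals
SETTINGS max_insert_block_size = 16384, max_threads = 2;
</code></pre>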
<p>These changes drastically reduced memory-related issues on smaller ClickHouse clusters. We are actively working with the core team to further fine-tune ClickHouse-specific settings. Additionally, we are working on a <a target="_blank" href="https://github.com/PeerDB-io/peerdb/pull/1770">feature</a> that improves the handling of large datasets by breaking them into manageable parts for more efficient processing and storage.</p>
<h2 id="heading-row-level-transformations">Row-level transformations</h2>
<p>A few months ago, PeerDB shipped <a target="_blank" href="https://blog.peerdb.io/row-level-transformations-in-postgres-cdc-using-lua">Lua-based row-level transformations</a> while replicating data from Postgres to Queues such as Kafka. We have now extended this feature to ClickHouse. With this feature, customers can write simple Lua scripts to perform row-level transformations, enabling use cases such as masking PII data, generating columns, and more. Below is a quick demo of this feature to mask PII columns while replicating data from Postgres to ClickHouse:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.loom.com/share/9d966c60f92e4eabb5e09d65fc9a3907?sid=8c9759a7-6d33-4b30-bc19-ecdaea6f6281">https://www.loom.com/share/9d966c60f92e4eabb5e09d65fc9a3907?sid=8c9759a7-6d33-4b30-bc19-ecdaea6f6281</a></div>
<p> </p>
<h2 id="heading-improved-security-on-peerdb-cloud">Improved security on PeerDB Cloud</h2>
<p>At PeerDB, safeguarding data replication from Postgres to ClickHouse is crucial. To enhance security, we have implemented several key measures around AWS S3, which we use for internally staging data before pushing it to ClickHouse.</p>
<h3 id="heading-temporary-credentials-with-iam-roles">Temporary credentials with IAM roles</h3>
<p>One significant enhancement is the use of AWS S3 buckets with strict access controls. Instead of traditional, long-lived user-generated access keys, which pose a higher risk of compromise, we use IAM roles to generate temporary credentials. These credentials are automatically rotated by AWS, ensuring they are always up-to-date and valid for only short periods, thus minimizing the risk of unauthorized access.</p>
<p>Additionally, with the introduction of the AWS_SESSION_TOKEN parameter in ClickHouse version 24.3.1, our security practices have been further strengthened. This update allows the use of short-lived credentials, aligning with our approach to secure data replication.</p>
<h3 id="heading-attribute-based-access-control-abac">Attribute Based Access Control (ABAC)</h3>
<p>In a multi-tenant environment, managing access to S3 buckets poses several challenges, such as ensuring tenant isolation, preventing unauthorized access, and minimizing role proliferation. To address these issues, we employ <strong>Attribute Based Access Control</strong> (ABAC). ABAC allows us to define dynamic, fine-grained access policies based on user roles, resource tags, and environmental variables. This method not only provides enhanced security but also improves scalability by eliminating the need for creating numerous roles. By using ABAC, we ensure that only authorized components can access sensitive data, maintaining a secure and manageable system.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Hope you enjoyed reading the blog. PeerDB has spent multiple cycles hardening the Postgres CDC experience for ClickHouse and now supports multiple customers replicating billions of records in real time from Postgres to ClickHouse. If you want to give PeerDB and ClickHouse a try, please check out the links below or reach out to us directly!</p>
<ol>
<li><p><a target="_blank" href="https://auth.peerdb.cloud/signup">Try PeerDB Cloud for Free</a></p>
</li>
<li><p><a target="_blank" href="https://clickhouse.com/docs/en/cloud-quick-start">Try ClickHouse Cloud for Free</a></p>
</li>
<li><p><a target="_blank" href="https://docs.peerdb.io/mirror/cdc-pg-clickhouse">Docs on Postgres to ClickHouse Replication</a></p>
</li>
<li><p><a target="_blank" href="https://www.peerdb.io/sign-up">Talk to the PeerDB team directly</a></p>
</li>
</ol>
<h2 id="heading-references">References</h2>
<p>This blog is a replica of the original blog, which can be found <a target="_blank" href="https://clickhouse.com/blog/enhancing-postgres-to-clickhouse-replication-using-peerdb">here</a>.</p>
]]></content:encoded></item><item><title><![CDATA[ClickHouse acquires PeerDB for native Postgres CDC integration]]></title><description><![CDATA[We are thrilled to join forces with ClickHouse to make it seamless for customers to move data from their Postgres databases to ClickHouse and power real-time analytics and data warehousing use cases.
We released the ClickHouse target connector for Po...]]></description><link>https://blog.peerdb.io/clickhouse-acquires-peerdb-for-native-postgres-cdc-integration</link><guid isPermaLink="true">https://blog.peerdb.io/clickhouse-acquires-peerdb-for-native-postgres-cdc-integration</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[acquisition]]></category><category><![CDATA[Databases]]></category><category><![CDATA[postgres]]></category><category><![CDATA[replication]]></category><category><![CDATA[ETL]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Tue, 30 Jul 2024 13:53:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1722211525420/3247b7ce-862d-4421-ae67-812b9d52d781.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We are thrilled to join forces with ClickHouse to make it seamless for customers to move data from their Postgres databases to ClickHouse and power real-time analytics and data warehousing use cases.</p>
<p>We <a target="_blank" href="https://blog.peerdb.io/postgres-to-clickhouse-real-time-replication-using-peerdb">released</a> the ClickHouse target connector for Postgres Change Data Capture (CDC) earlier this year. Since then, ClickHouse has become the fastest-growing connector in terms of usage, surpassing other targets such as Snowflake and BigQuery. With this acquisition, we will be powering the Postgres CDC connector for <a target="_blank" href="https://clickhouse.com/cloud/clickpipes">ClickPipes</a>, the native integration engine that helps customers move data into ClickHouse.</p>
<h1 id="heading-our-thesis-behind-the-acquisition">Our thesis behind the acquisition</h1>
<p>A few months ago, when the prospect of an acquisition came up, we debated whether it was the right move for PeerDB. After much consideration, we decided that it was the best move. It all came down to 3 main reasons:</p>
<ol>
<li><p><strong>Amplifying Customer Value -</strong> ClickHouse was the fastest-growing target connector for Postgres CDC. We observed this firsthand at PeerDB, where we helped multiple customers move billions of rows from Postgres to ClickHouse. This aligns with the ClickHouse community's <a target="_blank" href="https://x.com/tbragin/status/1794052668601852068">feedback</a> on Postgres CDC. This acquisition will accelerate PeerDB's reach and make it accessible to thousands of ClickHouse customers, generating significant value.</p>
</li>
<li><p><strong>Postgres and ClickHouse: A Match Made in Heaven -</strong> Postgres is becoming the de facto operational (OLTP) database, while ClickHouse is the fastest analytical database on the planet. Both originate from the same ethos of open source and have a strong presence in the community. We believe that for customers using Postgres as their default OLTP database, ClickHouse is the natural OLAP counterpart. This is already reflected by prominent users like <a target="_blank" href="https://about.gitlab.com/blog/2022/04/29/two-sizes-fit-most-postgresql-and-clickhouse/">GitLab</a>, <a target="_blank" href="https://clickhouse.com/blog/langchain-why-we-choose-clickhouse-to-power-langchain">LangChain</a>, <a target="_blank" href="https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse">Cloudflare</a> and <a target="_blank" href="https://tech.instacart.com/real-time-fraud-detection-with-yoda-and-clickhouse-bd08e9dbe3f4">Instacart</a>, who run both databases in conjunction. This acquisition would bridge this gap further, making it easier for customers to use Postgres and ClickHouse together.</p>
</li>
<li><p><strong>We love the ClickHouse team</strong> - We have been closely collaborating with the ClickHouse team for the past several months. Their customer-first approach, strong emphasis on product quality, and growth mindset align with what PeerDB stands for. We felt that if we both joined forces, we could build something magical for customers.</p>
</li>
</ol>
<h1 id="heading-what-does-this-mean-for-existing-and-future-customers">What does this mean for existing and future customers?</h1>
<p>As part of this acquisition, a few important product decisions were made to ensure that the PeerDB community continues to thrive and that existing customers are not affected.</p>
<ol>
<li><p>PeerDB will remain <a target="_blank" href="https://github.com/PeerDB-io/peerdb">free and open</a> under the same ELv2 license.</p>
</li>
<li><p><a target="_blank" href="https://github.com/PeerDB-io/peerdb-enterprise">PeerDB Enterprise</a> offering that comes with production-grade Helm charts is being made free and open under the same ELv2 license. This means anyone can now run production-grade PeerDB workloads in a self-managed way for free!</p>
</li>
<li><p>The end-of-life (EOL) of PeerDB for existing paid customers using non-ClickHouse Cloud connectors will be <strong>one year</strong> from now, i.e. July 30th, 2025. Customers will receive the same support and SLAs as promised in their contracts, and we will assist them with the transition plan.</p>
</li>
<li><p>Until PeerDB is fully integrated into ClickPipes, we will be supporting PeerDB Cloud, the fully managed offering of PeerDB. For new customers from now on, <strong>PeerDB Cloud will only support ClickHouse Cloud as the target connector for Postgres CDC.</strong> If customers want to use other target connectors, they can use the free and open self-managed PeerDB options.</p>
</li>
</ol>
<h2 id="heading-peerdb-cloud-for-blazing-fast-postgres-cdc-to-clickhouse-cloud">PeerDB Cloud for blazing fast Postgres CDC to ClickHouse Cloud</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.loom.com/share/3efd88baae4c44c091a4afc9af699f2a">https://www.loom.com/share/3efd88baae4c44c091a4afc9af699f2a</a></div>
<p> </p>
<p>Until PeerDB is fully integrated into ClickPipes, customers can use PeerDB Cloud to replicate data from Postgres to ClickHouse Cloud. Over the last few months, we have dedicated <a target="_blank" href="https://github.com/PeerDB-io/peerdb/releases">multiple cycles</a> to enhancing speed, tightening security, and adding features to provide an enterprise-grade Postgres CDC experience for ClickHouse. You can follow the links below to get started:</p>
<ol>
<li><p><a target="_blank" href="https://app.peerdb.cloud/">Start Trial of PeerDB Cloud</a></p>
</li>
<li><p><a target="_blank" href="https://docs.peerdb.io/mirror/cdc-pg-clickhouse">Postgres to ClickHouse CDC</a></p>
</li>
</ol>
<h2 id="heading-thank-you">Thank you!</h2>
<p>We would like to thank all of our customers and community for your constant support from our early days. Your trust and feedback have been instrumental in helping PeerDB get to where it is today. We are grateful for everything you've done, and the above product decisions are a small token of our gratitude. If you have any questions, please don't hesitate to reach out to us at <a target="_blank" href="mailto:founders@peerdb.io">founders@peerdb.io</a> or send a direct <a target="_blank" href="https://join.slack.com/t/peerdb-public/shared_invite/zt-1wo9jydev-EXInbMtCtpAKFFWdi7QvLQ">Slack</a> message. Thank you again!</p>
<h2 id="heading-references">References</h2>
<p>For more information on the acquisition, you can follow the below links:</p>
<ol>
<li><p><a target="_blank" href="https://clickhouse.com/blog/clickhouse-welcomes-peerdb-adding-the-fastest-postgres-cdc-to-the-fastest-olap-database">Blog post by ClickHouse</a></p>
</li>
<li><p><a target="_blank" href="https://clickhouse.com/blog/clickhouse-acquires-peerdb-to-boost-real-time-analytics-with-postgres-cdc-integration">Official Press Release</a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[PeerDB is now SOC 2 Type 2 Compliant]]></title><description><![CDATA[At PeerDB, security has always been a top priority. Our customers trust us with their critical data, and we are dedicated to upholding the highest standards of data protection and security. We are excited to announce that PeerDB has achieved SOC 2 Ty...]]></description><link>https://blog.peerdb.io/peerdb-is-now-soc-2-type-2-compliant</link><guid isPermaLink="true">https://blog.peerdb.io/peerdb-is-now-soc-2-type-2-compliant</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Security]]></category><category><![CDATA[ETL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[compliance ]]></category><category><![CDATA[SOC 2 Type 2]]></category><category><![CDATA[privacy]]></category><category><![CDATA[Data security]]></category><category><![CDATA[data privacy]]></category><dc:creator><![CDATA[Kunal Gupta]]></dc:creator><pubDate>Wed, 19 Jun 2024 06:15:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1718777522218/537b239b-bc00-4f37-bf6b-0aa29a6edece.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At <a target="_blank" href="https://www.peerdb.io/">PeerDB</a>, security has always been a top priority. Our customers trust us with their critical data, and we are dedicated to upholding the highest standards of data protection and security. We are excited to announce that PeerDB has achieved SOC 2 Type II compliance, demonstrating our unwavering commitment to maintaining a secure and reliable platform. <a target="_blank" href="https://trust.peerdb.io/resources?s=t6nuewj8c7b5p948qzhth&amp;name=soc-2-type-ii">Our SOC 2 report is now available in the Trust Center for viewing</a>.</p>
<h1 id="heading-what-is-soc-2"><strong>What is SOC 2?</strong></h1>
<p>SOC 2, or System and Organization Controls 2, is a framework governed by the American Institute of Certified Public Accountants (AICPA). It is designed to assess the controls and processes involved in storing, processing, and protecting customer data. SOC 2 reports focus on five Trust Services Criteria (TSC): Security, Availability, Processing Integrity, Confidentiality, and Privacy. Every SOC 2 report must cover Security, but organizations can choose to include additional criteria relevant to their operations.</p>
<h1 id="heading-soc-2-type-ii"><strong>SOC 2 Type II</strong></h1>
<p>A <a target="_blank" href="https://trust.peerdb.io/resources?s=t6nuewj8c7b5p948qzhth&amp;name=soc-2-type-ii">SOC 2 Type II report</a> goes beyond evaluating the design and implementation of controls at a single point in time. It assesses the operating effectiveness of these controls over a defined period, typically three months to a year. Achieving SOC 2 Type II compliance means that PeerDB has not only designed appropriate security controls but also maintained their effectiveness over time.</p>
<h1 id="heading-the-journey-to-soc-2-compliance"><strong>The Journey to SOC 2 Compliance</strong></h1>
<p>Our path to SOC 2 compliance was meticulous and comprehensive. Here’s a look at the steps we took:</p>
<ol>
<li><p><strong>Policy Crafting</strong>: Documenting all policies, procedures, and operational controls.</p>
</li>
<li><p><strong>Risk Assessment</strong>: Conducting thorough evaluations of our systems to identify and mitigate potential vulnerabilities.</p>
</li>
<li><p><strong>Vendor Management</strong>: Conducting thorough evaluations of our third-party vendors to ensure they meet our stringent security standards. We partnered with <a target="_blank" href="https://advantage-partners.com/">Advantage Partners</a>, who served as our auditor, to ensure that all vendors were compliant with our security requirements.</p>
</li>
<li><p><strong>Evidence Gathering</strong>: Collecting extensive evidence to demonstrate compliance with required controls. We partnered with <a target="_blank" href="https://www.vanta.com/">Vanta</a> to streamline this process, leveraging their platform to automate evidence collection and monitoring.</p>
</li>
</ol>
<h1 id="heading-why-soc-2-compliance-matters"><strong>Why SOC 2 Compliance Matters?</strong></h1>
<p>Achieving SOC 2 compliance is about more than just meeting regulatory requirements; it’s about building trust with our clients and partners. It underscores our commitment to maintaining the highest level of security and reliability.</p>
<h2 id="heading-benefits-for-our-customers"><strong>Benefits for Our Customers</strong></h2>
<ul>
<li><p><strong>Enhanced Security</strong>: SOC 2 compliance guarantees robust protection for your data, including advanced encryption and strict access controls.</p>
</li>
<li><p><strong>Transparency and Control</strong>: <a target="_blank" href="https://trust.peerdb.io/">Our Trust Center</a> provides detailed information about our security practices, giving you the assurance and control you need over your data.</p>
</li>
<li><p><strong>Ongoing Improvement</strong>: Our dedication to security doesn’t stop here. We continuously evaluate and enhance our measures to stay ahead of emerging threats.</p>
</li>
</ul>
<h1 id="heading-looking-ahead"><strong>Looking Ahead</strong></h1>
<p>While we celebrate this achievement, we are also focused on the future. We will continue to pursue additional certifications and audits to further validate our commitment to security excellence.</p>
<h2 id="heading-empowering-your-business-with-peerdb-cloud"><strong>Empowering Your Business with PeerDB Cloud</strong></h2>
<p><a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a> offers a secure and scalable platform for all your Postgres Data Movement needs. With SOC 2 compliance at its core, <a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a> ensures that your data is protected in a robust and reliable environment. Additionally, <a target="_blank" href="https://blog.peerdb.io/peerdb-is-gdpr-compliant">we are already GDPR compliant</a>, representing cementing our dedication to data protection and privacy and creating a secure and trusted environment for all our clients.<br /><a target="_blank" href="https://docs.peerdb.io/peerdb-cloud/cloud-security">Our docs also summarize our Cloud Security posture</a> and provide a high level overview of what PeerDB Cloud offers in terms of data protection, compliance measures, and security best practices.</p>
<h2 id="heading-trust-and-assurance-at-peerdb">Trust and Assurance at PeerDB</h2>
<p>At PeerDB, we are dedicated to being a reliable partner in your digital journey. For more information on our SOC 2 compliance efforts, or any other security-related inquiries, please visit <a target="_blank" href="https://trust.peerdb.io/">our Trust Center</a>.</p>
<p>Thank you for being part of this journey with us. We look forward to continuing to provide secure and trusted solutions for all your data movement needs.</p>
]]></content:encoded></item><item><title><![CDATA[Overcoming Pitfalls of Postgres Logical Decoding]]></title><description><![CDATA[At PeerDB, we are building a fast and simple way to replicate data from Postgres to data warehouses like Snowflake, ClickHouse etc. and queues such as Kafka, Redpanda etc. We implement Postgres Change Data Capture (CDC) to reliably replicate changes ...]]></description><link>https://blog.peerdb.io/overcoming-pitfalls-of-postgres-logical-decoding</link><guid isPermaLink="true">https://blog.peerdb.io/overcoming-pitfalls-of-postgres-logical-decoding</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[change data capture]]></category><category><![CDATA[replication]]></category><category><![CDATA[ETL]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Databases]]></category><category><![CDATA[postgres]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Thu, 13 Jun 2024 20:00:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1718058082886/df7594ab-78dd-4c44-9af0-2edafd050f99.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At <a target="_blank" href="http://peerdb.io">PeerDB</a>, we are building a fast and simple way to replicate data from Postgres to data warehouses like Snowflake, ClickHouse etc. and queues such as Kafka, Redpanda etc. We implement <a target="_blank" href="https://blog.peerdb.io/peerdb-streams-simple-native-postgres-change-data-capture">Postgres Change Data Capture (CDC)</a> to reliably replicate changes from Postgres to other data stores. Postgres <a target="_blank" href="https://www.postgresql.org/docs/current/logicaldecoding-explanation.html">Logical Decoding</a> is a building block of Postgres CDC. It enables users to stream changes on Postgres as a sequence of logical operations like INSERTs, UPDATEs, and DELETEs.</p>
<p><a target="_blank" href="https://www.pgedge.com/blog/logical-replication-evolution-in-chronological-order-clustering-solution-built-around-logical-replication">Logical Decoding</a> has evolved quite a bit in the past few years in Postgres. However, there are a few quirks that users need to overcome. In this blog, we will summarize common issues and learnings from over 20 customers replicating more than 300 TB of data per month with logical decoding.</p>
<h1 id="heading-beware-of-replication-slot-growth-how-to-monitor-it">Beware of replication slot growth – how to monitor it?</h1>
<p>A <a target="_blank" href="https://www.postgresql.org/docs/current/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS">logical replication slot</a> captures changes in the Postgres Write-Ahead Log (WAL) and streams them in a human-readable format to the client. A common issue with logical decoding is unexpected replication slot growth, which can risk filling up storage and causing server crashes. Slot growth mostly occurs when the consumer application (a.k.a. client) that reads changes from a replication slot lags or halts. The consumer application can lag for various reasons, including not consuming the slot appropriately and the high throughput on the Postgres database, combined with logical decoding being single-threaded. More on this topic in the next section.</p>
<p>You can monitor replication slot growth using the below queries:</p>
<pre><code class="lang-sql"><span class="hljs-comment">/* the below query should always return "true"
indicating slot is always getting consumed. */</span>
<span class="hljs-keyword">SELECT</span> slot_name,active <span class="hljs-keyword">FROM</span> pg_replication_slots ;

<span class="hljs-comment">/* monitor the size of the slot using the below query */</span>
<span class="hljs-keyword">SELECT</span> slot_name, 
 pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),restart_lsn)) 
 <span class="hljs-keyword">AS</span> replication_lag_bytes 
 <span class="hljs-keyword">FROM</span> pg_replication_slots;
</code></pre>
<p><strong>NOTE:</strong> Just for reference, we open-sourced the <a target="_blank" href="https://blog.peerdb.io/pg-slot-notify-monitor-postgres-slot-growth-in-slack">PG Slot Notify bot</a>, which helps you monitor replication slot size. <a target="_blank" href="https://github.com/PeerDB-io/pgslot-notify-bot">pgslot-notify-bot</a> helps monitor PostgreSQL replication slots by sending alerts once the size threshold is reached.</p>
<h1 id="heading-tips-for-keeping-replication-slot-growth-in-check">Tips for keeping replication slot growth in check</h1>
<h2 id="heading-always-consume-the-replication-slot">Always consume the replication slot</h2>
<p>Logical decoding is a single-threaded process, whereas Postgres allows multiple concurrent connections/threads to ingest data. This means that if the client doesn't consume the replication slot fast enough, the slot can quickly grow.</p>
<p>The first step toward efficiency is to ensure that the client always consumes the replication slot and maximizes resource utilization. Intermittent reading of the slot with constant disconnections can slow down logical decoding. Periodic reconnections can lead to other inefficiencies, as logical decoding may need to restart from the beginning of the WAL instead of continuing the stream.</p>
<p>At PeerDB, we implemented this optimization. We ensure that the replication slot is always consumed and flushed to <a target="_blank" href="https://github.com/PeerDB-io/peerdb/pull/1780">internal stages such as S3</a>.</p>
<h2 id="heading-beware-of-long-running-transactions">Beware of long-running transactions</h2>
<p>Long-running transactions can lead to WAL buildup. Since WAL is sequential, Postgres cannot flush the WAL until the long transaction completes, even as other transactions are being consumed. This can result in an increased slot size and slow down logical decoding. For each transaction being decoded, changes from long-running transactions that overlap with the current transaction must also be decoded again and again.</p>
<h3 id="heading-configure-statementtimeout-and-idleintransactionsessiontimeout-to-avoid-long-running-transactions"><strong>Configure</strong> <code>statement_timeout</code> <strong>and</strong> <code>idle_in_transaction_session_timeout</code> to avoid long running transactions</h3>
<p>Long-running transactions can occur either due to active queries running for a long time or stale transactions that were never committed. To avoid these scenarios, you should configure <code>statement_timeout</code>, which terminates queries that run longer than expected, and <code>idle_in_transaction_session_timeout</code>, which terminates stale transactions.</p>
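<p>As a minimal sketch (the values below are placeholders to tune for your workload, and both settings can also be applied per role or per database instead of cluster-wide):</p>
<pre><code class="lang-sql">-- Cap how long any single statement may run.
ALTER SYSTEM SET statement_timeout = '1h';

-- Terminate sessions that sit idle inside an open transaction.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';

-- Reload the configuration so the new defaults take effect.
SELECT pg_reload_conf();
</code></pre>
<p>Note that a cluster-wide <code>statement_timeout</code> also cancels legitimate long-running queries, so many teams prefer setting it on specific roles or databases.</p>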
<h2 id="heading-use-logical-replication-protocols">Use logical replication protocols</h2>
<p>The replication protocol command (<code>START_REPLICATION</code>) supports different versions of streaming, controlled via the <code>proto_version</code> parameter. The default <code>proto_version</code> (v1) allows clients to consume changes only from committed transactions. <code>proto_version</code> v2 allows clients to consume changes from in-flight transactions, improving performance by letting them process changes immediately without waiting for the COMMIT. However, it is the client's responsibility to handle transaction semantics. At PeerDB, we are working on a <a target="_blank" href="https://github.com/PeerDB-io/peerdb/pull/1712">feature</a> that supports <code>proto_version</code> v2.</p>
<p>This changes decoding from an O(N^2) operation to an O(N) operation and also helps address scenarios with long-running transactions. This <a target="_blank" href="https://blog.peerdb.io/exploring-versions-of-the-postgres-logical-replication-protocol">blog</a> provides a deep dive into how replication slot growth is affected with v2.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706568527289/9278eb76-9e32-4916-8b12-0641636edd3a.png?auto=compress,format&amp;format=webp" alt /></p>
<p><strong>Reference:</strong> <a target="_blank" href="https://blog.peerdb.io/exploring-versions-of-the-postgres-logical-replication-protocol">https://blog.peerdb.io/exploring-versions-of-the-postgres-logical-replication-protocol</a></p>
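<p>For reference, this is roughly what consuming a slot with protocol version 2 looks like when using the <code>pgoutput</code> plugin. <code>START_REPLICATION</code> must be issued over a replication connection, and the slot and publication names below are placeholders:</p>
<pre><code class="lang-sql">-- Run on a replication connection (e.g. psql "dbname=postgres replication=database").
-- proto_version '2' together with streaming 'on' lets the client receive changes
-- from large in-flight transactions before they COMMIT (Postgres 14+).
START_REPLICATION SLOT peerdb_slot LOGICAL 0/0 (
    proto_version '2',
    publication_names 'peerdb_publication',
    streaming 'on'
);
</code></pre>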
<h2 id="heading-no-activity-can-lead-to-replication-slot-growth">No activity can lead to replication slot growth.</h2>
<p>It is common to see replication slots grow in size in dev/test/QA databases during periods of inactivity. In such scenarios, the WAL continues to grow due to maintenance processes like VACUUMs. To avoid this and ensure that the slot is consistently consumed by the client, you can follow one of the approaches below:</p>
<ol>
<li><p><strong>Include a heartbeat table</strong> that continuously gets updated in your replication pipeline. This ensures that the slot keeps moving. More details on this approach can be found <a target="_blank" href="https://docs.peerdb.io/bestpractices/heartbeat">here</a>.</p>
</li>
<li><p><strong>Use</strong> <a target="_blank" href="https://pgpedia.info/p/pg_logical_emit_message.html"><code>pg_logical_emit_message</code></a> to periodically emit a message in the WAL and ensure that this message is consumed by the client by confirming the LSN (Log Sequence Number) of the message. A sketch of both approaches is shown after this list.</p>
</li>
</ol>
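<p>A rough sketch of both approaches is below; the table, slot, and message prefix names are placeholders, and the statements would typically be run on a schedule by cron, <code>pg_cron</code>, or the replication tool itself:</p>
<pre><code class="lang-sql">-- Approach 1: a tiny heartbeat table that is included in the publication and
-- updated periodically, so the slot always has fresh changes to consume.
CREATE TABLE IF NOT EXISTS peerdb_heartbeat (id int PRIMARY KEY, beat timestamptz);
INSERT INTO peerdb_heartbeat VALUES (1, now())
ON CONFLICT (id) DO UPDATE SET beat = now();

-- Approach 2: emit a non-transactional logical message directly into the WAL;
-- the client then confirms the returned LSN to advance the slot.
SELECT pg_logical_emit_message(false, 'peerdb_heartbeat', now()::text);
</code></pre>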
<h2 id="heading-use-table-filtering-when-creating-publications">Use table filtering when creating PUBLICATIONs</h2>
<p>If you are replicating changes from only a few tables, ensure that you create a PUBLICATION that includes just those tables. Postgres efficiently persists changes for only those tables in the replication slot. This helps reduce the size of the replication slot and improves logical decoding performance.</p>
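<p>For example, instead of <code>FOR ALL TABLES</code>, list only the tables you replicate (the names below are placeholders):</p>
<pre><code class="lang-sql">-- Publication restricted to the tables being replicated.
CREATE PUBLICATION peerdb_publication FOR TABLE public.orders, public.customers;

-- Tables can be added later as the pipeline grows.
ALTER PUBLICATION peerdb_publication ADD TABLE public.invoices;
</code></pre>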
<h2 id="heading-some-useful-postgres-configs">Some useful Postgres configs</h2>
<h3 id="heading-maxslotwalkeepsize"><strong>max_slot_wal_keep_size</strong></h3>
<p>To keep your logical replication slots from consuming excessive disk space, set the <a target="_blank" href="https://postgresqlco.nf/doc/en/param/max_slot_wal_keep_size/"><code>max_slot_wal_keep_size</code></a> parameter. This config limits the amount of WAL (Write-Ahead Log) data a replication slot can retain. Choose a size that suits your environment to ensure old WAL files are removed when the limit is reached, preventing disk space issues.</p>
<h3 id="heading-logicaldecodingworkmem"><strong>logical_decoding_work_mem</strong></h3>
<p>To control memory usage during logical decoding, adjust the <a target="_blank" href="https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-LOGICAL-DECODING-WORK-MEM"><code>logical_decoding_work_mem</code></a> parameter. This setting allocates a specific amount of memory for the decoding process of each replication slot. Set a value that balances memory use and performance according to your system's capacity and workload. You can consider increasing <code>logical_decoding_work_mem</code> if you observe IO as the <code>wait_event</code> for the <code>START_REPLICATION</code> process. More details on tuning <code>logical_decoding_work_mem</code> can be found <a target="_blank" href="https://www.enterprisedb.com/postgres-tutorials/postgres-13-logicaldecodingworkmem-and-how-it-saves-your-server-going-out-memory">here</a>.</p>
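<p>Both settings can be changed without a restart; the values below are illustrative starting points rather than recommendations:</p>
<pre><code class="lang-sql">-- Cap how much WAL a slot may retain (Postgres 13+). Once the limit is exceeded,
-- the slot is invalidated instead of filling up the disk, so size it generously.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';

-- Give logical decoding more memory before it spills decoded changes to disk.
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';

SELECT pg_reload_conf();
</code></pre>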
<h1 id="heading-supporting-ddl-changes">Supporting DDL Changes</h1>
<p>One of the most common and well-known issues with logical decoding is that it doesn't capture schema changes such as adding or dropping columns, changing data types, adding new tables, and so on.</p>
<p>An approach that clients could follow is to leverage Relation and Type messages that logical decoding provides. Whenever columns are added or dropped, Postgres sends a Relation (<strong>'R'</strong>) message with the new schema, preceding the new row. Clients can perform a diff with the old schema to identify the new or dropped columns. A similar approach can be followed for supporting changing data types, where Postgres sends a Type (<strong>'T'</strong>) message.</p>
<p>Within PeerDB, we <a target="_blank" href="https://github.com/PeerDB-io/peerdb/pull/480">implemented</a> the above intricate approach to support automatic schema changes, such as adding or dropping columns.</p>
<h1 id="heading-toast-columns-need-replica-identity-full">TOAST columns need REPLICA IDENTITY FULL</h1>
<p>Logical decoding doesn't capture the values of TOAST columns (large column values that Postgres stores out of line) that haven't been changed in an UPDATE operation. You need to enable <code>REPLICA IDENTITY FULL</code> for a table to capture the values of unchanged TOAST columns.</p>
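<p>Enabling it is a one-line change per table (the table name below is a placeholder):</p>
<pre><code class="lang-sql">-- Log all old column values, including unchanged TOASTed ones, for UPDATEs and DELETEs.
ALTER TABLE public.orders REPLICA IDENTITY FULL;

-- Verify the setting: 'f' = full, 'd' = default (primary key).
SELECT relname, relreplident FROM pg_class WHERE relname = 'orders';
</code></pre>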
<p><a target="_blank" href="https://xata.io/blog/replica-identity-full-performance">Here</a> is a useful blog that talks about the impact of setting <code>REPLICA IDENTITY FULL</code> on your Postgres database. <strong>TL;DR:</strong><code>REPLICA IDENTITY FULL</code> might be fine for tables with primary keys or from Postgres 16, where indexes can be used on the subscriber side for searching the rows.</p>
<p>In PeerDB, for certain targets, we <a target="_blank" href="https://github.com/PeerDB-io/peerdb/pull/111/">implemented</a> a method to replicate unchanged TOAST columns without requiring <code>REPLICA IDENTITY FULL</code>.</p>
<h1 id="heading-logical-decoding-doesnt-support-generated-columns">Logical decoding doesn’t support generated columns</h1>
<p>One limitation of Postgres logical decoding is that it doesn’t support generated columns. Generated columns are columns whose values are automatically computed from other columns in the table using a specified expression. Values of these columns appear as <code>NULL</code> for clients consuming the logical replication slots. A few workarounds include:</p>
<ol>
<li><p><strong>In-flight transformations:</strong> Compute the value of generated columns while the data is in transit by performing transformations. PeerDB supports <a target="_blank" href="https://blog.peerdb.io/row-level-transformations-in-postgres-cdc-using-lua">row-level transformations</a> out-of-the-box to enable such generated column use cases.</p>
</li>
<li><p><strong>Extract, Load, and Transform (ELT):</strong> Another approach we've seen customers follow is to perform transformations once the data reaches the target. Customers often use transformation tools such as <a target="_blank" href="https://www.getdbt.com/lp/free-account?utm_medium=paid-search&amp;utm_source=google&amp;utm_campaign=q2-2024_us-brand_cv&amp;utm_content=cloud-account_kw-dbt-ex___&amp;utm_term=all_na_us&amp;utm_term=dbt&amp;utm_campaign=q2-2024_us-brand_cv&amp;utm_source=adwords&amp;utm_medium=ppc&amp;hsa_acc=8253637521&amp;hsa_cam=20002625512&amp;hsa_grp=147774946229&amp;hsa_ad=660676532053&amp;hsa_src=g&amp;hsa_tgt=kwd-95889999&amp;hsa_kw=dbt&amp;hsa_mt=e&amp;hsa_net=adwords&amp;hsa_ver=3&amp;gad_source=1&amp;gclid=Cj0KCQjwpZWzBhC0ARIsACvjWROh2AYD3CWBPyhCNudSTux9nFLIZVa_IB0k9HkRhyQ_b3NEJ3kSxKYaAmwtEALw_wcB">DBT</a> or <a target="_blank" href="https://coalesce.io/">Coalesce</a>.</p>
</li>
</ol>
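<p>To make the limitation concrete, here is a small hypothetical table with a generated column; on the consumer side of logical decoding, <code>total</code> arrives as <code>NULL</code>, which is what the two workarounds above compensate for:</p>
<pre><code class="lang-sql">CREATE TABLE public.line_items (
    id       bigint PRIMARY KEY,
    price    numeric NOT NULL,
    quantity int NOT NULL,
    -- computed by Postgres, but not emitted through logical decoding
    total    numeric GENERATED ALWAYS AS (price * quantity) STORED
);
</code></pre>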
<h1 id="heading-logical-replication-slots-dont-persist-on-postgres-upgrades">Logical Replication Slots Don't Persist on Postgres Upgrades</h1>
<p>Upgrading PostgreSQL versions presents a challenge because logical replication slots do not persist through upgrades. However, it's possible to manage upgrades without full resyncs by recreating the replication slot during maintenance.</p>
<p>The process is as follows:</p>
<ol>
<li><p><strong>Enter Maintenance Mode</strong>: Place the database in maintenance mode to prevent data changes.</p>
</li>
<li><p><strong>Upgrade PostgreSQL</strong>: Perform the PostgreSQL version upgrade.</p>
</li>
<li><p><strong>Recreate Replication Slot</strong>: Recreate the logical replication slot.</p>
</li>
<li><p><strong>Exit Maintenance Mode</strong>: Resume normal operations.</p>
</li>
</ol>
<p>This method ensures minimal disruption and avoids the need for a complete resync of the data.</p>
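<p>For step 3, recreating the slot is a single call once the upgraded server is up; the slot and plugin names below are placeholders and should match whatever your replication pipeline expects:</p>
<pre><code class="lang-sql">-- Recreate the logical replication slot after the upgrade, before resuming writes.
SELECT pg_create_logical_replication_slot('peerdb_slot', 'pgoutput');

-- Verify it exists and note its starting LSN.
SELECT slot_name, plugin, restart_lsn FROM pg_replication_slots;
</code></pre>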
<h2 id="heading-starting-postgres-17-replication-slots-are-persisted-on-upgrades"><strong>Starting Postgres 17 replication slots are persisted on upgrades</strong></h2>
<p>Starting with PostgreSQL 17, logical replication slots will persist through version upgrades. This improvement will apply to future upgrades, such as from version 17 to 18, simplifying the upgrade process significantly.</p>
<h1 id="heading-logical-replication-slot-dont-persist-failovers">Logical Replication Slot don't persist Failovers</h1>
<p>Another issue with logical decoding is that replication slots don't persist during a failover, i.e., when a standby becomes primary. One potential solution is to implement retry logic in your clients to recreate the replication slot post-failover. However, this approach is not fully reliable and can incur data loss, as it is not trivial to recreate the slot immediately after failover, before new data is ingested.</p>
<h2 id="heading-postgres-17-will-support-failover-slots">Postgres 17 will support failover slots</h2>
<p>The good news is that PostgreSQL 17 will support failover slots, allowing replication slots to persist automatically through failovers. This enhancement simplifies the failover process, ensures data reliability, and reduces manual intervention, resulting in more robust and resilient replication handling.</p>
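<p>Based on our reading of the Postgres 17 documentation, a slot can be marked for failover at creation time and kept in sync on the standby roughly as follows (the values are illustrative, and the standby must already be connected to the primary via a physical slot):</p>
<pre><code class="lang-sql">-- On the primary: create the slot with failover enabled
-- (the fifth argument to pg_create_logical_replication_slot in Postgres 17).
SELECT pg_create_logical_replication_slot('peerdb_slot', 'pgoutput', false, false, true);

-- On the standby: synchronize failover-enabled slots from the primary.
ALTER SYSTEM SET sync_replication_slots = on;
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
</code></pre>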
<h1 id="heading-conclusion">Conclusion</h1>
<p>At PeerDB, we are building a replication tool with a laser focus on Postgres. To provide the fastest and most reliable Postgres replication experience, we delve deeply into understanding Postgres logical decoding. This blog summarizes our efforts over the past year. Many of the challenges discussed above have already been addressed in our product, and for those that haven't, we work closely with our customers to find and implement workarounds. We hope you enjoyed reading the blog! If you want to give PeerDB a try, these links should prove useful :)</p>
<ol>
<li><p><a target="_blank" href="https://github.com/PeerDB-io/peerdb">PeerDB's Github repo</a></p>
</li>
<li><p><a target="_blank" href="https://docs.peerdb.io/quickstart">Quickstart</a></p>
</li>
<li><p><a target="_blank" href="https://www.peerdb.io/sign-up">Directly reach out to us!</a></p>
</li>
<li><p><a target="_blank" href="https://join.slack.com/t/peerdb-public/shared_invite/zt-1wo9jydev-EXInbMtCtpAKFFWdi7QvLQ">Join PeerDB's Slack community</a></p>
</li>
<li><p><a target="_blank" href="https://docs.peerdb.io/introduction">PeerDB docs</a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Postgres to Elasticsearch Real time Replication using PeerDB]]></title><description><![CDATA[Today, PeerDB is pleased to announce that our target connector for Elasticsearch is now in beta. Elasticsearch is a popular search engine system underpinned by a distributed document database, and we have been seeing a lot of use cases for Elasticsea...]]></description><link>https://blog.peerdb.io/postgres-to-elasticsearch-real-time-replication-using-peerdb</link><guid isPermaLink="true">https://blog.peerdb.io/postgres-to-elasticsearch-real-time-replication-using-peerdb</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[elasticsearch]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[replication]]></category><dc:creator><![CDATA[Kevin Biju]]></dc:creator><pubDate>Thu, 09 May 2024 14:53:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1715256386299/ca42c778-87b9-4acb-8024-2f180cf432de.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, <a target="_blank" href="https://peerdb.io">PeerDB</a> is pleased to announce that our target connector for Elasticsearch is now in beta. Elasticsearch is a popular search engine system underpinned by a distributed document database, and we have been seeing a lot of use cases for Elasticsearch in our customers' data stacks. This is our first connector for a document database, and we are excited to bring PeerDB's performance, reliability and value to users looking to move data from Postgres to Elasticsearch.</p>
<p>This post explains some of the use cases that are enabled by Postgres to Elasticsearch replication, followed by a quick demo showcasing the high performance and low latency of Postgres to Elasticsearch replication using PeerDB. Finally, we go through a high-level overview of the architecture of the connector.</p>
<h2 id="heading-postgres-to-elasticsearch-replication-use-cases">Postgres to Elasticsearch Replication Use cases</h2>
<p>Some common use cases for Postgres to Elasticsearch replication via CDC or query replication are:</p>
<ol>
<li><p><strong>Efficient search for large ingest volumes</strong>: Elasticsearch's bread and butter use case is as a search engine operating efficiently even on humongous volumes of data. From full-text and weighted search to even complex semantic searches using built-in NLP models, Elasticsearch is very flexible and tunable. It is commonly used for ingesting and indexing large volumes of logs, and even as a backing engine for searching large websites and internal knowledge bases.</p>
</li>
<li><p><strong>Denormalizing data to documents:</strong> Data models are often stored in Postgres in a highly normalized form, which is great for transactional integrity but bad for complex queries where you may have to use joins or CTEs. Elasticsearch, being a document database, prefers storing data in a denormalized form. Using PeerDB's query replication capabilities, you are able to periodically transform your data into a denormalized form, which makes it more efficient for querying by downstream consumers. Some processing can also be done using an Elasticsearch <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipeline</a>.</p>
</li>
</ol>
<h2 id="heading-low-latency-replication-from-postgres-to-elasticsearch-using-peerdb">Low latency replication from Postgres to Elasticsearch using PeerDB</h2>
<p>In this section, I'll walk through a quick demonstration of Postgres to Elasticsearch replication using PeerDB in <strong>Change Data Capture (CDC) mode</strong>. Using PeerDB for replication from Postgres to Elasticsearch offers a few benefits, the primary ones being <a target="_blank" href="https://blog.peerdb.io/how-can-we-make-pgdump-and-pgrestore-5-times-faster#heading-parallel-snapshotting-to-make-pgdump-amp-pgrestore-multi-threaded-per-table">blazing-fast initial loads</a> and <a target="_blank" href="https://blog.peerdb.io/exploring-versions-of-the-postgres-logical-replication-protocol">sub-minute latencies by constantly reading the slot</a>. PeerDB is able to offer these by being laser-focused on Postgres replication.</p>
<h3 id="heading-postgres-setup">Postgres Setup</h3>
<p>You can use any Postgres database in the cloud or on-prem. For simplicity, I'm using a Postgres cluster running locally in a Docker container for this demo. We create a table named <code>oss1</code> with a continuous ingest of 1000 rows per second using a multi-valued insert statement.</p>
<pre><code class="lang-sql">postgres=<span class="hljs-comment"># CREATE TABLE oss1 (</span>
           id INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
           c1 INT,
           c2 INT,
           t TEXT,
           updated_at TIMESTAMP <span class="hljs-keyword">WITH</span> <span class="hljs-built_in">TIME</span> ZONE <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>()
         );
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span>
postgres=<span class="hljs-comment"># INSERT INTO oss1 (c1, c2, t)</span>
<span class="hljs-keyword">SELECT</span>
    generate_series <span class="hljs-keyword">AS</span> c1,
    generate_series * <span class="hljs-number">2</span> <span class="hljs-keyword">AS</span> c2,
    <span class="hljs-string">'text_'</span> || generate_series <span class="hljs-keyword">AS</span> t
<span class="hljs-keyword">FROM</span> 
    generate_series(<span class="hljs-number">1</span>, <span class="hljs-number">1000</span>); 
<span class="hljs-comment"># to run the INSERT once per second    </span>
postgres=<span class="hljs-comment"># \watch 1</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-number">0</span> <span class="hljs-number">1000</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-number">0</span> <span class="hljs-number">1000</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-number">0</span> <span class="hljs-number">1000</span>
</code></pre>
<h3 id="heading-elasticsearch-setup">Elasticsearch Setup</h3>
<p>You can set up an Elasticsearch instance using its <a target="_blank" href="https://github.com/deviantony/docker-elk">Docker compose setup</a> on-prem or on a cloud VM. Alternatively, you can use <a target="_blank" href="https://www.elastic.co/cloud">Elasticsearch Cloud</a>. For this demo, I am using the Docker compose setup running locally.</p>
<h3 id="heading-peerdb-setup">PeerDB Setup</h3>
<p>You can use <a target="_blank" href="https://github.com/PeerDB-io/peerdb">PeerDB Open Source</a> or <a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a> to deploy a PeerDB instance. For the scope of this demo, I'm deploying PeerDB open source locally via Docker compose.</p>
<h3 id="heading-create-peers-and-mirror-for-postgres-to-elasticsearch-replication">Create Peers and Mirror for Postgres to Elasticsearch Replication</h3>
<p>In the PeerDB world, peers refer to either source or target data stores. You can use PeerDB's UI to create the <a target="_blank" href="https://docs.peerdb.io/connect/rds_postgres#create-rds-postgres-peer-in-peerdb">Postgres</a> and the <a target="_blank" href="https://docs.peerdb.io/connect/elasticsearch">Elasticsearch</a> peers. A mirror is then created between a source peer and a destination peer for data replication. You can use PeerDB's UI to create a MIRROR for replicating data from Postgres to Elasticsearch.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1715116855623/35169594-b925-4c90-bdfd-45f6cd6a863b.png" alt class="image--center mx-auto" /></p>
<p>I have created a <a target="_blank" href="https://docs.peerdb.io/mirror/cdc-pg-clickhouse"><strong>Change Data Capture (CDC) based MIRROR</strong></a> that replicates data using Postgres' <a target="_blank" href="https://www.postgresql.org/docs/current/wal-intro.html">Write-Ahead Log (WAL)</a> and <a target="_blank" href="https://www.postgresql.org/docs/current/logicaldecoding-explanation.html#LOGICALDECODING-EXPLANATION-LOG-DEC">Logical Decoding</a>. It involves two steps:</p>
<ol>
<li><p><strong>An initial load</strong> that takes a fully consistent snapshot of existing data in Postgres and copies it to Elasticsearch; Through PeerDB's <a target="_blank" href="https://blog.peerdb.io/parallelized-initial-load-for-cdc-based-streaming-from-postgres#heading-parallelized-initial-snapshot-for-cdc-based-streaming">parallel snapshotting</a>, you can expect significantly faster initial loads. We've seen terabytes of data moved in hours vs days.</p>
</li>
<li><p><strong>Change Data Capture (CDC):</strong> Once the initial load is completed, PeerDB constantly reads changes in Postgres through the logical replication slot and replicates those changes to Elasticsearch. Thanks to our streaming architecture, expect data latency in the range of seconds for a continuously running mirror to Elasticsearch.</p>
</li>
</ol>
<p>The initial load should complete pretty quickly, and rows should be present in the created Elasticsearch index. After entering continuous CDC mode, new rows should show up as and when they are inserted. Attached below is a quick video showing a Postgres to Elasticsearch CDC mirror.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.loom.com/share/59af80e870dc4b6791e95e4a136ea8e1?sid=d5e9068f-1e6d-4b03-b688-49f4171ad057">https://www.loom.com/share/59af80e870dc4b6791e95e4a136ea8e1?sid=d5e9068f-1e6d-4b03-b688-49f4171ad057</a></div>
<p> </p>
<h2 id="heading-architecture-and-design-choices">Architecture and Design Choices</h2>
<p>We've <a target="_blank" href="https://blog.peerdb.io/building-a-streaming-platform-in-go-for-postgres">talked about</a> PeerDB's streaming architecture in detail before, but in summary PeerDB utilizes Go's goroutines and channels to efficiently read data from PostgreSQL using logical replication, and then pushes it to Elasticsearch in batches through the Bulk API. This approach enhances the execution time by enabling parallel processing.</p>
<p>Our data warehouse connectors store the data in a staging table before pushing it to the final table for cost and performance reasons. Due to Elasticsearch's architecture and query language, we are able to avoid this intermediate step and send the stream of processed records directly to Elasticsearch indices via the <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html">bulk API</a>.</p>
<h3 id="heading-handling-updates-and-deletes-in-elasticsearch"><strong>Handling Updates and Deletes in Elasticsearch</strong></h3>
<p>PeerDB supports Elasticsearch as a target for both CDC and query replication. In most cases we recommend using CDC because of its ease of use, increased reliability and its ability to replicate DELETEs to Elasticsearch. However, this limits the scope of transformations that can be done before loading to Elasticsearch.</p>
<p>To support deduplication on the Elasticsearch side, we need a unique ID for each document that remains consistent so we can update or delete it as per the source. For tables with one column in the primary key, the value of the column itself can be used. For tables with multiple columns in the primary key, we instead choose to hash the values of the columns together, giving a small unique identifier irrespective of the width of the row.</p>
<pre><code class="lang-go"><span class="hljs-comment">// simplified Go code</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">primaryKeyColsHash</span><span class="hljs-params">(record []any, colIndices []<span class="hljs-keyword">int</span>)</span> <span class="hljs-title">string</span></span> {
    hasher := sha256.New()

    <span class="hljs-keyword">for</span> _, colIndex := <span class="hljs-keyword">range</span> colIndices {
        <span class="hljs-comment">// write the value to the hasher</span>
        _, _ = fmt.Fprint(hasher, record[colIndex])
    }
    hashBytes := hasher.Sum(<span class="hljs-literal">nil</span>)
    <span class="hljs-keyword">return</span> base64.RawURLEncoding.EncodeToString(hashBytes)
}
</code></pre>
<pre><code class="lang-json"># Sample document uploaded by PeerDB to Elasticsearch.
# Note how the _id field is a (base64 encoded) hash of the
# primary key columns id and c1.
{
  <span class="hljs-attr">"_index"</span>: <span class="hljs-string">"public.oss2"</span>,
  <span class="hljs-attr">"_id"</span>: <span class="hljs-string">"SAgdSqEaQyGYWxOo8Dj2s0DbXsQXLTC_CWlds8-c4kY"</span>,
  <span class="hljs-attr">"_version"</span>: <span class="hljs-number">1</span>,
  <span class="hljs-attr">"_seq_no"</span>: <span class="hljs-number">0</span>,
  <span class="hljs-attr">"_primary_term"</span>: <span class="hljs-number">1</span>,
  <span class="hljs-attr">"found"</span>: <span class="hljs-literal">true</span>,
  <span class="hljs-attr">"_source"</span>: {
    <span class="hljs-attr">"c1"</span>: <span class="hljs-number">434017</span>,
    <span class="hljs-attr">"c2"</span>: <span class="hljs-number">922856</span>,
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">8</span>,
    <span class="hljs-attr">"t"</span>: <span class="hljs-string">"pgbenchinsertc4b998821cc6b161e65489b3"</span>,
    <span class="hljs-attr">"updated_at"</span>: <span class="hljs-string">"2024-05-08T18:33:39.031107Z"</span>
  }
}
</code></pre>
<p>Query replication can be done in append mode, where any change creates a fresh document in Elasticsearch, or in upsert mode, where some columns are designated as key columns and documents are deduplicated on them in a way similar to CDC.</p>
<h3 id="heading-dynamic-mapping-for-data-types"><strong>Dynamic Mapping for Data Types</strong></h3>
<p>By default, PeerDB currently uses Elasticsearch's <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-field-mapping.html">dynamic mapping</a> to automatically infer a data type mapping based on the contents of the documents in an index. In practice, numeric types are mapped to either <code>long</code> or <code>float</code>, timestamp types are mapped to <code>date</code>, and most other types map to <code>text</code>. A more detailed mapping is available <a target="_blank" href="https://docs.peerdb.io/datatypes/datatype-matrix">here</a>. This works for many use cases. If needed, an explicit mapping can be provided by the user during manual index creation, and PeerDB will load documents into this index.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The Elasticsearch connector is in beta -- we already have customers who have moved billions of rows from Postgres to Elasticsearch using PeerDB. If you're an Elasticsearch user and wish to replicate data from Postgres to Elasticsearch using PeerDB, do give PeerDB a shot! We would love to help you out or get feedback:</p>
<ol>
<li><p><a target="_blank" href="https://app.peerdb.cloud/"><strong>Try PeerDB Cloud for free.</strong></a></p>
</li>
<li><p><a target="_blank" href="https://github.com/PeerDB-io/peerdb">Visit PeerDB's <strong>GitHub</strong> repository to Get Started.</a></p>
</li>
<li><p><a target="_blank" href="https://join.slack.com/t/peerdb-public/shared_invite/zt-1wo9jydev-EXInbMtCtpAKFFWdi7QvLQ">Join our Slack and say hi!</a></p>
</li>
</ol>
<p>Thanks for reading!</p>
]]></content:encoded></item><item><title><![CDATA[Row-level transformations in Postgres CDC using Lua]]></title><description><![CDATA[Earlier this week, we launched PeerDB Streams, our latest product offering for real-time replication from Postgres to queues and message brokers such as Kafka, Redpanda, Google PubSub, Azure Event Hubs, and others.
Today, we are announcing one of the...]]></description><link>https://blog.peerdb.io/row-level-transformations-in-postgres-cdc-using-lua</link><guid isPermaLink="true">https://blog.peerdb.io/row-level-transformations-in-postgres-cdc-using-lua</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[streaming]]></category><category><![CDATA[change data capture]]></category><category><![CDATA[kafka]]></category><category><![CDATA[Lua]]></category><category><![CDATA[postgres]]></category><category><![CDATA[Security]]></category><category><![CDATA[encryption]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Wed, 08 May 2024 15:16:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1715132433538/164a6649-1d6b-48ab-92eb-ec1a6cf7a405.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Earlier this week, we launched <a target="_blank" href="https://blog.peerdb.io/peerdb-streams-simple-native-postgres-change-data-capture">PeerDB Streams</a>, our latest product offering for real-time replication from Postgres to queues and message brokers such as <a target="_blank" href="https://kafka.apache.org/">Kafka</a>, <a target="_blank" href="https://redpanda.com/">Redpanda</a>, <a target="_blank" href="https://cloud.google.com/pubsub?hl=en">Google PubSub</a>, <a target="_blank" href="https://azure.microsoft.com/en-us/products/event-hubs">Azure Event Hubs</a>, and others.</p>
<p>Today, we are announcing one of the flagship features of this offering — support for <a target="_blank" href="https://docs.peerdb.io/lua/reference">row-level transformations</a> as part of Postgres Change Data Capture (CDC). You can write simple <a target="_blank" href="https://www.lua.org/">Lua</a> scripts to define a transformation and add it as part of the replication (MIRROR). With this feature, users will be able to seamlessly perform in-flight row-level transformations to Postgres data before it is streamed to the target.</p>
<p>In this blog, we will cover various use cases that require row-level transformations and how they can be accomplished using PeerDB. We will also walk through example use cases using sample Lua scripts. Toward the end, we will delve a bit deeper into why we chose Lua as the scripting language and how we implemented this feature.</p>
<h2 id="heading-row-level-transformation-in-postgres-cdc-use-cases">Row-Level Transformation in Postgres CDC: Use Cases</h2>
<p>There are multiple use cases that require row-level transformations during Postgres CDC. A few of the common scenarios include:</p>
<ol>
<li><p><strong>Masking PII Data</strong>: Replace sensitive PII with tokens or pseudonyms before data enters Kafka, obfuscating it from other micro-services in transactional outbox scenarios, thus enhancing privacy and compliance.</p>
</li>
<li><p><strong>Changing Data Format</strong>: Transform data into required formats like Protobuf, JSON, MsgPack, Avro and so on for seamless integration and optimized handling across systems.</p>
</li>
<li><p><strong>Generated Columns</strong>: Calculate new column values based on transformations of existing data, such as aggregations or derived metrics, and stream these new columns to enhance real-time data analysis and reporting.</p>
</li>
<li><p><strong>Unnesting JSONs</strong>: Extract elements from JSON objects and flatten them into separate fields within Kafka messages, improving the accessibility and usability of data across different consumer applications.</p>
</li>
<li><p><strong>Topic Routing:</strong> Distribute Change Data Capture (CDC) events to specific Kafka topics based on rules or conditions, facilitating targeted data streaming and processing.</p>
</li>
<li><p><strong>Data Encryption</strong>: Apply encryption to sensitive data before it is written to Kafka, enhancing security and preventing unauthorized access as data moves between systems.</p>
</li>
</ol>
<p>Now, let's see how some of the above use cases can be accomplished using PeerDB, through examples and sample Lua scripts.</p>
<h2 id="heading-sample-schema">Sample Schema</h2>
<p>I will be using the <code>users</code> table shown below to demonstrate the above use cases in PeerDB. This table includes various fields relevant to our testing scenarios.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">users</span> (
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    first_name <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
    last_name <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
    ssn <span class="hljs-built_in">CHAR</span>(<span class="hljs-number">11</span>),
    payload JSONB,
    salary_in_usd <span class="hljs-built_in">NUMERIC</span>(<span class="hljs-number">10</span>, <span class="hljs-number">2</span>)
);
</code></pre>
<h2 id="heading-masking-the-ssn-column"><strong>Masking the SSN Column</strong></h2>
<p>You can create a simple Lua script to mask the SSN column in the users table and add it as part of the MIRROR. See below:</p>
<pre><code class="lang-sql">local json = require 'json'

local function maskSSN(ssn)
    if not ssn then
        return nil
    <span class="hljs-keyword">end</span>
    <span class="hljs-comment">-- Replace all but the last four digits of the SSN with "XXX-XX-"</span>
    <span class="hljs-keyword">return</span> string.gsub(ssn, <span class="hljs-string">"^(.-)(%d%d%d%d)$"</span>, <span class="hljs-string">"XXX-XX-%2"</span>)
<span class="hljs-keyword">end</span>

<span class="hljs-keyword">local</span> <span class="hljs-keyword">function</span> RowToMap(<span class="hljs-keyword">row</span>)
    <span class="hljs-keyword">local</span> <span class="hljs-keyword">map</span> = peerdb.RowTable(<span class="hljs-keyword">row</span>)
    <span class="hljs-keyword">for</span> <span class="hljs-keyword">col</span>, val <span class="hljs-keyword">in</span> pairs(<span class="hljs-keyword">map</span>) <span class="hljs-keyword">do</span>
        <span class="hljs-keyword">local</span> kind = peerdb.RowColumnKind(<span class="hljs-keyword">row</span>, <span class="hljs-keyword">col</span>)
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">col</span> == <span class="hljs-string">'ssn'</span> <span class="hljs-keyword">then</span>
            <span class="hljs-comment">-- Apply the maskSSN function to the SSN column</span>
            <span class="hljs-keyword">map</span>[<span class="hljs-keyword">col</span>] = maskSSN(val)
        elseif kind == <span class="hljs-string">'bytes'</span> <span class="hljs-keyword">or</span> kind == <span class="hljs-string">'bit'</span> <span class="hljs-keyword">then</span>
            <span class="hljs-keyword">map</span>[<span class="hljs-keyword">col</span>] = json.bin(val)
        <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">map</span>
<span class="hljs-keyword">end</span>

<span class="hljs-keyword">local</span> RKINDMAP = {
    <span class="hljs-keyword">insert</span> = string.byte(<span class="hljs-string">'i'</span>, <span class="hljs-number">1</span>),
    <span class="hljs-keyword">update</span> = string.byte(<span class="hljs-string">'u'</span>, <span class="hljs-number">1</span>),
    <span class="hljs-keyword">delete</span> = string.byte(<span class="hljs-string">'d'</span>, <span class="hljs-number">1</span>),
}

<span class="hljs-keyword">function</span> onRecord(r)
    <span class="hljs-keyword">local</span> kind = RKINDMAP[r.kind]
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> kind <span class="hljs-keyword">then</span>
        <span class="hljs-keyword">return</span>
    <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">local</span> <span class="hljs-built_in">record</span> = {
        <span class="hljs-keyword">action</span> = kind,
        lsn = r.checkpoint,
        <span class="hljs-built_in">time</span> = r.commit_time,
        <span class="hljs-keyword">source</span> = r.source,
    }
    <span class="hljs-keyword">if</span> r.old <span class="hljs-keyword">then</span>
        record.old = RowToMap(r.old)
    <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">if</span> r.new <span class="hljs-keyword">then</span>
        record.new = RowToMap(r.new)
    <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">return</span> json.encode(<span class="hljs-built_in">record</span>)
<span class="hljs-keyword">end</span>
</code></pre>
<p>PeerDB offers a straightforward <a target="_blank" href="https://docs.peerdb.io/lua/reference">script editor</a> for creating the Lua script to define the transformation. After that, you can add this transformation <a target="_blank" href="https://docs.peerdb.io/mirror/cdc-pg-kafka">through the UI while creating the MIRROR</a>. See below demo for reference:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.loom.com/share/cd9acb9be8a943a3a0e1921bd49a41b9?sid=be164ff7-bd78-4844-a61b-0969a3e69bbc">https://www.loom.com/share/cd9acb9be8a943a3a0e1921bd49a41b9?sid=be164ff7-bd78-4844-a61b-0969a3e69bbc</a></div>
<p> </p>
<p>You can try this yourself in just 10 minutes by following this <a target="_blank" href="https://docs.peerdb.io/quickstart/streams-quickstart">Quickstart guide</a>.</p>
<h2 id="heading-changing-data-format-to-msgpack"><strong>Changing Data Format to MsgPack</strong></h2>
<p>For the <code>users</code> table mentioned above, to change the data format to <a target="_blank" href="https://msgpack.org/index.html">MsgPack</a>, you can use this <a target="_blank" href="https://github.com/PeerDB-io/examples/blob/main/msgpack.lua">example Lua script</a>. We've seen a few of our customers use MsgPack because it is more efficient than JSON: its compact binary format reduces data size and speeds up both data transmission and parsing.</p>
<p>See the 2-minute demo below, which shows how this is done with PeerDB.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.loom.com/share/2fbc4cf1aafc4ea08d2eeda0ffbc127a?sid=6381cdd2-7c86-4d6c-8911-4262496e9d45">https://www.loom.com/share/2fbc4cf1aafc4ea08d2eeda0ffbc127a?sid=6381cdd2-7c86-4d6c-8911-4262496e9d45</a></div>
<p> </p>
<h2 id="heading-generated-additional-columns"><strong>Generated Additional Columns</strong></h2>
<p>For the <code>users</code> table, let's say I want to add a new column, <code>salary_in_cad</code>, as part of the replication, which converts the salary from US dollars to Canadian dollars. You can create a script as shown in this <a target="_blank" href="https://github.com/PeerDB-io/examples/blob/main/msgpack.lua">example</a> (see below) and add it as part of the MIRROR. Below is a snippet of the Lua script that does this:</p>
<pre><code class="lang-sql">local json = require "json"

local function RowToMap(row)
    if not row then
        return
    <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">local</span> <span class="hljs-keyword">map</span> = peerdb.RowTable(<span class="hljs-keyword">row</span>)
    map.salary_in_cad = map.salary_in_usd * <span class="hljs-number">1.4</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">map</span>
<span class="hljs-keyword">end</span>

<span class="hljs-keyword">local</span> OPMAP = {
    <span class="hljs-keyword">insert</span> = <span class="hljs-string">"c"</span>,
    <span class="hljs-keyword">update</span> = <span class="hljs-string">"u"</span>,
    <span class="hljs-keyword">delete</span> = <span class="hljs-string">"d"</span>,
}

<span class="hljs-keyword">function</span> onRecord(<span class="hljs-built_in">record</span>)
    <span class="hljs-keyword">local</span> op = OPMAP[record.kind]
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> op <span class="hljs-keyword">then</span>
        <span class="hljs-keyword">return</span>
    <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">return</span> json.encode {
        op = op,
        <span class="hljs-keyword">before</span> = RowToMap(record.old),
        <span class="hljs-keyword">after</span> = RowToMap(record.new),
        commitms = record.commit_time.unix_milli,
        <span class="hljs-keyword">table</span> = record.source,
        lsn = record.checkpoint,
    }
<span class="hljs-keyword">end</span>
</code></pre>
<h2 id="heading-unnesting-the-payload-jsonb-column"><strong>Unnesting the Payload JSONB column</strong></h2>
<p>For the users table, let's say you want to unnest the payload JSONB column to separate fields in Kafka. You can create the script as shown in this <a target="_blank" href="https://github.com/PeerDB-io/examples/blob/main/unnest.lua">example</a> and add it as a part of the <a target="_blank" href="https://docs.peerdb.io/mirror/cdc-pg-kafka">MIRROR</a>.</p>
<h2 id="heading-distribute-load-of-the-users-table-across-topics">Distribute Load of the Users Table Across Topics</h2>
<p>For the users table, let's say you want to distribute the load across two Kafka topics, with odd IDs going to one topic and even IDs going to the other. You can create a script as shown in this <a target="_blank" href="https://github.com/PeerDB-io/examples/blob/main/topic_routing.lua">example</a> (see below) and add it as part of the <a target="_blank" href="https://docs.peerdb.io/mirror/cdc-pg-kafka">MIRROR</a>.</p>
<pre><code class="lang-sql">local bit32 = require "bit32"
local json = require "json"

local OPMAP = {
    <span class="hljs-keyword">insert</span> = <span class="hljs-string">"c"</span>,
    <span class="hljs-keyword">update</span> = <span class="hljs-string">"u"</span>,
    <span class="hljs-keyword">delete</span> = <span class="hljs-string">"d"</span>,
}

<span class="hljs-keyword">function</span> onRecord(<span class="hljs-built_in">record</span>)
    <span class="hljs-keyword">local</span> op = OPMAP[record.kind]
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> op <span class="hljs-keyword">then</span>
        <span class="hljs-keyword">return</span>
    <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">local</span> topic
    <span class="hljs-keyword">if</span> bit32.btest(record.row.id, <span class="hljs-number">1</span>) <span class="hljs-keyword">then</span>
        topic = <span class="hljs-string">"odd"</span>
    <span class="hljs-keyword">else</span>
        topic = <span class="hljs-string">"even"</span>
    <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">return</span> {
        topic = topic,
        <span class="hljs-keyword">value</span> = json.encode {
            op = op,
            <span class="hljs-keyword">before</span> = record.old,
            <span class="hljs-keyword">after</span> = record.new,
            commitms = record.commit_time.unix_milli,
            <span class="hljs-keyword">table</span> = record.source,
            lsn = record.checkpoint,
        }
    }
<span class="hljs-keyword">end</span>
</code></pre>
<h2 id="heading-why-we-chose-lua">Why we chose Lua?</h2>
<p>We had multiple options, including WASM, but we chose Lua as it provides a fine balance with respect to engineering velocity, integration with our Go-based platform, and end-user usability. Here are the key reasons for our decision:</p>
<ul>
<li><p><strong>In-process scripting:</strong> Lua supports in-process scripting, avoiding the serialization and deserialization steps required by external plugin systems.</p>
</li>
<li><p><strong>Simplicity and flexibility:</strong> Lua's straightforward design as a glue language makes it easy to embed in various projects, with multiple robust implementations available.</p>
</li>
<li><p><strong>Compatibility with Go:</strong> Lua works well with Go’s garbage collector, simplifying memory management compared to using alternatives like WASM, which would necessitate complex integration with Go's memory management.</p>
</li>
<li><p><strong>Ease of use for end-users:</strong> Lua is an embedded language that allows for on-the-fly scripting without the need for compilation or additional setup steps, unlike systems like Debezium that use Java.</p>
</li>
<li><p><strong>Long-standing presence and resources:</strong> Although Lua is a verbose language, its long-standing presence has resulted in a wealth of resources. This also enables LLM-based coding assistants to be quite accurate, helping users to easily script out row-level transformations.</p>
</li>
</ul>
<h3 id="heading-conclusion"><strong>Conclusion</strong></h3>
<p>We hope you enjoyed reading the blog. We think custom transformations offer a lot of added flexibility and enable many use cases. If you use Kafka, Pub-Sub, Redpanda, or any other queues and wish to replicate data from Postgres to these using PeerDB, please check out the links below or reach out to us directly!</p>
<ol>
<li><p><a target="_blank" href="https://docs.peerdb.io/quickstart/streams-quickstart"><strong>PeerDB Streams Quickstart</strong></a></p>
</li>
<li><p><a target="_blank" href="https://docs.peerdb.io/mirror/cdc-pg-kafka"><strong>Docs on Postgres to Kafka Replication</strong></a></p>
</li>
<li><p><a target="_blank" href="https://app.peerdb.cloud/"><strong>Try PeerDB Cloud for free</strong></a></p>
</li>
<li><p><a target="_blank" href="https://github.com/PeerDB-io/peerdb"><strong>Visit PeerDB's GitHub r</strong>epo<strong>sitory to Get Started</strong></a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[PeerDB Cloud is Now in Public Beta!]]></title><description><![CDATA[🚀 Today, we're excited to announce that PeerDB Cloud is officially entering public beta. If you're a data engineer or an organization looking for a fast, simple, and cost-effective way to replicate data from Postgres to data warehouses such as Snowf...]]></description><link>https://blog.peerdb.io/peerdb-cloud-is-now-in-public-beta</link><guid isPermaLink="true">https://blog.peerdb.io/peerdb-cloud-is-now-in-public-beta</guid><category><![CDATA[data-movement]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[ETL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[snowflake]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[kafka]]></category><category><![CDATA[redpanda]]></category><category><![CDATA[ELT]]></category><category><![CDATA[data pipeline]]></category><category><![CDATA[data-warehousing]]></category><category><![CDATA[replication]]></category><category><![CDATA[bigquery]]></category><dc:creator><![CDATA[Kaushik Iska]]></dc:creator><pubDate>Tue, 07 May 2024 15:51:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1715096989241/c1b9f9c6-16b5-439e-aaf7-5396402b3a9a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>🚀 Today, we're excited to announce that <a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a> is officially entering public beta. If you're a data engineer or an organization looking for a fast, simple, and cost-effective way to replicate data from Postgres to data warehouses such as Snowflake, BigQuery, and ClickHouse, or to queues such as Kafka, Redpanda, and Google PubSub, PeerDB Cloud is ready to serve you. If you want to be white-glove onboarded to PeerDB Cloud by the founder directly, you can <a target="_blank" href="https://calendly.com/sai-peerdb/peerdb-cloud-onboarding">book some time here</a>.</p>
<p>We've been operating PeerDB Cloud in Private Beta for the past three months. As the system has matured and we've had the privilege of serving a growing number of customers, we're thrilled to now launch it into Public Beta.</p>
<h3 id="heading-what-is-peerdb-cloud"><strong>What is PeerDB Cloud?</strong></h3>
<p><a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a> is the fully managed offering of PeerDB. It is the easiest way to get started with <a target="_blank" href="http://peerdb.io">PeerDB</a> — in just a couple of clicks, you can have a production-ready PeerDB instance provisioned and a worry-free approach to Postgres replication. PeerDB Cloud comes bundled with PeerDB's core features like:</p>
<h3 id="heading-peerdb-cloud-comes-with-all-peerdb-features">PeerDB Cloud comes with all PeerDB features</h3>
<ol>
<li><p><a target="_blank" href="https://blog.peerdb.io/"><strong>Postgres Change Data Capture</strong></a> with latencies of less than 1 minute for Data Warehouses and single-digit milliseconds for queues.</p>
</li>
<li><p><a target="_blank" href="https://docs.peerdb.io/features/supported-connectors"><strong>High-quality target connectors</strong></a>, for Data Warehouses such as Snowflake, BigQuery, Postgres, ClickHouse, etc., and Queues such as Kafka, Redpanda, Google PubSub, Azure Event Hubs, and so on.</p>
</li>
<li><p><strong>Blazing Fast Parallel Initial Loads and Re-syncs:</strong> PeerDB is <a target="_blank" href="https://blog.peerdb.io/benchmarking-postgres-replication-peerdb-vs-airbyte">10x faster</a> compared to other tools. You can move terabytes of data in a few hours instead of days.</p>
</li>
<li><p><strong>Streaming Query Replication</strong> for production-ready replication based on watermark columns.</p>
</li>
<li><p><strong>Web UI and Unique SQL Interface for ETL:</strong> Easily manage your data with our intuitive Web UI and SQL interface.</p>
</li>
<li><p>And <a target="_blank" href="http://github.com/PeerDB-io/peerdb">many more</a></p>
</li>
</ol>
<h3 id="heading-peerdb-cloud-is-fully-managed-0-capex-and-opex-costs">PeerDB Cloud is fully managed - 0 CAPEX and OPEX costs</h3>
<p>In addition, <strong>PeerDB Cloud provides a fully-managed production-ready experience supporting enterprise-grade features such as:</strong></p>
<ol>
<li><p><strong>High Availability (HA):</strong> Every PeerDB Cloud instance comes with HA. Under the hood, we have replica instances across Availability Zones and mechanisms for auto failover as needed.</p>
</li>
<li><p><strong>Horizontal Autoscaling:</strong> As your replication load increases, we have mechanisms to auto-scale compute resources as needed.</p>
</li>
<li><p><strong>In-Place/Transparent Upgrades:</strong> Enjoy hassle-free rolling upgrades with no downtime. This helps keep you up-to-date with all the latest features and ensures that you stay current with no extra effort.</p>
</li>
<li><p><strong>Advanced Logs and Metrics:</strong> Monitor your system effectively with detailed logs and metrics. OpenTelemetry endpoint support is coming soon.</p>
</li>
<li><p><strong>Privacy and Security:</strong> Privacy and security are our top priorities at PeerDB Cloud, surpassing even performance and functionality. We offer SSH tunneling for secure connections and ensure encryption at rest and in transit. Our platform is GDPR compliant and is currently undergoing a SOC2 audit (2 months in), with compliance expected by mid-June. Here is our <a target="_blank" href="https://trust.peerdb.io/">trust report</a>.</p>
</li>
<li><p><strong>Dedicated Slack Channel &amp; Support SLAs:</strong> Every PeerDB Cloud customer gets a dedicated Slack channel for expert guidance during implementation, migration, and post-production support. PeerDB Cloud also comes with the Support SLAs below:</p>
<ol>
<li><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1715062042583/ab1914ce-d82b-484e-9d9a-bf6ec3cc5728.png" alt class="image--center mx-auto" /></li>
</ol>
</li>
</ol>
<h3 id="heading-save-up-to-5x-costs-and-predictable-pricing">Save up to 5x costs and Predictable Pricing</h3>
<p>Being laser-focused on Postgres replication, we have implemented multiple Postgres-native and infrastructural features to optimize costs. Our <a target="_blank" href="https://blog.peerdb.io/moving-a-billion-postgres-rows-on-a-100-budget">white paper</a> provides a detailed summary of all these optimizations. With this, we are able to save up to 5x in costs for our customers compared to other tools. The graph below shows how PeerDB compares to other data-movement tools (<a target="_blank" href="https://docs.google.com/spreadsheets/d/1U03oxmfZtY8TkKnEwrWcr9o-lRUF-fpEzE5MWD391MI/edit#gid=0">reference</a>).</p>
<p>In addition to this, PeerDB Cloud provides a predictable <a target="_blank" href="https://www.peerdb.io/#prices">pricing model</a>. Instead of charging based on the number of rows or the amount of data moved, you just pay for the vCPUs provisioned. This ensures that as your workload scales, your costs don't skyrocket.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1715060884659/0034ac4e-4341-4aca-ba25-b8b31d19f241.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-current-metrics">Current Metrics</h3>
<p>These behind-the-scenes metrics for PeerDB Cloud showcase our progress and reinforce our confidence in launching it to Public Beta.</p>
<p><strong>Volume of data moved in PeerDB cloud:</strong> PeerDB Cloud already serves 10+ production customers and is replicating 20TB of data from Postgres every week, amounting to approximately 100TB per month. The graph below shows the day-over-day growth over the past week. Note that the graph below captures Avro compressed data; if uncompressed, the volume would be significantly higher, ranging anywhere from 200 to 400TB of uncompressed data moved per month.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1715003521626/67f1e3ad-bf04-451e-b411-ad9c34a9326d.png" alt class="image--center mx-auto" /></p>
<p><strong>Sign-ups for PeerDB Cloud:</strong> Signups have grown at a consistent pace over the past few months, increasing by 100% month over month.</p>
<h3 id="heading-our-customers"><strong>Our Customers</strong></h3>
<p>In just three months, we have over 10 production customers using PeerDB Cloud for production-grade Postgres replication. Our customers are spread across various verticals including Fintech, IoT, Retail, Sales Marketing Automation, and more. Below is a snapshot of a few of our publicly referenceable customers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1715004122364/e6263fc6-9b9b-4400-8e65-b4b96d02a8dd.png" alt class="image--center mx-auto" /></p>
<p>Here is a link to our <a target="_blank" href="https://www.peerdb.io/customers">customer stories</a>. They demonstrate how our customers were able to achieve 10 times faster and up to 5 times cheaper Postgres replication experiences with PeerDB. Here are a few testimonials:</p>
<p><em>"Our decision to choose PeerDB was reaffirmed by their comprehensive online resources, which instilled confidence in their expertise. Not only did they help us cut costs effectively, but their unparalleled customer service provided immediate and insightful assistance, making us feel supported and empowered in managing our PostgreSQL database."<strong>**- Sang Mercado, Head of Engineering, Harmonic AI</strong></em></p>
<p><em>“We’re using this connector already for our Postgres to ClickHouse ETL and it’s insanely fast and accurate! Can’t believe how well this works. The PeerDB team has been super helpful in getting us set up, helping us debug, and advising us on everything related to ClickHouse and Postgres. Great work guys!!”<strong>**- Neel Mehta, CTO of Fiber AI</strong></em></p>
<h3 id="heading-growing-with-postgres"><strong>Growing with Postgres 📈</strong></h3>
<p>Postgres has solidified itself as one of the most popular developer databases ever created. While MySQL lost some appeal after being acquired by Oracle, Postgres continues to earn developer trust. In 2023, it topped the Stack Overflow Developer Survey, and was named DBMS of the Year by DB-Engines.</p>
<p>At PeerDB, we believe that Postgres is going to become the database of the world. We are dedicated to contributing to this vision by making it effortless for any Postgres user to implement use cases involving data movement and ETL for Postgres. Providing a fully managed Postgres replication experience through PeerDB Cloud is a step in that direction.</p>
<h3 id="heading-future-roadmap"><strong>Future Roadmap</strong></h3>
<p>We're committed to continuous improvement and have exciting new features in development:</p>
<ul>
<li><p>OpenTelemetry endpoint to integrate with your own monitoring tools such as DataDog, PagerDuty, OpsGenie, and more.</p>
</li>
<li><p>WebHooks and REST API integration to create and manage PEERs and MIRRORs.</p>
</li>
<li><p>Expanding from AWS to other clouds, including GCP and Azure.</p>
</li>
<li><p>Support for private link to securely connect to your VPC. This feature is currently in private preview.</p>
</li>
<li><p>PeerDB Cost Analysis - Full visibility into your Data Warehouse costs and how to optimize them.</p>
</li>
</ul>
<h2 id="heading-join-the-peerdb-cloud-community">Join the PeerDB Cloud Community!</h2>
<p>💡 Ready to see what PeerDB Cloud can do for your data?</p>
<ul>
<li><p><a target="_blank" href="https://app.peerdb.cloud/"><strong>Sign-Up</strong></a> for PeerDB Cloud's public beta today.</p>
</li>
<li><p><a target="_blank" href="https://calendly.com/sai-peerdb/30min"><strong>Book a Chat</strong></a> with PeerDB founders to discuss how PeerDB can transform your data strategy.</p>
</li>
<li><p><a target="_blank" href="http://github.com/PeerDB-io/peerdb">Star our GitHub repo</a></p>
</li>
<li><p><a target="_blank" href="https://join.slack.com/t/peerdb-public/shared_invite/zt-1wo9jydev-EXInbMtCtpAKFFWdi7QvLQ"><strong>Join our Slack Channel</strong></a> to connect with the PeerDB community.</p>
</li>
</ul>
<p>PeerDB Cloud is poised to reshape how you manage and replicate your data. Start your journey with us now!</p>
]]></content:encoded></item><item><title><![CDATA[PeerDB Streams - Simple, Native Postgres Change Data Capture]]></title><description><![CDATA[We spent the past 7 months building a solid experience to replicate data from Postgres to Data Warehouses such as Snowflake, BigQuery, ClickHouse and Postgres.
Now, we want to expand and bring a similar experience for Queues. With that spirit, we are...]]></description><link>https://blog.peerdb.io/peerdb-streams-simple-native-postgres-change-data-capture</link><guid isPermaLink="true">https://blog.peerdb.io/peerdb-streams-simple-native-postgres-change-data-capture</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[change data capture]]></category><category><![CDATA[kafka]]></category><category><![CDATA[kafka topic]]></category><category><![CDATA[ETL]]></category><category><![CDATA[ELT]]></category><category><![CDATA[postgres]]></category><category><![CDATA[Databases]]></category><category><![CDATA[replication]]></category><category><![CDATA[streaming]]></category><category><![CDATA[debezium]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Mon, 06 May 2024 16:21:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1714852792511/699964b9-06b4-499a-a0d6-ce71ef4138d2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We spent the past 7 months building a solid experience to replicate data from Postgres to Data Warehouses such as Snowflake, BigQuery, ClickHouse and Postgres.</p>
<p>Now, we want to expand and bring a similar experience for Queues. With that spirit, we are excited to announce <strong>PeerDB Streams</strong>. PeerDB Streams provides a simple and native way to replicate changes as they happen in Postgres to Queues / message brokers such as Kafka, Redpanda, Google PubSub, Azure Event Hubs, and so on. Under the hood, PeerDB Streams uses Postgres <a target="_blank" href="https://www.postgresql.org/docs/current/logicaldecoding-explanation.html">logical decoding</a> to enable Postgres Change Data Capture (CDC).</p>
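<p>For context, the sketch below shows what logical decoding looks like when exercised by hand from <code>psql</code> with the built-in <code>test_decoding</code> plugin (it assumes <code>wal_level = logical</code> and a hypothetical <code>orders</code> table). It is only meant to illustrate the mechanism PeerDB builds on; PeerDB creates and manages replication slots and decoding for you.</p>
<pre><code class="lang-sql">-- Illustration only: exercising logical decoding by hand with test_decoding.
-- Requires wal_level = logical; the orders table is a hypothetical example.
SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

INSERT INTO public.orders (id, amount) VALUES (1, 42.0);

-- Read the changes the slot has decoded since it was created.
SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Drop the slot when done, otherwise it keeps retaining WAL.
SELECT pg_drop_replication_slot('demo_slot');
</code></pre>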
<h1 id="heading-the-problem">The Problem</h1>
<p>We selected Queues as our next target because we've heard from multiple Postgres users that existing CDC tools are complex and have a significant learning curve. <a target="_blank" href="https://github.com/debezium/debezium/">Debezium</a> is the most common technology for this use-case. It is proven and has large production usage. However, a common pain point among our users is that Debezium has a significant learning curve and requires institutional knowledge to set up and manage in production. It takes a few months to fully deploy Debezium in production. A few common issues from users include -</p>
<ol>
<li><p>Interacting through a command line interface or configuration files, understanding the various options / settings, and learning best practices for running Debezium in production requires a significant learning curve. Debezium UI, released to <a target="_blank" href="https://debezium.io/blog/2020/10/22/towards-debezium-ui/">address usability concerns</a>, is still in an <a target="_blank" href="https://debezium.io/documentation/reference/stable/operations/debezium-ui.html">incubating state</a>. Additionally, reading Debezium docs/resources to get started can be <a target="_blank" href="https://medium.com/@cooper.wolfe/i-hated-debezium-so-much-i-did-it-myself-b43b0efc20a9">overwhelming</a> and not the most approachable.</p>
</li>
<li><p>Supporting data formats (e.g., MsgPack) and transformations is not trivial and incurs an additional <a target="_blank" href="https://stackoverflow.com/questions/71381819/make-a-custom-transform-for-kafka-cdc-and-debezium">learning curve</a>. You need to write a Java project, build JAR packages, and set up a plugin path for Kafka Connect. It isn’t as simple as plugging in a premade template or writing a few lines of code.</p>
</li>
<li><p>Debezium is not as native as Kafka for other types of message brokers and does not offer the same level of configurability. For example, with Event Hubs, it is difficult to define custom partitioning schemes and stream to topics spread across namespaces and subscriptions.</p>
</li>
</ol>
<p><strong>TL;DR</strong> We believe that Debezium aims to provide a comprehensive experience for engineers to implement CDC rather than making it dead simple for them. So you can do a lot with Debezium but need to know a lot about it.</p>
<h1 id="heading-peerdb-streams-simple-native-postgres-change-data-capture-cdc"><strong>PeerDB Streams - Simple, Native Postgres Change Data Capture (CDC)</strong></h1>
<p>This is what we want to address with PeerDB. We are building a Simple, yet Comprehensive experience for Postgres Change Data Capture (CDC). The goal is to enable engineers to implement production-grade Postgres CDC with a minimal learning curve, within a few days.</p>
<p>PeerDB’s feature-set isn't at Debezium's level yet, and as PeerDB evolves, we might face similar usability challenges. However, we're putting Simplicity/Usability at the forefront and we believe that we can achieve the above goal. Here is how we are doing it –</p>
<h2 id="heading-simple-postgres-cdc-using-peerdb-ui"><strong>Simple Postgres CDC Using PeerDB UI</strong></h2>
<p>First and foremost, PeerDB offers a simple UI to set up source and target data sources (such as Postgres and Kafka) by creating PEERs and initiating CDC by creating a MIRROR.</p>
<p>Through the UI, users can monitor the progress of CDC, including throughput (per table) and latency; search through logs; set up alerts to Slack or Email based on replication slot growth; investigate Postgres-specific metrics, including slot size, wait events for replication, and more. The UI also offers advanced features, including tuning MIRRORs, pausing MIRRORs, adding tables to MIRRORs, and more. We have strived to make these features as intuitive as possible for users, for example, by using information toolbars and simple language. Below is a demo showing the PeerDB UI in action. Here is a <a target="_blank" href="https://docs.peerdb.io/quickstart/streams-quickstart">link</a> to the quick start for you to try PeerDB Streams in just a few minutes.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.loom.com/share/ebcfb7646a1e48738835853b760e5d04?sid=a50b2865-48df-4ba7-94d4-631c2a778464">https://www.loom.com/share/ebcfb7646a1e48738835853b760e5d04?sid=a50b2865-48df-4ba7-94d4-631c2a778464</a></div>
<p> </p>
<h2 id="heading-enhanced-cli-experience-intuitive-sql-layer-for-managing-postgres-cdc"><strong>Enhanced CLI Experience: Intuitive SQL Layer for Managing Postgres CDC</strong></h2>
<p>Second, for users who prefer a CLI over the UI, we provide a Postgres-compatible SQL layer to initiate and manage CDC. This SQL layer offers the same level of comprehensiveness as the UI and we believe that it is far more intuitive and user-friendly compared to bash scripts and configuration files.</p>
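<p>For a flavor of what this looks like, below is a hypothetical session against a SQL layer of this kind: register the source and the target as PEERs, then start CDC by creating a MIRROR. The statement shapes and option names here are illustrative assumptions, not the exact PeerDB syntax; please refer to the PeerDB docs for the real commands.</p>
<pre><code class="lang-sql">-- Hypothetical sketch only: option names and exact syntax are assumptions;
-- see the PeerDB docs for the actual PEER / MIRROR commands.
CREATE PEER pg_source FROM POSTGRES WITH (
  host = 'db.example.com', port = '5432',
  user = 'replicator', password = '********', database = 'app'
);

CREATE PEER kafka_target FROM KAFKA WITH (
  servers = 'broker-1:9092'
);

CREATE MIRROR orders_cdc FROM pg_source TO kafka_target
WITH TABLE MAPPING (public.orders:orders_topic);
</code></pre>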
<h2 id="heading-simple-lua-scripts-for-row-level-transformations"><strong>Simple Lua Scripts for Row-Level Transformations</strong></h2>
<p>Third, users can perform row-level transformations before streaming CDC changes to Kafka. They can write Lua scripts to execute these transformations. This enables powerful features such as encrypting/masking personally identifiable information (PII), supporting various data formats (JSON, MsgPack, Flatbuffers, Protobuf, etc.), and more. To make it very simple for users, we offer a script editor along with a bunch of useful <a target="_blank" href="https://github.com/PeerDB-io/examples">templates</a>. Additionally, applying a transformation is optional, with the default data format being JSON.</p>
<h2 id="heading-native-connectors-to-non-kafka-targets">Native Connectors to non-Kafka targets</h2>
<p>Fourth, we offer <strong>native</strong> connectors to non-Kafka targets, including Google Pub/Sub and Azure Event Hubs. Behind the scenes, we utilize the native Go APIs/libraries provided by these services to build our connectors, instead of relying on the less <a target="_blank" href="https://github.com/Azure/azure-event-hubs-for-kafka?tab=readme-ov-file#other-issues">developed</a> Kafka-compatible layer over these queues. We support advanced features specific to these services. For example, with <a target="_blank" href="https://blog.peerdb.io/enterprise-grade-replication-from-postgres-to-azure-event-hubs">Azure Event Hubs</a>, users can perform CDC to topics distributed across different namespaces and subscriptions.</p>
<h3 id="heading-peerdb-streams-is-postgres-native">PeerDB Streams is Postgres Native</h3>
<p>Finally, we are laser-focused on Postgres and, as of now, don't support any other databases. This allows us to implement many Postgres-native optimizations. For example, we provide Postgres-native metrics and alerts, including replication slot growth, wait events for logical decoding, number of connections and so on. Features such as parallel snapshotting for 10x faster initial loads and decoding in-flight transactions are in private beta.</p>
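<p>To give a sense of the Postgres-native signals involved, the generic queries below show how replication slot growth and walsender wait events can be inspected directly in Postgres; PeerDB surfaces similar information through its metrics and alerts.</p>
<pre><code class="lang-sql">-- How much WAL each replication slot is retaining (a growing value means the
-- consumer is falling behind), and what the walsender processes are waiting on.
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'walsender';
</code></pre>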
<h2 id="heading-try-peerdb-streams">Try PeerDB Streams</h2>
<p>Check out this 10-minute <a target="_blank" href="https://docs.peerdb.io/quickstart/streams-quickstart">quickstart</a> to try PeerDB for Postgres CDC to Kafka.</p>
<p>Separately, you can try PeerDB through one of three offerings: <a target="_blank" href="https://github.com/PeerDB-io/peerdb">Open Source offering</a>, <a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud, our fully managed service</a>, and a self-hosted <a target="_blank" href="https://www.peerdb.io/sign-up">enterprise offering</a> that includes production-grade Helm charts.</p>
<p>Our vision is to provide the world’s best data-movement experience for Postgres. PeerDB Streams is another step in that direction. We built PeerDB Streams in close design partnership with a few Fintech and IoT customers implementing Postgres CDC for their transactional outbox use cases. The product has been battle-tested at scale and is constantly evolving. We would love to get your feedback on product experience, our thesis and anything else that comes to your mind. It would be super useful for us. Thank you!</p>
]]></content:encoded></item><item><title><![CDATA[PeerDB Launch Week]]></title><description><![CDATA[It will be almost 1 year since PeerDB started the YC Summer 23 program. To celebrate this, we challenged ourselves: how could we make a truly impactful announcement? 🤔
The answer wasn't one feature, but an entire week bursting with launches! 🚀
Even...]]></description><link>https://blog.peerdb.io/peerdb-launch-week-1</link><guid isPermaLink="true">https://blog.peerdb.io/peerdb-launch-week-1</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[PeerDB]]></category><category><![CDATA[launch week]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Kaushik Iska]]></dc:creator><pubDate>Fri, 03 May 2024 17:31:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1714756659295/c88c9289-ceb6-4352-b07c-d1ea65b6e4f8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It will be almost 1 year since PeerDB started the YC Summer 23 program. To celebrate this, we challenged ourselves: how could we make a truly impactful announcement? 🤔</p>
<p>The answer wasn't one feature, but an entire week bursting with launches! 🚀</p>
<p>Even with our small team, we knew this was ambitious. But after months of relentless work, we're thrilled to announce...</p>
<p>PeerDB Launch Week begins on <strong>Monday, May 6th</strong>! ✨</p>
<h3 id="heading-what-to-expect">What to expect?</h3>
<p>We don't want to spoil the surprise, so for now we are giving you a teaser of what to expect. Follow us on Twitter / X to keep up to date: <a target="_blank" href="https://twitter.com/PeerDBInc">@PeerDBInc</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Simple Postgres to ClickHouse replication featuring MinIO]]></title><description><![CDATA[At PeerDB, we provide a fast and cost-effective way to replicate data from Postgres to Data Warehouses such as Snowflake, BigQuery, ClickHouse, and queues like Kafka, Red Panda and Google PubSub, among others.
A few months ago, we added a ClickHouse ...]]></description><link>https://blog.peerdb.io/simple-postgres-to-clickhouse-replication-featuring-minio</link><guid isPermaLink="true">https://blog.peerdb.io/simple-postgres-to-clickhouse-replication-featuring-minio</guid><category><![CDATA[ClickHouse]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[#minio]]></category><category><![CDATA[S3]]></category><category><![CDATA[S3-bucket]]></category><category><![CDATA[miniobucket]]></category><category><![CDATA[ETL]]></category><category><![CDATA[replication]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[change data capture]]></category><category><![CDATA[postgres]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Thu, 02 May 2024 17:55:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1714627263834/5b1077b4-24e0-44b1-bfd8-5f9faf731f8c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At <a target="_blank" href="https://www.peerdb.io/">PeerDB</a>, we provide a fast and cost-effective way to replicate data from <a target="_blank" href="https://www.postgresql.org/docs/">Postgres</a> to Data Warehouses such as <a target="_blank" href="https://www.snowflake.com/en/">Snowflake</a>, <a target="_blank" href="https://cloud.google.com/bigquery?utm_source=google&amp;utm_medium=cpc&amp;utm_campaign=na-US-all-en-dr-bkws-all-all-trial-e-dr-1707554&amp;utm_content=text-ad-none-any-DEV_c-CRE_665665924750-ADGP_Hybrid+%7C+BKWS+-+MIX+%7C+Txt-Data+Analytics-BigQuery-KWID_43700077225652815-kwd-47616965283&amp;utm_term=KW_bigquery-ST_bigquery&amp;gad_source=1&amp;gclid=CjwKCAjw88yxBhBWEiwA7cm6pVY6no_rLOIpdwp02v4Oa3S5UbPpKAHIFybbrJH-X3nX5PHS23brUBoCU0oQAvD_BwE&amp;gclsrc=aw.ds&amp;hl=en">BigQuery</a>, <a target="_blank" href="https://clickhouse.com/">ClickHouse</a>, and queues like Kafka, Red Panda and Google PubSub, among others.</p>
<p>A few months ago, we added a <a target="_blank" href="https://blog.peerdb.io/postgres-to-clickhouse-real-time-replication-using-peerdb">ClickHouse connector</a> for Postgres Change Data Capture (CDC). Surprisingly, this connector gained substantial traction and adoption within our community. This applies to both our fully managed service (PeerDB Cloud) and our Open Source offerings. Here is a <a target="_blank" href="https://www.peerdb.io/customers/peerdb-fiber-ai-customer-story">customer story</a> from one of our customers who uses the ClickHouse connector.</p>
<h2 id="heading-the-problem">The Problem</h2>
<p>However, there was one common piece of feedback from many of our Open Source users. The ClickHouse connector required an S3 bucket as a prerequisite, which added additional overhead for users. Non-AWS users and those without immediate access to S3 could not use the ClickHouse connector. This wasn't a problem in our fully managed offering (<a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a>), as we abstracted away the S3 bucket creation from our customers.</p>
<p>This blog describes how we solved this problem and made it extremely easy for our users to replicate data from Postgres to ClickHouse. We used <a target="_blank" href="https://min.io/">MinIO</a>, the open source S3 alternative, to stage the intermediary Avro files as part of the Change Data Capture (CDC) from Postgres to ClickHouse.</p>
<h2 id="heading-why-does-the-clickhouse-connector-need-s3">Why does the ClickHouse connector need S3?</h2>
<p>Under the hood, PeerDB uses the <a target="_blank" href="https://blog.peerdb.io/moving-a-billion-postgres-rows-on-a-100-budget#heading-data-in-transit">Avro format</a> for data in transit while replicating data from Postgres to Data Warehouses. Loading Avro files through Go wasn't trivial as the clickhouse-go driver didn't support Avro ingestion. Additionally, ClickHouse has native integration for loading data from S3 and is very efficient at it, as it attempts to parallelize as much work as possible, processing files in a streaming fashion. Therefore, we chose to use S3 as an intermediary storage for Avro files before importing them into ClickHouse.</p>
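<p>As a flavor of the mechanism, ClickHouse can ingest staged Avro files straight from S3-compatible storage with its <code>s3</code> table function. The statement below is illustrative only (bucket URL, credentials, and table name are placeholders), not the exact statement PeerDB issues:</p>
<pre><code class="lang-sql">-- Illustrative only: bucket URL, credentials, and table name are placeholders.
-- ClickHouse reads the matching Avro files and streams them into the table.
INSERT INTO analytics.events
SELECT *
FROM s3(
  'https://my-bucket.s3.amazonaws.com/peerdb-staging/*.avro',
  'ACCESS_KEY_ID', 'SECRET_ACCESS_KEY',
  'Avro'
);
</code></pre>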
<p>This method has proven effective, allowing <a target="_blank" href="https://www.peerdb.io/customers/peerdb-fiber-ai-customer-story">users</a> to efficiently replicate data from Postgres to ClickHouse with latencies under 30 seconds and high throughput rates.</p>
<h2 id="heading-minio-helps-make-the-peerdbs-clickhouse-connector-seamless">MinIO helps make the PeerDB's ClickHouse Connector Seamless</h2>
<p>By integrating MinIO container services into our <a target="_blank" href="https://github.com/PeerDB-io/peerdb/blob/main/docker-compose.yml#L189">Docker Compose files</a> for our Open Source offering, we've enabled an in-house S3-compatible storage solution that launches seamlessly with PeerDB. PeerDB uses <a target="_blank" href="https://github.com/PeerDB-io/peerdb/blob/main/docker-compose.yml#L4C3-L4C54">environment variables</a> to manage S3 bucket credentials, allowing for easy integration. Users can set these variables to match the MinIO bucket parameters, or they can plug in their own S3 bucket details. These parameters <a target="_blank" href="https://github.com/PeerDB-io/peerdb/blob/main/docker-compose.yml#L4C3-L4C54">default</a> to the packaged MinIO bucket parameters. As a result, users no longer need to provide a separate bucket for PeerDB’s ClickHouse integration, simplifying the setup process significantly.</p>
<p>A huge shoutout to MinIO for building a solid product that serves as an open source alternative to S3. Integrating MinIO's Docker container within PeerDB's Docker file was a one-week project. MinIO's APIs, being fully compatible with S3, allowed for seamless integration with PeerDB and ClickHouse.</p>
<h2 id="heading-result-even-simpler-postgres-to-clickhouse-replication-with-peerdb">Result: Even simpler Postgres to ClickHouse replication with PeerDB.</h2>
<h3 id="heading-simplifying-clickhouse-peer-creation-with-optional-s3-configuration">Simplifying ClickHouse Peer Creation with Optional S3 Configuration</h3>
<p>Integrating the MinIO Docker Container in our Open Source offering eliminates the need for users to specify S3 buckets to use our ClickHouse connector. While creating the <a target="_blank" href="https://docs.peerdb.io/connect/clickhouse">ClickHouse Peer</a>, adding S3 information is optional, as shown in the screenshot below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714519914016/13d54fe7-12bc-4e9c-a1b7-f2274343f781.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-set-up-a-postgres-to-clickhouse-mirror-in-under-a-minute">Set Up a Postgres to ClickHouse Mirror in Under a Minute</h3>
<p>Once the <a target="_blank" href="https://docs.peerdb.io/connect/postgres/rds_postgres">Postgres</a> and <a target="_blank" href="https://docs.peerdb.io/connect/clickhouse">ClickHouse Peers</a> are created, users can create MIRRORs to replicate data from Postgres to ClickHouse within a minute. See below video:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.loom.com/share/fa1afec884724876a63aab522b40e445?sid=7d5383ed-0c51-4018-8920-3d8e95ad4c56">https://www.loom.com/share/fa1afec884724876a63aab522b40e445?sid=7d5383ed-0c51-4018-8920-3d8e95ad4c56</a></div>
<p> </p>
<h3 id="heading-use-the-minio-console-for-complete-visibility-into-internal-staging">Use the MinIO Console for complete visibility into internal staging</h3>
<p>MinIO also comes with a sleek UI that helps you manage the internal Avro files PeerDB creates as part of the replication process.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.loom.com/share/b41d3ad81259407f9a99b7a74c8f1449?sid=2766cebd-20d9-493d-a95a-b8852c9c30b9">https://www.loom.com/share/b41d3ad81259407f9a99b7a74c8f1449?sid=2766cebd-20d9-493d-a95a-b8852c9c30b9</a></div>
<p> </p>
<p>We hope you enjoyed reading the blog. If you're a ClickHouse user and wish to replicate data from Postgres to ClickHouse using PeerDB, please check out the links below or reach out to us directly!</p>
<ol>
<li><p><a target="_blank" href="https://docs.peerdb.io/mirror/cdc-pg-clickhouse"><strong>Docs on Postgres to ClickHouse Replication.</strong></a></p>
</li>
<li><p><a target="_blank" href="https://app.peerdb.cloud/"><strong>Try PeerDB Cloud for free.</strong></a></p>
</li>
<li><p><a target="_blank" href="https://app.peerdb.cloud/"><strong>Visit PeerDB's Gi</strong></a><a target="_blank" href="https://github.com/PeerDB-io/peerdb"><strong>tHub r</strong></a>epo<a target="_blank" href="https://app.peerdb.cloud/"><strong>sitory to Get Started</strong></a><a target="_blank" href="https://github.com/PeerDB-io/peerdb"><strong>.</strong></a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[How can we make pg_dump and pg_restore 5 times faster?]]></title><description><![CDATA[pg_dump and pg_restore are reliable tools for backing up and restoring Postgres databases. They're essential for database migrations, disaster recovery and so on. They offer precise control over object selection for backup/restore, dump format option...]]></description><link>https://blog.peerdb.io/how-can-we-make-pgdump-and-pgrestore-5-times-faster</link><guid isPermaLink="true">https://blog.peerdb.io/how-can-we-make-pgdump-and-pgrestore-5-times-faster</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[postgres]]></category><category><![CDATA[Databases]]></category><category><![CDATA[migration]]></category><category><![CDATA[ETL]]></category><category><![CDATA[replication]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Thu, 25 Apr 2024 16:16:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1714061341343/dc949bbd-96be-4479-88ae-95e33ac2fc48.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://www.postgresql.org/docs/current/app-pgdump.html">pg_dump</a> and <a target="_blank" href="https://www.postgresql.org/docs/current/app-pgrestore.html">pg_restore</a> are reliable tools for backing up and restoring <a target="_blank" href="https://www.postgresql.org/">Postgres</a> databases. They're essential for database migrations, disaster recovery and so on. They offer precise control over object selection for backup/restore, dump format options (plain or compressed), parallel table processing and so on. They ensure a consistent database snapshot is dumped and restored.</p>
<p>However, they are single-threaded at the table level. This significantly slows down the dump and restore of databases with a star schema common in real-world applications such as Time series and IoT. For databases over 1 TB, <code>pg_dump</code> and <code>pg_restore</code> can take days, increasing downtime during migrations and RTOs in disaster recovery scenarios.</p>
<p>In this blog, we'll discuss an idea called <strong>"Parallel Snapshotting"</strong>. This idea could be integrated into Postgres upstream in the future to make <code>pg_dump</code> and <code>pg_restore</code> parallelizable at a single table level. Parallel Snapshotting has already been implemented in <a target="_blank" href="https://github.com/PeerDB-io/peerdb">PeerDB</a>, an open-source Postgres replication tool. We will also cover a few interesting benchmarks of migrating a large table of 1.5TB from one Postgres Database to another with and without Parallel Snapshotting.</p>
<h2 id="heading-a-quick-primer-on-pgdump-and-pgrestore">A quick primer on pg_dump and pg_restore</h2>
<p><code>pg_dump</code> is the most reliable way to back up a PostgreSQL database. It enables the backup of a database at a consistent snapshot; that is, the backup guarantees a state that existed previously. The backup generated by <code>pg_dump</code> is a logical representation of the data in PostgreSQL, not a copy of the PostgreSQL data directory. It captures objects as they appear in PostgreSQL.</p>
<p><code>pg_restore</code> is the most reliable way to restore a backup generated by pg_dump from one PostgreSQL database to another.</p>
<p>Both <code>pg_dump</code> and <code>pg_restore</code> are Postgres-native; that is, they come packaged with community Postgres and can be used as command-line utilities, similar to <a target="_blank" href="https://www.postgresql.org/docs/current/app-psql.html">psql</a>.</p>
<h3 id="heading-pgdump-and-pgrestore-offer-fine-grain-control">pg_dump and pg_restore offer fine grain control</h3>
<p>They provide fine-grained control to manage the backup and restore processes. Below are a few flags that are commonly used:</p>
<ol>
<li><p>You have the <code>-F</code> (format) flag, which lets you choose the dump format, such as plain text or the compressed custom format. Compressed dumps are quite helpful when you have limited network bandwidth or want to save on network costs.</p>
</li>
<li><p>To speed up the backup and restore process, you can use the <code>-j</code> flag to dump and restore multiple tables in parallel.</p>
</li>
<li><p>You can pick and choose specific database objects you want to backup and restore, including tables and schemas.</p>
</li>
<li><p>You can also choose to dump only the schema or only the data using the <code>--schema-only</code> and <code>--data-only</code> flags.</p>
</li>
<li><p>There are many more flags that they provide that can be found in community <a target="_blank" href="https://www.postgresql.org/docs/current/app-pgdump.html">docs</a>.</p>
</li>
</ol>
<h3 id="heading-pgdump-and-pgrestore-can-be-very-slow-for-large-tables">pg_dump and pg_restore can be very slow for large tables</h3>
<p><strong>pg_dump and pg_restore are single threaded at a table level</strong></p>
<p>There is a painful issue that users often run into with <code>pg_dump</code> and <code>pg_restore</code>: they can be very slow for large tables. This is because they are single-threaded at the table level. They can dump and restore multiple tables in parallel, but for a single table they are single-threaded.</p>
<p>This means that in use cases where you have a single fact table and multiple dimension tables, <code>pg_dump</code> and <code>pg_restore</code> can get bottlenecked on the large fact table. This is very common in the star schema data model, which is used by many real-world use cases such as IoT, time series, and data warehousing.</p>
<h3 id="heading-migrating-a-15tb-table-can-take-15-days">Migrating a 1.5TB table can take 1.5 days</h3>
<p>The impact of the problem described above can be significant. Using <code>pg_dump</code> and <code>pg_restore</code> to migrate a 1.5 TB <a target="_blank" href="https://www.postgresql.org/docs/current/pgbench.html">pgbench_accounts</a> table from one Postgres database to another took 1.5 days. The benchmark was conducted under optimal conditions, i.e., using the correct flags and co-locating the source, the target, and the VM on which <code>pg_dump</code> and <code>pg_restore</code> were running in the same region, among other factors. This 1.5-day downtime is substantial when migrating or recovering mission-critical databases.</p>
<h2 id="heading-parallel-snapshotting-to-make-pgdump-amp-pgrestore-multi-threaded-per-table">Parallel Snapshotting to make pg_dump &amp; pg_restore multi-threaded per table</h2>
<p>Now, let's explore a concept called Parallel Snapshotting, which could make <code>pg_dump</code> and <code>pg_restore</code> multi-threaded at the single table level. Note that Parallel Snapshotting is not currently implemented in the upstream versions of <code>pg_dump</code> and <code>pg_restore</code>. It represents an idea/design that could enhance <code>pg_dump</code> and <code>pg_restore</code> in the future.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/5VbJcSsK0OM">https://youtu.be/5VbJcSsK0OM</a></div>
<p> </p>
<p>The video above captures migrating 5 GB of data from Postgres to Postgres within a minute using the Parallel Snapshotting feature in PeerDB.</p>
<h3 id="heading-ctid-forms-the-basis-of-parallel-snapshotting">CTID forms the basis of Parallel Snapshotting</h3>
<p><a target="_blank" href="https://www.postgresql.org/docs/current/ddl-system-columns.html#:~:text=ctid">CTID</a> forms the basis of Parallel Snapshotting. Every row in a Postgres table has an internal column called CTID, also known as the tuple identifier. CTID is unique for each row of the table. It represents the exact location of the row on disk—it is the combination of the page/block number and the page offset. You can also query the CTID column for a table through a simple SELECT as you are seeing in the below image.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714054057800/03e06d8e-2368-453b-b8c0-58d7fbe16ac4.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-parallel-snapshotting-logically-partition-the-table-by-ctid-and-copy-multiple-partitions-simultaneously">Parallel Snapshotting - Logically Partition the Table by CTID and COPY Multiple Partitions Simultaneously</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714054167383/34f72a09-a6bb-48c2-813e-0b2eb48a467d.png" alt class="image--center mx-auto" /></p>
<p>Let's dive into the design of Parallel Snapshotting (a minimal SQL sketch of these steps follows the list):</p>
<ol>
<li><p>First, create a Postgres Snapshot using the function <code>pg_export_snapshot()</code>. This ensures that the dump and restore operate on a consistent snapshot of the database.</p>
</li>
<li><p>Second, using that snapshot, logically partition the large table based on CTIDs, i.e., create CTID ranges that encapsulate the table.</p>
</li>
<li><p>Once that is done, copy multiple such logical partitions in parallel from the source to the target.</p>
<ol>
<li><p>Essentially, you are running SELECT statements with these CTID ranges to read data from the source and write it to the target.</p>
</li>
<li><p>The SELECT statements with CTID ranges are very efficient because they use tid range scans, which are similar to index lookups on the CTID column.</p>
</li>
<li><p>Also, note that you are reading data in the order of how it is stored on the disk.</p>
</li>
</ol>
</li>
<li><p>We are using <code>COPY WITH BINARY</code> to <code>STDOUT</code> and from <code>STDIN</code>, which makes the dump and restore simultaneous.</p>
</li>
<li><p>We are also using cursors to ensure that the dump doesn’t exhaust memory.</p>
</li>
</ol>
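<p>To make the steps above concrete, here is a minimal SQL sketch of the mechanism, assuming the pgbench_accounts table from the benchmark and Postgres 14+ (which supports TID range scans). It is a simplification of the approach rather than exactly what PeerDB runs internally; PeerDB adds worker parallelism, cursors, and binary COPY piping between the source and the target.</p>
<pre><code class="lang-sql">-- Session A (kept open for the duration of the copy): export a consistent snapshot.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns an id such as '00000004-0000006A-1'

-- Each worker session imports that snapshot and copies one CTID range.
-- The CTID bounds below are placeholders for a computed logical partition.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000004-0000006A-1';
COPY (
    SELECT * FROM pgbench_accounts
    WHERE ctid BETWEEN '(0,0)'::tid AND '(100000,0)'::tid   -- satisfied via a TID range scan
) TO STDOUT WITH (FORMAT binary);

-- The matching worker on the target ingests the same stream:
-- COPY pgbench_accounts FROM STDIN WITH (FORMAT binary);
</code></pre>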
<h3 id="heading-migrating-a-15tb-table-5-times-faster-with-parallel-snapshotting">Migrating a 1.5TB table 5 times faster with Parallel Snapshotting</h3>
<p>At PeerDB, we are building a Postgres replication tool to provide a fast and cost-effective way to move data from Postgres to data warehouses such as <a target="_blank" href="https://www.snowflake.com/en/">Snowflake</a>, BigQuery, <a target="_blank" href="https://clickhouse.com/">ClickHouse</a>, <a target="_blank" href="https://www.postgresql.org/">PostgreSQL</a>, and queues like <a target="_blank" href="https://kafka.apache.org/">Kafka</a>, <a target="_blank" href="https://redpanda.com/">Redpanda</a>, <a target="_blank" href="https://cloud.google.com/pubsub?hl=en">Google PubSub</a>, <a target="_blank" href="https://azure.microsoft.com/en-us/products/event-hubs">Azure Event Hubs</a>, etc.</p>
<p>To enable faster migrations from one Postgres database to another, we have implemented Parallel Snapshotting within our product. Through this feature, our customers are able to move terabytes of data in a few hours versus days.</p>
<p>We ran the same benchmark as above to move a 1.5 TB <a target="_blank" href="https://www.postgresql.org/docs/current/pgbench.html">pgbench_accounts</a> table from one Postgres database to another, and it took just 7 hours with PeerDB. This was 5x faster than using <code>pg_dump</code> and <code>pg_restore</code>. This speedup was possible through the Parallel Snapshotting feature. The performance can be further improved by increasing the number of parallel threads for the migration and by using beefier Postgres source and target databases.</p>
<h2 id="heading-conclusion-and-references">Conclusion and References</h2>
<p>The intent of this blog is to share the design principles we followed to enable faster database migrations and discuss how they can be extended to enhance <code>pg_dump</code> and <code>pg_restore</code> in the future. Hope you enjoyed reading the blog. Sharing a few relevant links for reference:</p>
<ol>
<li><p><a target="_blank" href="https://tech.gadventures.com/speeding-up-postgres-restores-de575149d17a">Speeding up Postgres restores</a></p>
</li>
<li><p><a target="_blank" href="https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/faster-data-migrations-in-postgres/ba-p/2150850">Faster Data Migrations in Postgres</a></p>
</li>
<li><p><a target="_blank" href="https://blog.peerdb.io/faster-postgres-migrations-using-peerdb-part-1">Faster Postgres Migrations using PeerDB</a></p>
</li>
<li><p><a target="_blank" href="https://www.youtube.com/watch?v=pgJwT9vcwI8">Podcast on Logical replication common issues</a></p>
</li>
<li><p>Try <a target="_blank" href="https://github.com/PeerDB-io/peerdb">PeerDB Open Source</a> for fast Postgres migration and replication</p>
</li>
<li><p>Try <a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a>, the fully managed offering of PeerDB</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[PeerDB raises $3.6 million seed funding to revolutionize data movement for PostgreSQL]]></title><description><![CDATA[PeerDB offers a fast and cost-effective way to move data from PostgreSQL to data warehouses, such as Snowflake, and to queues like Kafka. This enables businesses to have real-time and reliable access to data, which is of utmost importance in this AI ...]]></description><link>https://blog.peerdb.io/peerdb-raises-3-6-million-seed-funding-to-revolutionize-data-movement-for-postgresql</link><guid isPermaLink="true">https://blog.peerdb.io/peerdb-raises-3-6-million-seed-funding-to-revolutionize-data-movement-for-postgresql</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[ETL]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[devtools]]></category><category><![CDATA[enterprise]]></category><category><![CDATA[funding]]></category><category><![CDATA[Investment]]></category><category><![CDATA[Venture Capital]]></category><category><![CDATA[replication]]></category><category><![CDATA[data]]></category><category><![CDATA[postgres]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Thu, 11 Apr 2024 05:55:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1712769385469/0b629288-e7b5-47b9-af42-4894cfe327b4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>PeerDB offers a fast and cost-effective way to move data from PostgreSQL to data warehouses, such as Snowflake, and to queues like Kafka. This enables businesses to have real-time and reliable access to data, which is of utmost importance in this AI era. PeerDB’s overarching vision is to become the de facto standard for data movement and ETL (extract, transform and load) for companies that run their businesses on Postgres.</p>
<p><strong>SAN FRANCISCO — April 11, 2024 —</strong><a target="_blank" href="https://www.peerdb.io/">PeerDB</a>, the leading data movement platform for PostgreSQL, today announced it received $3.6 million in seed round funding. Investors in the round include lead investor <a target="_blank" href="https://www.8vc.com/">8VC</a>, <a target="_blank" href="https://www.ycombinator.com/companies/peerdb">Y Combinator</a>, <a target="_blank" href="https://www.wayfinder.com/">Wayfinder Ventures</a>, <a target="_blank" href="https://winfunding.com/">Webb Investment Network</a>, <a target="_blank" href="https://www.flexcapital.com/">Flex Capital</a>, <a target="_blank" href="https://rogue.capital/">Rogue Capital</a>, <a target="_blank" href="https://www.pioneerfund.vc/">Pioneer Fund</a>, <a target="_blank" href="https://www.orangecollective.vc/">Orange Collective</a> and several angel investors.</p>
<p>PeerDB will use the funds to continue building its engineering team, propelling its go-to-market and client acquisition initiatives and supporting its growth. PeerDB revenue is doubling every two months.</p>
<p><em>"Postgres is becoming the database of the world and the de facto primary database for both enterprises and SMBs. Existing data movement and ETL tools are not built for Postgres – they often fail at scale due to painfully slow syncs, lack of reliability and lack of native features. The time has come for someone to give enough care to the world's most adopted open source database. Thanks to all our investors and customers for believing in us and sharing our vision,"</em> <strong>said PeerDB CEO and co-founder Sai Krishna Srirampur.</strong></p>
<p><em>"At PeerDB, we're tackling inefficiencies surrounding Postgres data movement. With fundamental optimizations like parallel snapshotting and handling the nuances around replication slots that are absent in the existing ETL tools, we’re focused on building a system specialized for Postgres at terabyte scale – our approach diverges from traditional methods that falter at scale and resort to resyncs. Our team of Postgres experts not only provides data movement services but also becomes an essential part of client teams, offering advice on database tuning and query optimization. As we brace for the influx of data driven by the era of LLMs, we're committed to ensuring Postgres data movement remains efficient and scalable,"</em> <strong>said PeerDB CTO and co-founder Kaushik Iska.</strong></p>
<h2 id="heading-the-problem">The Problem</h2>
<p>In the current data and AI landscape, existing data movement and ETL (extract, transform, and load) tools prioritize the breadth over the quality of connectors and are not optimized for Postgres. Users often face issues such as painfully slow syncs — syncing hundreds of gigabytes of data can take days; unreliability, characterized by frequent crashes and loss of data precision; and feature limitations, including a lack of configurability and support for native data types.  </p>
<p>PeerDB distinguishes itself by prioritizing the quality over the breadth of connectors and tailoring its design specifically for Postgres. Through this, PeerDB offers 10 times faster data movement at one-fifth the cost.</p>
<h2 id="heading-why-now">Why Now?</h2>
<p><em>"8VC's investment in PeerDB is driven by our conviction in the exponential growth trajectory of Postgres. We see an exceptional team in Sai and Kaushik, whose deep expertise in Postgres positions them uniquely in the marketplace. We believe that a solid narrative around data movement and ETL will play a pivotal role in the success and widespread adoption of Postgres,"</em> said <a target="_blank" href="https://www.8vc.com/team/bhaskar-ghosh"><strong>Bhaskar Ghosh</strong></a><strong>, partner at 8VC</strong>.</p>
<p>The time is ripe to develop a first-class data movement tool for Postgres, as it is becoming the world's most popular database. Currently ranked fourth in the <a target="_blank" href="https://db-engines.com/en/ranking">DB-Engines Ranking</a> of database management systems (DBMS) based on popularity, Postgres is the only DBMS in the top four experiencing growth. It also received DB-Engines' <a target="_blank" href="https://db-engines.com/en/blog_post/106">DBMS of the Year 2023</a> award for gaining more popularity than any other of the 417 monitored systems in 2023. Enterprise adoption of Postgres is on the rise, <a target="_blank" href="https://www.techtarget.com/searchdatamanagement/news/252472281/PostgreSQL-12-boosts-open-source-database-performance">with 50% of enterprise companies</a> already using it. This trend is set to significantly increase the volume of data stored and moved through Postgres, requiring an ETL tool that can handle this scale.</p>
<p>The introduction and funding of PeerDB is also timely due to the widespread focus on artificial intelligence (AI) and hyperscale data analytics, which often require movement of massive datasets from primary database platforms such as Postgres to data warehouses for AI-based analytics to help provide insights and inform business decisions.</p>
<h2 id="heading-peerdbs-vision-and-use-cases">PeerDB's Vision and Use cases</h2>
<p>PeerDB's vision is to become the de facto standard for data movement and ETL for companies that run their businesses on Postgres, encompassing use cases such as:</p>
<ul>
<li><p><strong>Fast and cost-effective replication to data warehouses:</strong> Replicate data from Postgres to analytical stores (<a target="_blank" href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a>) such as Snowflake, BigQuery and ClickHouse for AI-based analytics informing business decisions in use cases like fraud or anomaly detection. PeerDB has already made its mark here with a rapidly growing <a target="_blank" href="https://www.peerdb.io/customers">customer base</a> that strongly challenges incumbents like Fivetran.</p>
</li>
<li><p><strong>Real-time streaming and change data capture:</strong> Low-latency replication from Postgres to queues, such as Kafka, enabling use cases like real-time alerting and micro services-based architectures. PeerDB already supports Kafka, Azure Event Hubs, and Google PubSub as targets, serving as an enterprise-grade alternative to Debezium.</p>
</li>
<li><p><strong>Database migrations:</strong> Migrating data from legacy databases like Oracle and SQL to Postgres to support modernization and digital transformation initiatives.</p>
</li>
<li><p><strong>Enterprise-grade Postgres high availability (HA) and backups</strong>: As enterprises modernize their database stack by migrating from Oracle and SQL Server to Postgres, managing HA and backups across regions and hybrid on-premises environments becomes critical. The infrastructure that PeerDB is developing can be extended to support such mission-critical use cases in the future.</p>
</li>
<li><p><strong>Vector ETL:</strong> Extracting unstructured data at scale, transforming it into vector embeddings with LLMs, and loading these into Postgres using <a target="_blank" href="https://github.com/pgvector/pgvector">pgvector</a>. This enables semantic searches for advanced AI applications.</p>
</li>
</ul>
<h2 id="heading-peerdb-is-built-for-postgres">PeerDB is built for Postgres</h2>
<p>Below are a few product differentiators of PeerDB:</p>
<ul>
<li><p><strong>Faster Postgres data movement:</strong> PeerDB implements <a target="_blank" href="https://blog.peerdb.io/parallelized-initial-load-for-cdc-based-streaming-from-postgres#heading-parallelized-initial-snapshot-for-cdc-based-streaming">parallel snapshotting</a>, the fastest way for moving Postgres data. This can dramatically reduce the time required to move massive datasets, often from days to hours, while ensuring consistency.</p>
</li>
<li><p><strong>Native Postgres data type support and replication:</strong> PeerDB specializes in natively replicating advanced data types like JSONB and geospatial, crucial for IoT apps and geospatial applications. With data available in native formats, users save the time and effort required for data transformation, since the data is already in the format necessary for their AI-based analytics and other applications.</p>
</li>
<li><p><strong>Cost optimizations:</strong> PeerDB can reduce data movement costs by up to five times compared to incumbent ETL tools. PeerDB published a <a target="_blank" href="https://blog.peerdb.io/moving-a-billion-postgres-rows-on-a-100-budget">white paper</a> detailing the data modeling and infrastructure optimizations employed to save costs for customers.</p>
</li>
</ul>
<h2 id="heading-customers">Customers</h2>
<p><a target="_blank" href="https://www.peerdb.io/customers">PeerDB customers</a> include:</p>
<ul>
<li><p><a target="_blank" href="https://www.peerdb.io/customers/harmonic-customer-story">Harmonic AI</a> replaced an incumbent ETL solution with PeerDB, saving $80,000 yearly and reducing their yearly data-movement costs by five times.</p>
</li>
<li><p><a target="_blank" href="https://www.peerdb.io/customers/expedock-customer-story">Expedock</a> uses PeerDB to deliver real-time AI-driven supply chain automation, replicating 700 million rows monthly from Postgres to Snowflake with under one minute latency and five times cost savings compared to their previous ETL tool.</p>
</li>
<li><p><a target="_blank" href="https://www.peerdb.io/customers/peerdb-fiber-ai-customer-story">Fiber AI</a> uses PeerDB to replicate terabytes of data from Postgres to ClickHouse in real time, powering their real-time search use case.</p>
</li>
<li><p><a target="_blank" href="https://www.peerdb.io/customers/peerdb-flatiron-health-customer-story">Flatiron Health</a> used PeerDB to migrate 35,000 tables with terabytes of data from Postgres to Snowflake within a week.</p>
</li>
</ul>
<p><em>"We’re using PeerDB already for our Postgres to ClickHouse ETL and it’s insanely fast and accurate! We can’t believe how well it works. The PeerDB team has been super helpful in getting us set up, helping us debug, and advising us on everything related to ClickHouse and Postgres. Great work, guys!"**</em>said Neel Mehta, CTO, Fiber AI.**</p>
<h2 id="heading-how-to-try-peerdb">How to try PeerDB?</h2>
<p>You can try PeerDB through one of <a target="_blank" href="https://www.peerdb.io/#prices">three offerings</a>: <a target="_blank" href="https://github.com/PeerDB-io/peerdb">Open source</a>, <a target="_blank" href="https://app.peerdb.cloud/">a fully managed cloud service</a> and a <a target="_blank" href="https://www.peerdb.io/sign-up">self-hosted enterprise offering</a>.</p>
<h2 id="heading-founders">Founders</h2>
<p>The co-founders of PeerDB, CEO <a target="_blank" href="https://www.linkedin.com/in/sai-krishna-srirampur-1741b019/">Sai Krishna Srirampur</a> and CTO <a target="_blank" href="https://www.linkedin.com/in/kaushikiska/">Kaushik Iska</a>, have been friends since high school and were roommates in college at the <a target="_blank" href="https://www.iiit.ac.in/">International Institute of Information Technology Hyderabad</a>, where they both studied computer science. Sai Krishna Srirampur was an early engineer at Citus Data, which was acquired by Microsoft. There, he led solutions engineering for all Postgres services on Microsoft Azure. Kaushik Iska built operating systems and led data teams at Google, SafeGraph, and Palantir Technologies. He also represented India in the International Collegiate Programming Contest (ACM ICPC) World Finals. They have been building Postgres products for a decade now and have closely worked with Postgres customers running into issues with existing ETL tools. To fill this gap, they founded PeerDB. Below is the image of the PeerDB Founding Team.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712814146601/56d6fd3c-9065-421c-b5a0-ef9f34396e2b.png" alt="PeerDB Team" class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[PeerDB is GDPR Compliant]]></title><description><![CDATA[We are excited to share a significant achievement at PeerDB: we have achieved full compliance with the General Data Protection Regulation (GDPR). This milestone represents our unwavering dedication to data protection and privacy, further strengthenin...]]></description><link>https://blog.peerdb.io/peerdb-is-gdpr-compliant</link><guid isPermaLink="true">https://blog.peerdb.io/peerdb-is-gdpr-compliant</guid><category><![CDATA[#gdpr]]></category><category><![CDATA[Security]]></category><category><![CDATA[compliance ]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[snowflake]]></category><category><![CDATA[bigquery]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[replication]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Kunal Gupta]]></dc:creator><pubDate>Thu, 04 Apr 2024 15:37:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1712245065079/ca2c1d6c-0389-4d97-8805-44180c6a9485.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We are excited to share a significant achievement at <a target="_blank" href="https://peerdb.io/">PeerDB</a>: we have achieved full compliance with the <a target="_blank" href="https://gdpr-info.eu/">General Data Protection Regulation (GDPR)</a>. This milestone represents our unwavering dedication to data protection and privacy, further strengthening the trust our clients and partners have placed in us.</p>
<p><strong>Our Ongoing Dedication to Data Privacy and Security</strong></p>
<p>At PeerDB, we understand the critical importance of data privacy and security. Our team has worked diligently to ensure that our practices align with the stringent requirements of the GDPR. For a detailed overview of our security measures, we invite you to explore our updated <a target="_blank" href="https://trust.peerdb.io/">Trust Center</a>.</p>
<p><strong>Benefits of GDPR Compliance for Our Clients</strong></p>
<ul>
<li><p><strong>Enhanced Data Security:</strong> With GDPR compliance in place, we guarantee the highest level of security and confidentiality for all data entrusted to us. From data encryption to strict access controls, we have implemented comprehensive measures to safeguard your information.</p>
</li>
<li><p><strong>Transparency and Control:</strong> <a target="_blank" href="https://trust.peerdb.io/">Our Trust Center</a> serves as a central hub for information on our security infrastructure, organizational practices, and third-party engagements. We believe in transparency and empower our clients with the control they need over their data.</p>
</li>
<li><p><strong>Continual Improvement:</strong> Our commitment to data security doesn't end with GDPR compliance. We are dedicated to continuously evaluating and enhancing our security measures to not only meet current regulations but also stay ahead of emerging threats.</p>
</li>
</ul>
<p><strong>Looking Ahead: Our Next Steps in Security</strong></p>
<p>While we celebrate our GDPR compliance, we are also looking towards the future. As part of our ongoing commitment to security excellence, we are preparing for SOC 2 Type 2 certification and are currently undergoing the audit. This additional certification will further validate our dedication to maintaining high security standards for our services.</p>
<p><strong>Empowering Your Business with PeerDB Cloud</strong></p>
<p><a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a> provides a secure and scalable platform for businesses to fulfil all their Postgres Data Movement needs. With GDPR compliance at its core, <a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a> offers peace of mind, knowing that your data is protected in a robust and reliable environment.</p>
<p><strong>Why GDPR Compliance Matters</strong></p>
<p>Achieving GDPR compliance goes beyond meeting legal requirements—it's about building trust with our clients and exceeding industry standards. It demonstrates our commitment to ensuring the privacy and security of the data we handle.</p>
<p><strong>Your Trust, Our Priority</strong></p>
<p>At PeerDB, we are dedicated to being a reliable partner in your digital journey. We are here to assist you in leveraging our services with confidence, knowing that your data is protected. For more information on our GDPR compliance efforts, <a target="_blank" href="https://app.peerdb.cloud/">PeerDB Cloud</a>, or any other security-related inquiries, please visit our <a target="_blank" href="https://trust.peerdb.io/">Trust Center</a>.</p>
<p>Thank you for being part of this journey with us. We look forward to continuing to provide secure and trusted solutions for all your data movement needs.</p>
]]></content:encoded></item><item><title><![CDATA[Exploring versions of the Postgres logical replication protocol]]></title><description><![CDATA[Introduction
Logical Replication is one of the many ways a Postgres database can replicate data to other Postgres database (a.k.a standby). Logical replication directly reads from the write-ahead log (WAL), recording every database change, avoiding t...]]></description><link>https://blog.peerdb.io/exploring-versions-of-the-postgres-logical-replication-protocol</link><guid isPermaLink="true">https://blog.peerdb.io/exploring-versions-of-the-postgres-logical-replication-protocol</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[postgres]]></category><category><![CDATA[replication]]></category><category><![CDATA[high availability]]></category><dc:creator><![CDATA[Kevin Biju]]></dc:creator><pubDate>Mon, 01 Apr 2024 16:57:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1711920289167/ed90c561-532b-4fea-8eb9-9ef617f7337a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction"><strong>Introduction</strong></h2>
<p><a target="_blank" href="https://www.postgresql.org/docs/current/logical-replication.html">Logical Replication</a> is one of the <a target="_blank" href="https://www.postgresql.org/docs/16/different-replication-solutions.html">many</a> ways a Postgres database can replicate data to other Postgres database (a.k.a standby). Logical replication directly reads from the <a target="_blank" href="https://www.postgresql.org/docs/current/wal-intro.html">write-ahead log</a> (WAL), recording every database change, avoiding the need to intercept queries or periodically read the table. These changes are filtered, serialized and then sent to the standby servers where they can be applied. While logical replication is intended to be used by Postgres databases to send and receive changes, it also allows ETL tools like <a target="_blank" href="https://www.peerdb.io/">PeerDB</a> to get a reliable stream of changes that can be processed as needed.</p>
<p>Logical replication started by only allowing streaming of committed transactions. It then evolved to support streaming of in-flight transactions, followed by <a target="_blank" href="https://www.postgresql.org/docs/current/two-phase.html">two-phase commits</a>, and then parallel apply of in-flight transactions. This blog dives into this evolution and its impact on performance, and presents some useful benchmarks. It is useful for anyone who uses Postgres Logical Replication in practice!</p>
<h2 id="heading-components-of-logical-replication"><strong>Components of logical replication</strong></h2>
<p>For a quick rundown, a full logical replication setup involves several crucial components. <strong>Please skip this section if you are already familiar with the concepts of logical replication.</strong></p>
<p>1. <a target="_blank" href="https://www.postgresql.org/docs/current/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS"><strong>Replication Slot</strong></a>: A replication slot on the primary server is what reads changes from the WAL and passes it to the output plugin to be serialized and sent to the standby server (or ETL tool) to be applied. Periodically, the standby server sends a message to the primary to confirm that it has read the WAL to a certain point, at which point the slot can advance.</p>
<p>2. <a target="_blank" href="https://www.postgresql.org/docs/current/logical-replication-publication.html"><strong>Publication</strong></a>: A publication is essentially a filter on the WAL changes. Publications are very powerful and can filter out schemas, tables and even particular columns of tables. You can also choose to publish inserts and not updates and also apply custom logic to filter out certain rows. When a standby starts reading from a replication slot, a set of publications are passed as input.</p>
<p>3. <a target="_blank" href="https://www.postgresql.org/docs/current/logical-replication-subscription.html"><strong>Subscriptions</strong></a>: A subscription is basically the Postgres syntax for creating a logical replication connection to a primary server for replicating changes from a slot and a set of publications. The standby then reads data from the primary and replicates it as long as the subscription is active. While this is Postgres specific, other tools end up behaving like subscribed standbys and get the same output from the primary server.</p>
<p>4. <a target="_blank" href="https://www.postgresql.org/docs/current/logicaldecoding-explanation.html#LOGICALDECODING-EXPLANATION-OUTPUT-PLUGINS"><strong>Output plugins</strong></a>: The replication slot passes raw WAL change data to an output plugin which serializes it to a stream of messages. This helps with the interoperability of logical replication as the message format is independent of the underlying database version or configuration. The de-facto output plugin is a Postgres project called <code>pgoutput</code> but other plugins like <code>wal2json</code> and <code>decoderbufs</code> enjoy support among the community.</p>
<h2 id="heading-wait-logical-replication-has-versions"><strong>Wait, logical replication has versions?</strong></h2>
<p>When starting logical replication (<a target="_blank" href="https://www.postgresql.org/docs/current/protocol-logical-replication.html">START_REPLICATION</a>), there is a <a target="_blank" href="https://www.postgresql.org/docs/current/protocol-logical-replication.html">parameter</a> called <code>proto_version</code> that allows users to opt in to newer semantics of the logical replication protocol. Starting with Postgres 14 in September 2021, three new <code>proto_version</code>s of logical replication have been added in consecutive releases. Looking at the docs for <code>proto_version</code> right now, we see this:</p>
<pre><code class="lang-plaintext">proto_version
    Protocol version. Currently versions 1, 2, 3, and 4 are supported.

    Version 2 is supported only for server version 14 and above, 
    and it allows streaming of large in-progress transactions.

    Version 3 is supported only for server version 15 and above, 
    and it allows streaming of two-phase commits.

    Version 4 is supported only for server version 16 and above, and it 
    allows streams of large in-progress transactions to be applied 
    in parallel.
</code></pre>
<p>While these all sound like good things, it's not clear for the average reader what they mean or what problems are being tackled. And for the informed reader who knows what these changes mean, it'd still be nice to understand how they are implemented and their impact on real-world workloads.</p>
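<p>For reference, the protocol version is chosen by the client when it issues <code>START_REPLICATION</code> on a replication connection, as a <code>pgoutput</code> option alongside the publication list. A minimal sketch (slot name, start LSN and publication name are placeholders):</p>
<pre><code class="lang-sql">START_REPLICATION SLOT orders_slot LOGICAL 0/0
  (proto_version '2', publication_names 'orders_pub', streaming 'on')
</code></pre>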
<h3 id="heading-v1-the-status-quo"><strong>v1 - the status quo</strong></h3>
<p>To analyze the messages and semantics of the various protocol versions, we've written a small Go application called <code>polorex</code>. If you want to check out the code or try things out for yourself, check out the code in <a target="_blank" href="https://github.com/PeerDB-io/polorex">this repo</a>.</p>
<p>To simulate a workload, we are running 2 transactions concurrently, inserting rows into the same table. The transactions insert rows in 100 batches of 250,000, totalling 50 million rows. The workload is simulated by a subcommand of the <code>polorex</code> application. The transactions are read and analyzed by another subcommand called <code>txnreader</code> which connects to the database and continuously reads the replication slot.</p>
<pre><code class="lang-plaintext">./polorex txnreader -port 7132
[in a different terminal]
./polorex txngen -port 7132 -iterations 100 -batchsize 250000 -parallelism 2
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706541313918/7fc71484-2f5c-47a4-9ad5-4c974251dd2c.png" alt class="image--center mx-auto" /></p>
<p>The transactions start at the green line and end at the red line. We can see how the transactions are being read only after they commit. It takes 3-4 minutes to decode both our transactions. Since we just committed 2 large transactions, the <code>pgoutput</code> plugin has to read a lot of WAL at once and then serialize it into 50 million <code>INSERT</code> messages to be sent over. While the graph shows that we are reading almost 250K inserts per second, one can see how this could quickly get out of hand for larger transactions with wider schemas. We could quickly fall behind the primary server purely due to this decoding overhead.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706541553668/831dc1f1-7a86-4d24-a0ae-89c535f89dba.png" alt class="image--center mx-auto" /></p>
<p>Another issue which follows from this but is less obvious is with regards to the size of the replication slot. This is basically the amount of WAL being retained for the slot to decode changes without losing any data. Looking at the graph, it quickly rises as the transactions progress, <strong>but also stays high until both transactions are read</strong> at which point it falls dramatically. This can be an issue in workloads with high throughput and large transactions - the WAL being retained can reach hundreds of gigabytes within a matter of hours, thereby consuming the entire disk space and crashing the Postgres server.</p>
<p>With this insight in mind, we can see how version 2's promise of allowing <code>streaming of large in-progress transactions</code> sounds enticing. But there is also a simplicity in version 1 of only sending changes over when they are committed. We read a <code>BeginMessage</code> and everything from there onward is fair game to be replicated immediately. In contrast, an "in-progress" transaction could be rolled back at any point, and therefore all the changes read so far need to be staged somehow before being replicated.</p>
<h3 id="heading-v2-rows-down-the-stream"><strong>v2 - rows down the stream</strong></h3>
<p>To begin with, we restart <code>txnreader</code> with a flag to ask it to use protocol version 2 while connecting to the slot. We then rerun the same <code>txngen</code> workload.</p>
<pre><code class="lang-plaintext">./polorex txnreader -port 7132 -protocol 2
[in a different terminal]
./polorex txngen -port 7132 -iterations 100 -batchsize 250000 -parallelism 2
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706548856165/1b570457-9d75-45b2-baa2-1b3e48ffaa73.png" alt class="image--center mx-auto" /></p>
<p>We are seeing a completely different story in terms of how the transactions are being processed here. It's clear that we're getting rows way before the transaction even commits. We're actually seeing <code>streaming of in-progress transactions</code>! Rows for a particular transaction come to us between a <code>StreamStartMessage</code> and <code>StreamStopMessage</code>, and we get several of these streams while rows are still being sent over. We are getting streams for both of our transactions before any of them commit, but we are still only reading 1 transaction at a time.</p>
<p>A transaction being streamed now commits using a <code>StreamCommitMessage</code>, but unlike the <code>Commit</code> message from earlier, we <strong>need</strong> to wait for this since the fate of the transaction is not known yet. Alternatively, we could receive a <code>StreamAbortMessage</code> which implies transaction rollback and so all our changes for said transaction should not be applied.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706550671096/fd6abba6-31a0-4380-9cc4-e98207c4e9c6.png" alt class="image--center mx-auto" /></p>
<p>The improvements from streaming are nothing short of dramatic: transactions are fully read seconds after the rows finish inserting, approximately 4 minutes earlier than with version 1. As a result, the slot size also decreases much more quickly.</p>
<h3 id="heading-results-v2-enables-faster-decoding-and-shorter-peak-slot-size-duration"><strong>Results - v2 enables faster decoding and shorter peak slot size duration</strong></h3>
<p>To reiterate, there is no magical improvement in transaction reading performance or peak slot size. The transactions themselves take about the same time to process and generate the same amount of WAL, but since the replication happens in parallel with the transaction, we see better performance.</p>
<p>In version 2, transactions are fully decoded and the slot size decreases immediately after the transactions complete, whereas version 1 requires an additional 4 minutes. This can have a drastic impact on workloads with high throughput and sizable transactions - version 2 can be very helpful in enhancing logical decoding performance and keeping the slot size in check!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706568527289/9278eb76-9e32-4916-8b12-0641636edd3a.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-v3-and-v4-2pc-and-parallel-apply"><strong>v3 and v4 - 2PC and parallel apply</strong></h3>
<p>Version 3 introduces new message types to manage <a target="_blank" href="https://www.postgresql.org/docs/current/sql-prepare-transaction.html">two-phase commit</a> transactions. While significant in certain scenarios, the concept of two-phase commit remains relatively niche from an ELT standpoint.</p>
<p>Version 4 is less clear in its description, and even the documentation doesn't venture much further than this. As it turns out, it doesn't refer to applying multiple transactions in parallel, but to spreading out the load of applying a single large in-progress transaction over multiple processes on the standby. For this, new fields have been added to some existing messages. This is again a great feature for some workloads, but not very useful from the standpoint of something else pretending to be a Postgres standby.</p>
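<p>For a plain Postgres standby, this behaviour is opted into via the <code>streaming</code> option on the subscription; for example (Postgres 16+, names are placeholders):</p>
<pre><code class="lang-sql">-- apply large in-progress transactions using parallel apply workers on the standby
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=primary.example.com dbname=app user=replicator'
  PUBLICATION orders_pub
  WITH (streaming = parallel);
</code></pre>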
<h3 id="heading-conclusion"><strong>Conclusion</strong></h3>
<p>Postgres logical replication is a powerful feature central to the distributed/HA Postgres ecosystem. By using version 2 of the logical replication protocol to stream in-flight transactions, we can efficiently manage WAL spikes during sizable transactions, enhancing logical decoding performance and mitigating disk full issues caused by replication slot growth. Additionally, this approach reduces the lag between the Postgres source and its readers.</p>
<p>At <a target="_blank" href="https://www.peerdb.io/">PeerDB</a>, we're developing a feature that utilizes version 2 of the logical replication protocol to consume changes from a Postgres database before they are committed. We believe this feature will significantly benefit Postgres users grappling with issues related to replication slot growth. Overall, version 2 of the logical replication protocol presents a promising solution for optimizing Postgres replication processes and improving overall reliability and performance.</p>
]]></content:encoded></item><item><title><![CDATA[Enterprise-grade Replication from Postgres to Azure Event Hubs]]></title><description><![CDATA[At PeerDB, we are building a fast and a cost-effective way to replicate data from Postgres to Data Warehouses and Queues. Today we are releasing our Azure Event Hubs connector. With this, you get a fast, simple, and reliable way to Change Data Captur...]]></description><link>https://blog.peerdb.io/enterprise-grade-replication-from-postgres-to-azure-event-hubs</link><guid isPermaLink="true">https://blog.peerdb.io/enterprise-grade-replication-from-postgres-to-azure-event-hubs</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Azure]]></category><category><![CDATA[Microsoft]]></category><category><![CDATA[kafka]]></category><category><![CDATA[postgres]]></category><category><![CDATA[streaming]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[enterprise software]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Fri, 15 Mar 2024 20:54:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1709774738214/8c7b3f01-51ec-47c5-9722-24a34130aa60.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At <a target="_blank" href="https://www.peerdb.io/">PeerDB</a>, we are building a fast and a cost-effective way to replicate data from <a target="_blank" href="https://www.postgresql.org/">Postgres</a> to Data Warehouses and Queues. Today we are releasing our <a target="_blank" href="https://azure.microsoft.com/en-us/products/event-hubs">Azure Event Hubs</a> connector. With this, you get a fast, simple, and reliable way to Change Data Capture (CDC) from PostgreSQL to Azure Event Hubs, enabling downstream apps to consume a raw feed of data from your PostgreSQL database in real-time. This enables use cases such as real-time alerting for Fraud or Anomaly detection in Banking/IoT, Operational Analytics, and more.</p>
<p>In this blog, we delve into existing approaches to replicate Postgres to Event Hubs and their challenges, as well as how PeerDB addresses these challenges to provide an Enterprise-grade experience!</p>
<h2 id="heading-status-quo">Status Quo</h2>
<h3 id="heading-debezium-is-hard-to-use-and-is-not-built-for-azure-event-hubs">Debezium is hard to use and is not built for Azure Event Hubs</h3>
<p>A common way to replicate data from Postgres to Event Hubs is to use Open Source tools such as <a target="_blank" href="https://debezium.io/">Debezium</a>. Below are a few challenges that we’ve heard from customers trying Debezium with Azure Event Hubs.</p>
<ol>
<li><p><strong>Limited Configurability:</strong> Debezium offers limited customization for Azure Event Hubs, including the inability to perform advanced mapping between tables and topics, lack of support for custom partitioning schemes per topic, and inability to flatten nested JSONs, among other limitations.</p>
</li>
<li><p><strong>High Setup and Maintenance Costs:</strong> One of the common concerns we hear from customers is that setting up and managing Debezium at a production-grade level is challenging. It often requires several months of work by a data engineering team to fully implement.</p>
</li>
<li><p><strong>Not Native to Azure Event Hubs:</strong> Debezium leverages the Kafka protocol over Event Hubs to support the Event Hubs connector. The Kafka protocol is <a target="_blank" href="https://learn.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide">not as developed</a> as the native APIs provided by Event Hubs.</p>
</li>
</ol>
<h2 id="heading-peerdb-for-change-data-capture-cdc-from-postgres-to-azure-event-hubs">PeerDB for Change Data Capture (CDC) from Postgres to Azure Event Hubs</h2>
<p>In the past 6 months, we have invested heavily to make replication from Postgres to Azure Event Hubs as robust as possible. We have implemented multiple usability, security, and performance-related features required for enterprise customers. Below are a few highlights.</p>
<h3 id="heading-simple-to-use-sql-layer-that-makes-life-very-easy">Simple to Use - SQL Layer that makes life very easy!</h3>
<p>Along with a simple UI, PeerDB provides a Postgres-compatible SQL layer to manage replication from Postgres to Azure Event Hubs. You just need to run a couple of SQL commands to set up a highly reliable CDC pipeline: <a target="_blank" href="https://docs.peerdb.io/sql/commands/create-peer#eventhub-peer">CREATE PEER</a> to make PeerDB aware of the Postgres and Event Hubs peers; <a target="_blank" href="https://docs.peerdb.io/usecases/Real-time%20CDC/postgres-to-azure-eventhubs#step-2-real-time-cdc-from-postgresql-to-event-hubs">CREATE MIRROR</a> to kick off the replication job.</p>
<p>The Postgres-compatible SQL layer comes in very handy for managing replication from a fleet of Postgres databases across different tenants or microservices to Azure Event Hubs. You can script out your pipelines using Python or any other language and use any CI tool to manage your data pipelines.</p>
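<p>As a rough sketch of that flow (peer names, table-to-topic mappings and options below are illustrative placeholders rather than exact syntax; the linked CREATE PEER and CREATE MIRROR docs are the authoritative reference):</p>
<pre><code class="lang-sql">-- make PeerDB aware of the source and target (connection options elided)
CREATE PEER source_pg FROM POSTGRES WITH (host = '...', port = 5432, ...);
CREATE PEER target_eventhubs FROM EVENTHUBS WITH (...);

-- kick off CDC, mapping Postgres tables to Event Hubs topics
CREATE MIRROR orders_cdc FROM source_pg TO target_eventhubs
  WITH TABLE MAPPING (public.orders:orders_topic);
</code></pre>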
<p>The following demo showcases PeerDB in action, replicating data from Postgres, running a multi-tenant SaaS app, to Azure Event Hubs.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.loom.com/share/1846057942f141e4afdadc030f55a421">https://www.loom.com/share/1846057942f141e4afdadc030f55a421</a></div>
<p> </p>
<h3 id="heading-blazing-fast-performance-with-sub-second-latency">Blazing fast performance with Sub-Second latency</h3>
<p>Use cases requiring replication from Postgres to Azure Event Hubs are highly latency-sensitive. For instance, consider an IoT app publishing raw changes to Event Hubs. PeerDB implements multiple optimizations to provide sub-second latency at high throughputs (10K+ TPS). A few of the optimizations include:</p>
<ol>
<li><p><a target="_blank" href="https://blog.peerdb.io/building-a-streaming-platform-in-go-for-postgres">Streaming instead of batching</a></p>
</li>
<li><p>Always consuming the logical replication slot</p>
</li>
<li><p>Parallel apply for Azure Event Hubs</p>
</li>
<li><p><a target="_blank" href="https://devblogs.microsoft.com/azure-sdk/announcing-the-stable-release-of-the-azure-event-hubs-client-library-for-go/">Using native APIs (not the Kafka layer) to ingest into Azure Event Hubs</a></p>
</li>
</ol>
<h3 id="heading-highly-configurable-do-almost-anything-you-want">Highly Configurable - do almost anything you want!</h3>
<p>PeerDB provides many nuts and bolts to manage the behavior of CDC. You can control data formats/transformations, security/isolation, and performance while replicating data from Postgres to Azure Event Hubs. A few of them include:</p>
<ol>
<li><p><strong>Topics can be spread across Namespaces and Subscriptions:</strong> You can replicate data from multiple Postgres tables to Event Hubs spread across namespaces and even subscriptions. This ensures guaranteed isolation across topics, which could be critical in multi-tenant SaaS apps.</p>
</li>
<li><p><strong>Define custom partition keys and partition counts across topics:</strong> To configure performance across topics, you can define custom partition keys and partition counts per topic.</p>
</li>
<li><p><strong>Flatten JSON and JSONB columns:</strong> PeerDB allows you to deep flatten JSON and JSONB columns in Postgres into separate key&lt;&gt;value pairs on Azure Event Hubs.</p>
</li>
</ol>
<h3 id="heading-enterprise-grade-security-and-isolation">Enterprise grade Security and Isolation</h3>
<p>We designed the Azure Event Hubs connector specifically for Enterprise customers. Below are a few security features/items that PeerDB provides.</p>
<ol>
<li><p><strong>Guaranteed isolation across Azure Event Hubs topics:</strong> PeerDB provides the ability to replicate data from multiple tables in Postgres to separate topics spread across different namespaces and Azure subscriptions. This ensures guaranteed isolation across topics, which could be critical in multi-tenant SaaS apps, where you are providing raw DB feed to your customers.</p>
</li>
<li><p><strong>PeerDB Enterprise Offering:</strong> For enterprise customers, PeerDB provides the self-hosted offering, which comes with production-ready Helm charts and Enterprise-grade support. This enables you to provision PeerDB in <a target="_blank" href="https://azure.microsoft.com/en-us/products/kubernetes-service">Azure Kubernetes Services (AKS)</a> within your own VNET.</p>
</li>
</ol>
<h3 id="heading-production-ready-observability">Production ready Observability</h3>
<ol>
<li><p><strong>PeerDB UI:</strong> PeerDB comes with a comprehensive UI to monitor the replication jobs. You can monitor performance (throughput and latency), logs, and Postgres native metrics such as replication slot size. Additionally, you can create alerts for these metrics and send them to various channels such as Email and Slack.</p>
</li>
<li><p><strong>Integration with Azure Monitor:</strong> PeerDB Enterprise can run on Azure Kubernetes Services (AKS). AKS has out-of-the-box integration with Azure Monitor to manage metrics, logs and alerts.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709922781369/d14b632a-d72e-4e22-9ef2-d9b4c73b5dbb.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709774232242/6d91c922-ab1f-45bc-921a-3a47f282918b.png" alt="Monitor throughput and latency of replication" class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709774299256/b4d60297-8adc-45f8-a265-4b8146d4a94b.png" alt="Monitor Postgres replication slot growth" class="image--center mx-auto" /></p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Hope you enjoyed reading this blog! The Azure Event Hubs connector is being used in production by a few large-scale Postgres Azure customers. If you are interested in trying this out, please reach out to us through the <a target="_blank" href="https://www.peerdb.io/sign-up">Contact Us</a> form on our website.</p>
<p>We are actively working to extend similar support to other queues including Kafka and Google Pub Sub. If you are interested in previewing PeerDB for these queues, reach out to us through the <a target="_blank" href="https://www.peerdb.io/sign-up">Contact Us</a> form. We also offer a <a target="_blank" href="https://app.peerdb.cloud">30 day free trial for PeerDB Cloud</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Comparing Postgres Managed Services: AWS, Azure, GCP and Supabase]]></title><description><![CDATA[At PeerDB, we are building a fast and a cost-effective way to replicate data from Postgres to Data Warehouses such as Snowflake, BigQuery, ClickHouse, Postgres and so on. All our customers run Postgres at the heart of the data stack, running fully ma...]]></description><link>https://blog.peerdb.io/comparing-postgres-managed-services-aws-azure-gcp-and-supabase</link><guid isPermaLink="true">https://blog.peerdb.io/comparing-postgres-managed-services-aws-azure-gcp-and-supabase</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[AWS]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Azure]]></category><category><![CDATA[supabase]]></category><category><![CDATA[postgres]]></category><dc:creator><![CDATA[Sai Srirampur]]></dc:creator><pubDate>Mon, 04 Mar 2024 17:24:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1708971058926/279bf17a-8b1c-477d-aed3-ddf6f8f724fb.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At <a target="_blank" href="https://www.peerdb.io/">PeerDB</a>, we are building a fast and a cost-effective way to replicate data from Postgres to Data Warehouses such as Snowflake, BigQuery, ClickHouse, Postgres and so on. All our customers run Postgres at the heart of the data stack, running fully managed or self-hosted Postgres databases.</p>
<p>We often get asked about the preferred managed service for PostgreSQL. In that spirit, we are writing this blog to compare four popular options incl. <a target="_blank" href="https://aws.amazon.com/rds/postgresql/">AWS RDS Postgres</a>, <a target="_blank" href="https://azure.microsoft.com/en-us/products/postgresql/?ef_id=_k_Cj0KCQiA5-uuBhDzARIsAAa21T8Lx70H4gC97Kz9axfkTXKAI9m0aNfNuqSTpVnuuCepfNo725BrSy0aAk-JEALw_wcB_k_&amp;OCID=AIDcmm5edswduu_SEM__k_Cj0KCQiA5-uuBhDzARIsAAa21T8Lx70H4gC97Kz9axfkTXKAI9m0aNfNuqSTpVnuuCepfNo725BrSy0aAk-JEALw_wcB_k_&amp;gad_source=1&amp;gclid=Cj0KCQiA5-uuBhDzARIsAAa21T8Lx70H4gC97Kz9axfkTXKAI9m0aNfNuqSTpVnuuCepfNo725BrSy0aAk-JEALw_wcB">Azure Flexible Server Postgres</a>, <a target="_blank" href="https://cloud.google.com/sql/postgresql?hl=en">GCP Cloud SQL for Postgres</a>, and <a target="_blank" href="https://supabase.com/docs/guides/database/overview">Supabase Postgres</a>, across Performance, Costs and Features. We also acknowledge other providers like <a target="_blank" href="https://tembo.io/">Tembo</a>, <a target="_blank" href="https://www.crunchydata.com/products/crunchy-bridge">Crunchy Bridge</a>, <a target="_blank" href="https://neon.tech/">Neon</a> and <a target="_blank" href="https://www.timescale.com/">TimescaleDB</a> which we'll cover in a future post.</p>
<p>Note that this comparison aims to serve as a helpful <strong>"first"</strong> checklist for developers choosing a managed service. There may be things we missed, and we apologize for any oversights. We are happy to adjust our analysis based on feedback.</p>
<h1 id="heading-setup">Setup</h1>
<p>To ensure an apples-to-apples comparison, we aimed to match the four options as closely as possible in terms of RAM, vCores, disk space, PostgreSQL version, region, etc. The table below captures the details of the initial setup.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Cloud</strong></td><td><strong>AWS</strong></td><td><strong>GCP</strong></td><td><strong>Azure</strong></td><td><strong>Supabase</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Region</strong></td><td>us-east-1</td><td>us-east1</td><td>East US</td><td>East US</td></tr>
<tr>
<td><strong>PG Version</strong></td><td>16.1-R2</td><td>15, 16 unavailable</td><td>16</td><td>15, 16 unavailable</td></tr>
<tr>
<td><strong>DB Type</strong></td><td>db.m6i.large</td><td>Enterprise -&gt; Sandbox</td><td>Standard_D2s_v5</td><td>Large</td></tr>
<tr>
<td><strong>RAM</strong></td><td>8</td><td>8</td><td>8</td><td>8</td></tr>
<tr>
<td><strong>vCores</strong></td><td>2</td><td>2</td><td>2</td><td>2</td></tr>
<tr>
<td><strong>Disk Size</strong></td><td>100</td><td>100</td><td>100</td><td>100</td></tr>
<tr>
<td><strong>Disk Type / IOPs</strong></td><td>gp3 (3000)</td><td>3000</td><td>Premium SSD v2 (3000)</td><td>Not specified</td></tr>
<tr>
<td><strong>Default Arch</strong></td><td>x64</td><td>Not specified (probably x64)</td><td>x64</td><td>ARM</td></tr>
<tr>
<td><strong>HA</strong></td><td>Not enabled</td><td>Not enabled</td><td>Not enabled</td><td>Not enabled</td></tr>
<tr>
<td><strong>DB Disk Type (IOPS)</strong></td><td>SSD gp3 (3000)</td><td>3000</td><td>Premium SSD v2 (3000)</td><td>Not specified</td></tr>
</tbody>
</table>
</div><h1 id="heading-performance"><strong>Performance</strong></h1>
<h2 id="heading-benchmark-setup">Benchmark Setup</h2>
<p>All the performance tests were conducted using a VM (client) with the same compute capacity, colocated in the same region as the PostgreSQL database. We ran 3 main performance tests:</p>
<ol>
<li><p><a target="_blank" href="https://www.postgresql.org/docs/current/pgbench.html">pgbench</a> representing a typical Transactional (OLTP) workload</p>
</li>
<li><p>COPY command to Batch Insert (Upload) data to Postgres</p>
</li>
<li><p>SELECT command to Batch download data from Postgres</p>
</li>
</ol>
<h2 id="heading-pgbench">pgbench</h2>
<p>Across all the 4 managed PostgreSQL providers, <code>pgbench</code> was run for 24 hours with 8 parallel connections and 4 jobs <code>pgbench -c 8 -j 4 -P 30</code>. The graphs below capture a comparison of average throughput i.e. transactions per second (TPS), average latency and average CPU utilization for all the services.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713398099459/9accafd1-7f9d-47c1-b853-7d116f19258a.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713398189849/3f8baab8-e5b0-4366-9951-b002cf674324.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708901669178/b0d17cf0-08b6-4009-9f53-3f9015a1146c.png" alt class="image--center mx-auto" /></p>
<p><strong>AWS RDS PostgreSQL led the pack with an average of 2.7K TPS and 2.884 ms average latency. Azure Flexible Server PostgreSQL ranked second, closely trailing AWS RDS by just ~12%. It recorded an average of 2.4K TPS and an average latency of 3.260 ms.</strong> Supabase and GCP Cloud SQL PostgreSQL followed. Average CPU utilization across all the services was almost the same i.e. around 80%, except for Supabase. This could be because Supabase uses ARM processors compared to others who use x86.</p>
<h2 id="heading-batch-upload-and-download">Batch Upload and Download</h2>
<p>For batch uploads, we used the COPY command to insert 1GB and 5GB files from the client to PostgreSQL. For batch downloads, we executed a SELECT query that retrieved 1GB and 5GB of data from a table in PostgreSQL to the client. The graphs below illustrate how each service performed in these tests:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708902517962/f21a8286-c4dd-4fed-bae5-8cdb8f318f00.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708902572000/1347b1bd-6776-4015-bfcb-2fb44daf674a.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709248495715/ca3925ca-d942-4691-999c-2b1ba054994a.png" alt class="image--center mx-auto" /></p>
<p>In terms of batch upload with the COPY command, <strong>AWS RDS was again the leader, taking around 105s to ingest 5GB of data</strong>. GCP Cloud SQL was second with 113s. Azure Flexible Server and Supabase followed.</p>
<p>In terms of batch download using SELECT, the numbers were close across AWS, GCP, and Azure, <strong>with GCP slightly ahead, taking 51 seconds to download 5GB data</strong>. It was interesting to note that Supabase took longer than the others, requiring 160 seconds to download 5GB of data.</p>
<p>CPU utilization peaks during the COPY command were roughly consistent across AWS and GCP, at around 45-50%. Supabase was at approximately 57%. However, Azure peaked at 85%.</p>
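<p>For reference, the batch tests were of the following general shape, run from psql on the client VM using its client-side <code>\copy</code> (table and file names are illustrative):</p>
<pre><code class="lang-sql">-- batch upload: bulk load a local CSV file into the database
\copy public.batch_test FROM 'data_5gb.csv' WITH (FORMAT csv)

-- batch download: stream the table back to the client
\copy (SELECT * FROM public.batch_test) TO 'download_5gb.csv' WITH (FORMAT csv)
</code></pre>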
<h1 id="heading-costs"><strong>Costs</strong></h1>
<p>The table below captures costs across all 4 managed services for a Postgres database with 2 vCPUs, 8GB RAM and a 100GB disk. More details regarding the infra can be found in this <a target="_blank" href="https://docs.google.com/spreadsheets/d/1IjKBOT8R2QP065rx9G0F7RzLyoN0ZFT6LZbCn3Z1leI/edit?usp=sharing">sheet</a>.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>AWS</strong></td><td><strong>GCP</strong></td><td><strong>Azure</strong></td><td><strong>Supabase</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Costs per month</strong></td><td>$129.94</td><td>$116.70</td><td>$129.94</td><td>$113.00</td></tr>
<tr>
<td><strong>Disk Cost per month</strong></td><td>$11.50</td><td>N/A</td><td>$11.50</td><td>N/A</td></tr>
<tr>
<td><strong>Total Cost per month</strong></td><td><strong>$141.44</strong></td><td><strong>$116.70</strong></td><td><strong>$141.44</strong></td><td><strong>$113.00</strong></td></tr>
</tbody>
</table>
</div><p>Notably, Supabase is the most cost-effective of the managed services, at $113 per month. This <strong>could</strong> be because Supabase uses machines with ARM processors, which are more cost-effective than x64. GCP Cloud SQL comes in second at $116 per month. AWS RDS and Azure Flexible Server are tied at $141.44 per month.</p>
<h1 id="heading-database-features"><strong>Database Features</strong></h1>
<p>Postgres Managed Services typically support various important features for running production and enterprise-grade Postgres deployments. A few important features include:</p>
<p><strong>Availability and Reliability:</strong></p>
<ol>
<li><p>High Availability (HA) to minimize downtime during DB failures/crashes.</p>
</li>
<li><p>Backups / Point-In-Time-Recovery to handle Disaster Recovery (DR) scenarios</p>
</li>
<li><p>Cross region read replicas for enterprise-grade DR</p>
</li>
</ol>
<p><strong>Performance and functionality:</strong></p>
<ol>
<li><p>Out-of-the-box features to help with query performance tuning.</p>
</li>
<li><p>Read-replicas to segregate and scale read workloads</p>
</li>
<li><p>Out of the box connection pooling</p>
</li>
<li><p>Extensions to enhance Postgres functionality</p>
</li>
</ol>
<p><strong>Security and Compliance:</strong></p>
<ol>
<li><p>SOC2 and HIPAA</p>
</li>
<li><p>Private Access</p>
</li>
</ol>
<p>The table below compares each of the four managed services based on the above features:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Feature</strong></td><td><strong>AWS</strong></td><td><strong>GCP</strong></td><td><strong>Azure</strong></td><td><strong>Supabase</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>PITR</strong></td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
<tr>
<td><strong>HA</strong></td><td>Yes</td><td>Yes</td><td>Yes</td><td><a target="_blank" href="https://github.com/orgs/supabase/discussions/1504">Unclear</a></td></tr>
<tr>
<td><strong>HA across Availability Zones</strong></td><td><a target="_blank" href="https://aws.amazon.com/rds/features/multi-az/">Yes</a></td><td><a target="_blank" href="https://cloud.google.com/sql/docs/postgres/high-availability">Yes</a></td><td>Yes</td><td>No</td></tr>
<tr>
<td><strong>Cross region read replicas</strong></td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes (In early access)</td></tr>
<tr>
<td><strong>Availability SLA</strong></td><td><a target="_blank" href="https://aws.amazon.com/blogs/aws/rds-postgres-sla/">99.95</a></td><td><a target="_blank" href="https://cloud.google.com/sql/sla">99.95 with Enterprise, 99.99 with Enterprise Plus</a></td><td><a target="_blank" href="https://learn.microsoft.com/en-us/azure/reliability/reliability-postgresql-flexible-server#sla">99.95 within AZ, 99.99 with cross AZ HA deployments</a></td><td><a target="_blank" href="https://supabase.com/sla">99.9</a></td></tr>
<tr>
<td><strong>Performance Insights</strong></td><td><a target="_blank" href="https://aws.amazon.com/rds/performance-insights/">Yes</a></td><td><a target="_blank" href="https://cloud.google.com/sql/docs/postgres/using-query-insights">Yes</a></td><td><a target="_blank" href="https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-intelligent-tuning">Yes</a></td><td>Not out-of-the-box but through <a target="_blank" href="https://supabase.com/docs/guides/platform/performance">SQL queries</a></td></tr>
<tr>
<td><strong>Read replicas</strong></td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes (In early access)</td></tr>
<tr>
<td><strong>Connection Pooling</strong></td><td>Yes with <a target="_blank" href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy.html">RDS Proxy</a></td><td>No</td><td>Yes with <a target="_blank" href="https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-pgbouncer">PGBouncer</a></td><td>Yes with <a target="_blank" href="https://supabase.com/blog/supavisor-postgres-connection-pooler">Supavisor</a></td></tr>
<tr>
<td><strong>Number of Extensions</strong></td><td><a target="_blank" href="https://gist.github.com/saisrirampur/238f9b886f5543f639dea21a4c37abb7">92</a>, <a target="_blank" href="https://docs.aws.amazon.com/AmazonRDS/latest/PostgreSQLReleaseNotes/postgresql-extensions.html">Official Docs</a></td><td><a target="_blank" href="https://gist.github.com/saisrirampur/b15d6f9f3c6fb4bdc0adbe3cd42e3a16">74</a>, <a target="_blank" href="https://cloud.google.com/sql/docs/postgres/extensions">Official Docs</a></td><td><a target="_blank" href="https://gist.github.com/saisrirampur/3b705668c8a86386ac10ca4380ac0613">75</a>, <a target="_blank" href="https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-extensions#extension-versions">Official Docs</a></td><td><a target="_blank" href="https://gist.github.com/saisrirampur/06462d9dc9cfba122a0179d9145e5033">81</a>, <a target="_blank" href="https://supabase.com/docs/guides/database/extensions">Official Docs</a></td></tr>
<tr>
<td><strong>Private Access</strong></td><td><a target="_blank" href="https://aws.amazon.com/blogs/database/access-amazon-rds-across-vpcs-using-aws-privatelink-and-network-load-balancer/">Yes</a></td><td><a target="_blank" href="https://cloud.google.com/sql/docs/postgres/configure-private-services-access">Yes</a></td><td><a target="_blank" href="https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-networking-private-link">Yes</a></td><td>No</td></tr>
<tr>
<td><strong>SOC2</strong></td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
<tr>
<td><strong>HIPAA</strong></td><td>Yes</td><td><a target="_blank" href="https://cloud.google.com/security/compliance/hipaa#overview">Yes</a></td><td>Yes</td><td>Yes</td></tr>
</tbody>
</table>
</div><h1 id="heading-conclusion">Conclusion</h1>
<p>Below is a summary of the results from the analyses conducted across the four managed services.</p>
<ol>
<li><p>AWS RDS Postgres was the most mature Postgres offering of all the managed services.</p>
<ol>
<li><p>Performance-wise, it surpassed Azure by just 12% and exceeded the others by over 45% in pgbench throughput and latency.</p>
</li>
<li><p>Feature-wise, it supports almost all of them in the Availability and Reliability, Performance, and Security and Compliance categories.</p>
</li>
<li><p>It supports the highest number of extensions, i.e., 92 of them.</p>
</li>
</ol>
</li>
<li><p>Azure Flexible Server takes second place in performance. It was very close to AWS, being only about 12% lower in performance. It matches AWS RDS Postgres in terms of features.</p>
</li>
<li><p>Managed services across all three clouds offer robust support for features related to Availability &amp; Reliability and Security &amp; Compliance, which are important for enterprise-grade workloads.</p>
</li>
<li><p>Supabase and GCP Cloud SQL Postgres are the most cost-effective of all the managed services.</p>
</li>
<li><p>Special mention to Supabase for supporting <a target="_blank" href="https://supabase.com/docs/guides/getting-started/features">features</a> that make the lives of app developers incredibly easy.</p>
</li>
</ol>
<p>Hope you enjoyed reading this blog. In future blogs we will add a few other managed services to this comparison and aim to go deeper in a few categories such as Performance.</p>
<h1 id="heading-references">References</h1>
<p><a target="_blank" href="https://docs.google.com/spreadsheets/d/1IjKBOT8R2QP065rx9G0F7RzLyoN0ZFT6LZbCn3Z1leI/edit?usp=sharing">Excel sheet capturing all our raw analysis to come up with this blog</a></p>
<p><strong>NOTE:</strong> The blog was updated on April 17, 2024. The primary modification involved changing the Azure Flexible Server VM type from AMD (Standard_D2ads_v5) to Intel (Standard_D2s_v5). This change can be easily configured through radio buttons while provisioning and is set as the default across various regions. Therefore, we deemed it a fair modification in the comparison.</p>
]]></content:encoded></item><item><title><![CDATA[Moving a Billion Postgres Rows on a $100 Budget]]></title><description><![CDATA[Inspired by the 1BR Challenge, I wanted to see how much it would cost to transfer 1 billion rows from Postgres to Snowflake. Moving 1 billion rows is no easy task. The process involves not just the transfer of data but ensuring its integrity, error r...]]></description><link>https://blog.peerdb.io/moving-a-billion-postgres-rows-on-a-100-budget</link><guid isPermaLink="true">https://blog.peerdb.io/moving-a-billion-postgres-rows-on-a-100-budget</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[snowflake]]></category><category><![CDATA[replication]]></category><category><![CDATA[ETL]]></category><dc:creator><![CDATA[Kaushik Iska]]></dc:creator><pubDate>Wed, 21 Feb 2024 19:20:32 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1708528685112/bbd6936f-31e8-4f28-bca7-957eb2bf0e4e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Inspired by the <a target="_blank" href="https://github.com/gunnarmorling/1brc">1BR Challenge</a>, I wanted to see how much it would cost to transfer 1 billion rows from Postgres to Snowflake. Moving 1 billion rows is no easy task. The process involves not just the transfer of data but ensuring its integrity, error recovery and consistency post-migration.</p>
<p>Central to this task is the selection of tools and techniques. We will discuss the use of open-source tools, customized scripts, ways to read data from Postgres, and Snowflake’s data loading capabilities. Key aspects like parallel processing, efficiently reading Postgres’ <a target="_blank" href="https://www.postgresql.org/docs/current/wal-intro.html">WAL</a>, data compression and incremental batch loading on Snowflake will be highlighted.</p>
<p>I will list and discuss some of the optimizations implemented to minimize compute, network, and warehouse costs. Additionally, I will highlight some of the trade-offs made as part of this process. Since most of the approaches covered in this blog stem from my explorations at <a target="_blank" href="https://github.com/PeerDB-io/peerdb">PeerDB</a> aimed at enhancing our product, the task was accomplished primarily through <a target="_blank" href="https://www.peerdb.io/">PeerDB</a>.</p>
<p>I want to make it clear that there are some feature gaps in comparison to a mature system, and it might not be practical for all use cases. However, it does handle the most common use cases effectively while significantly reducing costs. I also want to caveat that the estimations may be off in some ways, and I’d be happy to adjust based on feedback.</p>
<h1 id="heading-setup">Setup</h1>
<ul>
<li><p><strong>Initial data load:</strong> We will consider that there are 300M rows already in the table at the start of the task, and our system should handle the initial load of all the rows.</p>
</li>
<li><p><strong>Inserts, Updates and Deletes (Change Data Capture):</strong> The rest of the 700M rows will be a combination of inserts, updates and deletes. <a target="_blank" href="https://wiki.postgresql.org/wiki/TOAST">Including support for toast columns</a>.</p>
<ul>
<li>1024 rows changed per second for ~8 days.</li>
</ul>
</li>
<li><p><strong>Recoverability:</strong> We will reboot the system every 30 mins to ensure that it's robust and can recover from disasters.</p>
</li>
</ul>
<p>Now let us walk through an engineering design that optimally handles the above workload with the objective of <strong>minimizing costs</strong> and <strong>improving performance</strong>, one step at a time.</p>
<h1 id="heading-initial-load-from-postgres-to-snowflake">Initial Load from Postgres to Snowflake</h1>
<p>Let’s start with the first operation any data sync job has to do: load the initial set of data from the source to destination. There are a few challenges that come with this:</p>
<ol>
<li><p>How to efficiently retrieve large amounts of data from Postgres?</p>
</li>
<li><p>How to process the data with a minimal cost footprint?</p>
</li>
<li><p>How to efficiently load this data to Snowflake?</p>
</li>
</ol>
<h2 id="heading-optimal-data-retrieval-from-postgres">Optimal Data retrieval from Postgres</h2>
<p>Reading a table sequentially from Postgres is slow. It would take a long time to read 300M rows from Postgres. To make this process more efficient, <a target="_blank" href="https://duckdb.org/2022/09/30/postgres-scanner.html#parallelization">we have to parallelize</a>. We've got a clever way to quickly read parts of a table in Postgres using something called the TID Scan, which is a bit of a hidden gem. Basically, it lets us pick out specific chunks of data as stored on disk, identified by their <a target="_blank" href="https://www.postgresql.org/docs/current/ddl-system-columns.html#id-1.5.4.7.4.6.2.1">Tuple IDs</a> (CTIDs), which look like <code>(page, tuple)</code>. This optimizes IO utilization and is super handy for reading big tables efficiently.</p>
<p>Here’s how we do it: we divide the table into partitions based on the pages of the database, and each partition gets its own scan task handling about 500K rows. In other words, we partition the table into CTID ranges of roughly 500K rows each and process the partitions in parallel (16 at a time).</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">FROM</span> public.challenge_1br; <span class="hljs-comment">-- find the count</span>

<span class="hljs-comment">-- num_partitions = (count // rows_per_partition)</span>

<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">bucket</span>, <span class="hljs-keyword">MIN</span>(ctid) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">start</span>, <span class="hljs-keyword">MAX</span>(ctid) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">end</span>
<span class="hljs-keyword">FROM</span> (
    <span class="hljs-keyword">SELECT</span> NTILE(<span class="hljs-number">1000</span>) <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> ctid) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">bucket</span>, ctid 
  <span class="hljs-keyword">FROM</span> public.challenge_1br
) subquery
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">bucket</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">start</span>;
</code></pre>
<p><img src="https://lh7-us.googleusercontent.com/qo41LQQwhVKZT9mWROXxYWr-eKYUu2_EcJ9Elcn49Mfk-vpuIBvz54sBmxWr7W2Z0quqiujKPkQWA3omYn_VGaYf8MWJDVNx4EzcGYFWa4ofE-zMfU9k6U76ZcBsZe5A4o0Tkf3p978w9bpqnN_3MkI" alt /></p>
<h2 id="heading-data-in-transit">Data in Transit</h2>
<p>It is important to process the data in a way that doesn’t overload the system. As we are operating under budget constraints, we need to use techniques that use the hardware effectively. We are going to be using the “<a target="_blank" href="https://twitter.com/garybernhardt/status/600783770925420546?s=20">your dataset fits in RAM</a>” paradigm of systems design. 300M rows for initial load does sound like a lot, but let’s see how we can make it fit in our RAM. We need to process the data to ensure <a target="_blank" href="https://blog.peerdb.io/role-of-data-type-mapping-in-database-replication">data types are mapped correctly to the destination</a>. We are going to convert the query results to <a target="_blank" href="https://avro.apache.org/docs/1.11.0/index.html">Avro</a> for faster loading into warehouses, and also <a target="_blank" href="https://avro.apache.org/docs/1.11.0/spec.html#Logical+Types">for its logical type support</a>.</p>
<h3 id="heading-how-big-is-the-data">How big is the data?</h3>
<p>Let us take a little detour to explore how big the data is. This is a good chance to look at some real-world examples to estimate things. Based on interacting with a lot of production customers, and talking to some experts, it’s safe to say that on average we see ~15 columns per table. In our table, let’s say each row is ~512 bytes.</p>
<pre><code class="lang-python"><span class="hljs-comment"># for initial load</span>
num_rows = <span class="hljs-number">300</span>_000_000
bytes_per_row = <span class="hljs-number">512</span>
total_num_bytes = num_rows * bytes_per_row
total_size_gb = total_num_bytes / <span class="hljs-number">1</span>_000_000_000
<span class="hljs-comment"># total initial load size 153.6 GB</span>

<span class="hljs-comment"># memory required during initial load</span>

num_rows_per_partition = <span class="hljs-number">500</span>_000
mb_per_partition = num_rows_per_partition * bytes_per_row / <span class="hljs-number">1</span>_000_000 <span class="hljs-comment"># 256 MB</span>
num_partitions_in_parallel = <span class="hljs-number">16</span>
required_memory = num_partitions_in_parallel * mb_per_partition <span class="hljs-comment"># 4096 MB</span>
</code></pre>
<h3 id="heading-required-memory">Required Memory</h3>
<p>Based on the above napkin math, we can see that with 4GB of RAM we should be able to do the initial load. We will allocate 8GB of RAM to account for other components.</p>
<h2 id="heading-efficiently-loading-data-into-snowflake">Efficiently loading data into Snowflake</h2>
<p>As mentioned earlier, we are going to store the query results as Avro on disk. We will additionally compress the Avro files using <a target="_blank" href="https://github.com/facebook/zstd">zstd</a> to reduce the disk footprint and save on network costs; a small sketch of that step follows. After that, let's take a brief detour to talk about bandwidth costs.</p>
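<p>A minimal sketch of the compression step, assuming the Python zstandard bindings; file paths and the compression level are placeholders rather than anything prescribed above:</p>
<pre><code class="lang-python"># Sketch: compress a written Avro partition with zstd before shipping it.
import zstandard as zstd

def compress_partition(avro_path):
    cctx = zstd.ZstdCompressor(level=3)  # a reasonable speed/ratio trade-off
    with open(avro_path, "rb") as src, open(avro_path + ".zst", "wb") as dst:
        cctx.copy_stream(src, dst)
</code></pre>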
<h3 id="heading-bandwidth-costs-they-can-break-the-bank">Bandwidth costs: They can break the bank!</h3>
<p>Let's look at the network costs across the major cloud providers; the variance in the numbers is notable.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Cost per 10GB (egress)</strong></td><td><strong>AWS</strong></td><td><strong>GCP</strong></td><td><strong>Azure</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Within same AZ</strong></td><td>Free</td><td>Free</td><td>Free</td></tr>
<tr>
<td><strong>Within same region (different AZ)</strong></td><td>$0.1</td><td>$0.1</td><td>$0.1</td></tr>
<tr>
<td><strong>Across Regions</strong></td><td>$0.1 - $0.2 (<a target="_blank" href="https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer">Depends on Destination</a>)</td><td>$0.2 - $1.4 (<a target="_blank" href="https://cloud.google.com/vpc/network-pricing#inter-region-data-transfer">Depends on source+destination</a>)</td><td>$0.2 - $1.6 (<a target="_blank" href="https://azure.microsoft.com/en-in/pricing/details/bandwidth/">Depends on region + intra/inter-continental</a>)</td></tr>
<tr>
<td><strong>To Internet</strong></td><td>$0.9 - $0.5 (10TB - 150TB)</td><td>$0.8 - $2.3 (<a target="_blank" href="https://cloud.google.com/vpc/network-pricing#internet_egress">Premium tier - Depends on Source+Destination</a>)</td><td>$1.81 - $0.5 (<a target="_blank" href="https://azure.microsoft.com/en-in/pricing/details/bandwidth/">MS Premium NW - Depends on source + usage</a>)</td></tr>
</tbody>
</table>
</div><p>It’s interesting to see the variance in the costs, so it’s best to have Postgres, our system and Snowflake in the same cloud provider and the same region. Let’s now calculate the network costs for this workload.</p>
<h3 id="heading-calculating-network-costs">Calculating Network Costs</h3>
<p>Another thing to be wary of is the warehouse configuration; we will come back to that right after the network math.</p>
<pre><code class="lang-python">bytes_per_row = 512
num_rows = 1_000_000_000
total_data_size_gb = 512
compressed_data_size_gb = 256  # avro + zstd gives at least 2x compression
bandwidth_cost_per_10gb = 0.1  # USD, same-region transfer

# total network costs
# total_data_size_gb * bandwidth_cost_per_10gb / 10, rounded down
network_costs_egress_from_postgres = 5.00  # USD
# compressed_data_size_gb * bandwidth_cost_per_10gb / 10
network_costs_egress_from_system_to_snowflake = 2.56  # USD

network_costs = 7.56  # USD
</code></pre>
<h3 id="heading-snowflake-warehouse-configuration">Snowflake Warehouse Configuration</h3>
<p>In many organizations, a significant portion of Snowflake expenses comes from compute usage, particularly when warehouses sit idle between tasks. Snowflake accrues compute costs based on warehouse operational time, from activation to suspension, and idle warehouse time can often contribute 10%-25% of the total Snowflake compute costs. The Baselit team wrote an excellent blog about this: <a target="_blank" href="https://baselit.ai/blogs/fastest-way-save-snowflake">read more about it here</a>.</p>
<p>We will do two things: set <code>AUTO_SUSPEND</code> to 60 seconds, so the warehouse idles for at most a minute after the last query before pausing, and keep the warehouse active for as little time as possible. This is the default configuration you get if you follow the <a target="_blank" href="https://docs.peerdb.io/connect/snowflake">PeerDB Snowflake setup guide</a>.</p>
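<p>For reference, here is a hedged sketch of applying that setting with the Snowflake Python connector; the account, credentials, role and warehouse name are placeholders, not values from the setup guide.</p>
<pre><code class="lang-python"># Sketch: cap idle time on the loading warehouse at 60 seconds.
# Account, credentials and warehouse name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    role="ACCOUNTADMIN",
)
conn.cursor().execute(
    "ALTER WAREHOUSE peerdb_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE"
)
</code></pre>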
<h2 id="heading-inserts-updates-and-deletes">Inserts, Updates and Deletes</h2>
<p>The next challenge after the initial load is to read the change data from Postgres and replay it to Snowflake. We are going to do that using Postgres’ logical replication: at the start of replication, we create a replication slot with the <a target="_blank" href="https://www.postgresql.org/docs/current/logical-replication-architecture.html">pgoutput</a> plugin, which is the recommended way to read changes from the slot. Once we read the changes from the slot, we batch them and then load them to Snowflake.</p>
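<p>As a sketch of that setup step (the slot and publication names are placeholders, and the source is assumed to have <code>wal_level = logical</code>), creating the publication and the pgoutput slot looks roughly like this:</p>
<pre><code class="lang-python"># Sketch: create a publication and a pgoutput replication slot for CDC.
# Slot and publication names are placeholders.
import psycopg2

DSN = "postgresql://user:password@source-host:5432/db"  # placeholder

conn = psycopg2.connect(DSN)
conn.autocommit = True  # slot creation cannot run in a transaction that has written
cur = conn.cursor()
cur.execute("CREATE PUBLICATION peerdb_pub FOR TABLE public.challenge_1br")
cur.execute(
    "SELECT pg_create_logical_replication_slot(%s, %s)",
    ("peerdb_slot", "pgoutput"),
)
</code></pre>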
<p>As we discussed earlier, it is important to keep the Snowflake warehouse suspended for as long as we can, and batching helps with that. We accumulate records in batches of 1M, write them to Avro as before, and load them to an <a target="_blank" href="https://docs.snowflake.com/en/user-guide/data-load-local-file-system-create-stage">internal stage</a> in Snowflake. Once the data is loaded into the stage, we <a target="_blank" href="https://docs.snowflake.com/en/sql-reference/sql/merge">MERGE</a> the records from the stage into the destination table. This way most of the heavy lifting of conflict resolution is left to the warehouse, which simplifies our system.</p>
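<p>A rough sketch of one batch cycle is below. The stage, named file format, and key/column names are hypothetical, and a real pipeline would also handle deletes (for example a <code>WHEN MATCHED ... THEN DELETE</code> branch) and the full column list of the table.</p>
<pre><code class="lang-python"># Sketch: push one Avro batch to an internal stage, then MERGE it into the target.
# Stage name, file format and columns are placeholders.
MERGE_SQL = """
MERGE INTO public.challenge_1br AS dst
USING (
    SELECT $1:id::bigint AS id, $1:amount::double AS amount
    FROM @peerdb_stage/batch_0001.avro (FILE_FORMAT => 'peerdb_avro')
) AS src
ON dst.id = src.id
WHEN MATCHED THEN UPDATE SET dst.amount = src.amount
WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (src.id, src.amount)
"""

def load_batch(conn, local_path):
    # conn is a snowflake.connector connection; the PUT uploads the local file
    cur = conn.cursor()
    cur.execute(f"PUT file://{local_path} @peerdb_stage AUTO_COMPRESS=FALSE")
    cur.execute(MERGE_SQL)
</code></pre>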
<h2 id="heading-tools">Tools</h2>
<p>At <a target="_blank" href="https://www.peerdb.io/">PeerDB</a>, we are building a specialized data-movement tool for Postgres with a laser focus on Postgres to Data Warehouse replication. Most of the above optimizations, including <a target="_blank" href="https://blog.peerdb.io/parallelized-initial-load-for-cdc-based-streaming-from-postgres">parallel initial load</a>, <a target="_blank" href="https://blog.peerdb.io/reducing-bigquery-costs-by-260x">reducing Data Warehouse costs</a>, <a target="_blank" href="https://blog.peerdb.io/role-of-data-type-mapping-in-database-replication">native data-type mapping</a>, <a target="_blank" href="https://github.com/PeerDB-io/peerdb/pull/111">support for TOAST columns</a>, and <a target="_blank" href="https://blog.peerdb.io/using-temporal-to-scale-data-synchronization-at-peerdb">fault-tolerance and auto recovery</a>, are already baked into the product. PeerDB is also <a target="_blank" href="https://github.com/PeerDB-io/peerdb">Free and Open</a>, so we chose PeerDB to implement the above workload.</p>
<h2 id="heading-hardware">Hardware</h2>
<p>Now that we have landed on 8GB of RAM, let us move on to picking the instance type.</p>
<p>Since ARM uses less energy than x64 (due to being RISC), ARM instances are around 25% cheaper than x64 machines. The tradeoff is clock speed: x64 machines run at around 2.9GHz with a 3.5GHz turbo (M6i instances), while ARM machines run at about 2.5GHz (Graviton2 - M6g), and M6i instances are about 30% more expensive than M6g instances.</p>
<p>The effective cost is $0.0409/GHz for x64 vs $0.03616/GHz for ARM, so x64 costs about 13% more per GHz. <strong>But cost per GHz is not the determining factor for CDC reads from Postgres, since a replication slot can only be read by a single process at a time.</strong></p>
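<p>In napkin form, using the per-GHz figures above:</p>
<pre><code class="lang-python"># Per-GHz premium of x64 over ARM, from the figures quoted above
x64_cost_per_ghz = 0.0409   # $/GHz
arm_cost_per_ghz = 0.03616  # $/GHz
premium = x64_cost_per_ghz / arm_cost_per_ghz - 1  # ~0.13, i.e. about 13% more per GHz
</code></pre>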
<p>For this current experiment, I went with <code>m6gd.large</code> as it offers a good balance of speed and disk.</p>
<p><strong>Optional read:</strong> In this blog we use AWS for our analysis. However, here are some other learnings we had on this topic. OVH Cloud currently <a target="_blank" href="https://github.com/ovh/public-cloud-roadmap/issues/343">does not support ARM</a> instances and has a similar $0.118/hour <code>c2-7</code> instance (in limited regions), which has a <a target="_blank" href="https://www.ovhcloud.com/asia/public-cloud/prices/#410">very low network speed</a> (250MBps) and 50GB of SSD. <a target="_blank" href="https://www.hetzner.com/cloud/">Hetzner</a> has a <code>CCX13</code> instance at $0.0292/hour (including a 118GB SSD) but no dedicated ARM instances.</p>
<p><img src="https://lh7-us.googleusercontent.com/Mmy4pyDQ_qnBrX7PJH830MLgSj0s4u1UOZOzMhlr32YrA8ewQzEMOCkk6e3bXkgeVYfwVYrLt_Ofs6YENVIos5N_daFgBp2t6KZ59n9X2EWvwJWmVsCmAU82yyHzAxQuc0BX7ccGmpZqpdsEszTO4lU" alt /></p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>One question I'm often asked is: <strong>“Is this practical?”</strong> Yes, one machine can die, but systems built on a single machine have a <a target="_blank" href="https://twitter.com/danluu/status/1586180166631706624?s=20">remarkable amount of uptime</a>, especially when the state is stored in a durable way.</p>
<p>Back to the topic at hand. If we look at the total cost of the system we built (assuming <code>us-west-2</code> as the region), this is the breakdown over a month:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Cost Category</td><td>Cost</td><td>Comment</td></tr>
</thead>
<tbody>
<tr>
<td>Hardware</td><td>$65.992 / month</td><td>AWS m6gd.large (2 vCPUs, 8 GB RAM); comes with 118 GB NVMe which is great!</td></tr>
<tr>
<td>Network</td><td>$7.56</td><td>AWS network transfer same region 500 GB (with compression)</td></tr>
<tr>
<td>Warehouse</td><td>N/A</td><td>Warehouse compute costs are common across vendors, so they are excluded here</td></tr>
<tr>
<td><strong>Total</strong></td><td><strong>$73.552</strong></td><td><strong>Hardware Costs + Network costs = $65.992 + $7.56 = $73.552 (Within $100 budget)</strong></td></tr>
</tbody>
</table>
</div><p>If we look at what various ETL tools charge for moving 1 billion rows, here is how it compares:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Vendor</strong></td><td><strong>Cost per 1 billion records</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Fivetran</td><td>$23,157.89</td></tr>
<tr>
<td>Airbyte</td><td>$11,760.00</td></tr>
<tr>
<td>Stitch Data</td><td>$4,166.67</td></tr>
<tr>
<td>Above Approach (using <a target="_blank" href="https://github.com/PeerDB-io/peerdb">PeerDB OSS</a>)</td><td>$73.552</td></tr>
</tbody>
</table>
</div><p>I am part of a company building software for moving data specifically from Postgres to data warehouses, and it's my job to figure out how to provide the best experience to our customers. Doing this project forced me to find the best bang for the buck, and to fold a lot of the explored features <a target="_blank" href="https://www.peerdb.io/">into PeerDB</a>. I hope it conveys some appreciation for what modern hardware is capable of, and how much you can get out of it.</p>
]]></content:encoded></item><item><title><![CDATA[PeerDB UI - Deeper Dive: Part 1]]></title><description><![CDATA[At PeerDB, we are building a fast and cost-effective way to replicate data from Postgres to Data Warehouses such as BigQuery, Snowflake and ClickHouse.
When building PeerDB UI, we wanted it to be minimal but effective. Features were driven by what th...]]></description><link>https://blog.peerdb.io/peerdb-ui-deeper-dive-part-1</link><guid isPermaLink="true">https://blog.peerdb.io/peerdb-ui-deeper-dive-part-1</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Design]]></category><category><![CDATA[replication]]></category><category><![CDATA[cdc]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[snowflake]]></category><category><![CDATA[bigquery]]></category><category><![CDATA[PeerDB]]></category><dc:creator><![CDATA[Kaushik Iska]]></dc:creator><pubDate>Fri, 16 Feb 2024 17:44:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1708105260023/49420afa-9c62-42fc-a2ba-ae56037f0b56.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At PeerDB, we are building a fast and cost-effective way to replicate data from Postgres to Data Warehouses such as BigQuery, Snowflake and ClickHouse.</p>
<p>When building PeerDB UI, we wanted it to be minimal but effective. Features were driven by what the customers really needed, while keeping the bloat low. For this article, I've asked the team to share their favorite part about the UI.</p>
<h2 id="heading-replication-slot-growth-chart">Replication Slot Growth Chart</h2>
<p>This chart shows the size of the replication slot in GB over time.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708012762932/c9bfcbf5-8d28-4828-bd69-a0f99c3b60c9.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-activity-monitor">Activity Monitor</h2>
<p>This view captures all the activity and connections open for the database.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708013023555/6d0d1b89-48ce-4f6a-b8f9-2952c8115f6f.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-slack-alerts">Slack Alerts</h2>
<p>Alerting configuration for Slack. Staying true to our minimal roots, we simply show the configured alerts channel.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708013242662/5ff290b3-8506-4da0-8d77-8224d4acbfbf.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-mirror-rows-over-time">Mirror Rows Over Time</h2>
<p>A simple histogram showing the number of rows synced over time.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708013503552/efeede23-46f9-499e-8e9a-cc43cfb5929b.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-timezone-selector">Timezone Selector</h2>
<p>A simple touch that lets you pick the timezone.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708013553856/f8b065b8-2886-4158-9b8c-bc095383a744.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>This is just a glimpse of the PeerDB UI. We covered some of the features that make it unique and helpful, and we hope to add many more.</p>
<p>We hope you enjoyed reading the blog. If you're a Postgres user and wish to replicate data from Postgres to Snowflake/BigQuery/ClickHouse using PeerDB, please check out the links below or reach out to us directly!</p>
<ol>
<li><p><a target="_blank" href="https://app.peerdb.cloud/"><strong>Try PeerDB Cloud for free.</strong></a></p>
</li>
<li><p><a target="_blank" href="https://github.com/PeerDB-io/peerdb"><strong>Visit PeerDB's GitHub repository to Get Started.</strong></a></p>
</li>
</ol>
]]></content:encoded></item></channel></rss>