Seamless and Scalable Streaming of Query Results with PeerDB
When do you stream query results from one data store to another?
Streaming query results from one data store to another is common for many real-world use cases. Below are a few of them:
Continuously stream your transformed, filtered, and cleaned transactional data from PostgreSQL to the data warehouse for analytics.
Sync customer-specific data from a multi-tenanted/centralized data store to the customer’s own data store.
Stream pre-aggregated data from the data warehouse to PostgreSQL for highly concurrent, low-latency dashboards.
Why is it hard?
Building a scalable data pipeline to continuously stream query results across data stores involves multiple challenges:
Complex data-type mapping: You need to handle data-type mapping across data stores, which can become intricate when dealing with advanced data types like arrays, bytes, and vectors.
Reliable synchronization tracking: Ensuring accurate tracking of synced and unsynced data becomes crucial to maintain data consistency.
Fault-tolerant pipeline: Building a pipeline that can handle failures and recover seamlessly is not easy.
Data freshness: You need to ensure that the pipeline is as performant as possible, since pipeline performance dictates the freshness of data for downstream applications.
Resource management: You need to explicitly manage resources on the source and target to ensure that your pipeline doesn’t affect other concurrent workloads.
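To make the synchronization-tracking challenge concrete, one common approach is to track a watermark column and sync only the rows past the last-synced value. This is a generic sketch, not PeerDB's internals; the `events` table and `updated_at` column are hypothetical names:

```sql
-- Illustrative sketch of watermark-based sync tracking.
-- 'events' and 'updated_at' are hypothetical names.
-- 1. The pipeline persists the last-synced watermark, e.g. '2023-07-01 00:00:00'.
-- 2. Each sync pulls only rows newer than that watermark:
SELECT *
FROM events
WHERE updated_at > '2023-07-01 00:00:00'  -- last synced watermark
ORDER BY updated_at;
-- 3. Only after the batch is durably written to the target does the
--    stored watermark advance to max(updated_at) of the synced batch,
--    so a failed sync is simply retried from the old watermark.
```

Getting this right in the presence of failures, retries, and concurrent writes on the source is a large part of why building such pipelines in-house is hard.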
Limitations with current tools and approaches
Lack of customizable and specialized ETL tools: Most ETL tools lack the customizability to stream results based on an arbitrary SELECT query (with JOINs, GROUP BYs, filters, etc.). They are also generalized, supporting a ton of connectors, rather than providing a reliable and scalable experience for any two connectors.
Businesses spend months of resources building in-house pipelines: Orgs typically build in-house data pipelines using Airflow and Python scripts, spending multiple months of effort and significant dev resources to build and maintain them.
PeerDB - Super Easy, Blazing Fast, and Highly Customizable
PeerDB makes it super easy to continuously stream query results from one data store to another. Below are a few benefits you get by using PeerDB:
Save multiple months of engineering resources: You simply run a few SQL commands, and PeerDB takes care of all the heavy lifting to set up and maintain highly performant and reliable data pipelines across stores. Instead of spending months building pipelines yourself, the effort shrinks to a few days.
Blazing Fast: PeerDB internally implements multiple optimizations to deliver the best possible performance. For example, it converts data to Avro format during transit and parallelizes both reads from the source and writes to the target. This lets you make decisions on fresh data.
Highly Customizable: You can run any SELECT query that PostgreSQL supports for the transformation, including JOINs, function/procedure calls, GROUP BYs, and so on.
The SQL command provides various options such as batch size, parallelism, and refresh interval, which give you granular control when configuring the pipeline.
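As a sketch of what such a pipeline definition can look like, here is a hedged example of a query-streaming MIRROR command. The peer, table, and column names are placeholders, and the exact option names may differ across PeerDB versions; see the PeerDB docs for the authoritative syntax:

```sql
-- Hypothetical example: stream a transformed query result from a
-- PostgreSQL peer to a Snowflake peer. All names are placeholders.
CREATE MIRROR transformed_orders_mirror
FROM postgres_peer TO snowflake_peer FOR
$$
  SELECT o.customer_id,
         sum(o.amount)     AS total_amount,
         max(o.updated_at) AS updated_at
  FROM orders o
  JOIN customers c ON c.id = o.customer_id
  WHERE o.updated_at BETWEEN {{.start}} AND {{.end}}
  GROUP BY o.customer_id
$$
WITH (
  destination_table_name = 'public.transformed_orders',
  watermark_table_name   = 'public.orders',
  watermark_column       = 'updated_at',   -- tracks synced vs. unsynced rows
  mode                   = 'append',
  parallelism            = 4,              -- concurrent partition reads/writes
  refresh_interval       = 30,             -- seconds between syncs
  num_rows_per_partition = 100000          -- batch size per partition
);
```

The WITH clause is where the batch size, parallelism, and refresh interval mentioned above are tuned per pipeline.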
Detailed documentation on setting up real-time streaming of query results from PostgreSQL is available in the PeerDB docs.
Supported sources include PostgreSQL; supported targets include Snowflake, BigQuery, S3, and PostgreSQL.
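Each data store involved in a pipeline is first registered as a peer via SQL. The following is a hedged sketch of creating a PostgreSQL peer; the connection values are placeholders, and option names may vary by connector (check the PeerDB docs):

```sql
-- Hypothetical connection values; replace with your own.
CREATE PEER postgres_peer FROM POSTGRES WITH (
  host     = 'localhost',
  port     = 5432,
  user     = 'postgres',
  password = 'postgres',
  database = 'mydb'
);
```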
If you want support for other data stores, join our Slack channel and shoot us a request, or send an email to firstname.lastname@example.org. Our engineering team operates fast and has built connectors in just a couple of days!
Hope you enjoyed reading the blog. Check out our GitHub repo and get started testing PeerDB.