# Reducing BigQuery Costs by 260x

In this blog post, we'll do a deep-dive into a simple trick that can reduce BigQuery costs by orders of magnitude. Specifically, we'll explore how [**clustering**](https://cloud.google.com/bigquery/docs/clustered-tables) (similar to indexing in BigQuery world) large tables can significantly impact costs. We will walk through an example use-case for a common query pattern ([MERGE](https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement)), where clustering reduces the amount of data processed by BigQuery from 10GB to 37MB, resulting in a cost reduction of ~260X.

## Setup

The setup involves a common Data Warehousing scenario of transforming raw data from multiple sources into a consumable format through the MERGE command. Now, let's delve into the data model:

1. There are two groups of tables, each containing two tables. One group of tables is clustered based on the `id` column, while the tables in the other group are not clustered.
    
2. One of tables represents staging/raw data and the other represents the final data.
    
3. The staging table is seeded with 10 million rows and the final table is seeded with 200 million rows.
    
4. The `MERGE` command is used to merge a subset of data (~10K rows filtered on `id`) in the staging table to the final table.
    

### Creating Tables and Seeding Data

Here goes the [script](https://gist.github.com/saisrirampur/22b3399077cf8490301039aa5ca7ab94) that we used to create and seed tables.

### MERGE command on Clustered and Unclustered tables

```sql
/*
<final_table> and <staging_table> 
can be either both clustered or unclustered
*/
MERGE sai_tests.<final_table> AS final
USING (
    SELECT * FROM sai_tests.<staging_table>
    -- filtering 9999 rows from staging table
    WHERE id > 500000 AND id < 510000 
) AS staging
ON final.id = staging.id -- join on id column
WHEN MATCHED  THEN
    UPDATE SET final.column_string = staging.column_string, 
    final.column_int = staging.column_int, 
    final.column_float = staging.column_float, 
    final.column_timestamp = staging.column_timestamp
WHEN NOT MATCHED  THEN
    INSERT (id, column_string, column_int, column_float, column_timestamp)
    VALUES (staging.id, staging.column_string, staging.column_int, 
    staging.column_float, staging.column_timestamp);
```

## Analyzing the MERGE command on the Unclustered tables

Below is the execution graph of the MERGE command on the unclustered tables:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707094507323/3ad7d497-77d2-4092-8811-916737fc3986.png align="center")

**Now let us analyze important step of the execution graph:**

1. In the first step (S00), BigQuery infers the join across both tables and pushes the `id` filter to the final table. Despite pushing the filter down to the final table, BigQuery still reads all 200 million records to retrieve the approximately 10K filtered records. This is due to the absence of clustering on the 'id' column in the final table.
    
    1. ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707094752871/bda203c8-ab63-42df-878d-fa8ea617aaff.png align="center")
        
2. Now, let us analyze the JOIN (S02). As a part of the JOIN, BigQuery scans the entire 10 million records in the staging table, even though there is a WHERE clause on the `id` that filters approximately 10K records. This is due to the absence of clustering on the `id` column in the staging table.
    
    ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707094806579/abb6ad06-a639-4c16-9673-d1867f3d0ac0.png align="right")
    
3. If you observe, lack of CLUSTERING made BigQuery generate suboptimal plan processing 9.69GB of data costing 4.7 cents.
    
    1. ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707095086689/0f00e633-3e7a-4d7d-9b0a-f7ddeafb38fa.png align="center")
        
        ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707095414333/45438275-b8b1-4067-a098-79ca3b692fd4.png align="center")
        
        ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707095157716/7492fd44-41f7-43b3-a4f8-c8d7668de3c2.png align="center")
        

## Analyzing the MERGE command on the Unclustered tables

Below is the execution graph of the MERGE command on the clustered tables:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707093206188/e8776677-56ab-43d0-9781-c9e90402ccc4.png align="center")

**Now let us analyze each step of the execution graph:**

1. First, BigQuery infers the join across both tables and pushes the `id` filter to the final table. Since the final table is CLUSTERED on the `id` column, BigQuery leverages this clustering to efficiently retrieve approximately 368K records out of the 200 million records in the final table.
    
    1. ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707093370991/955c73e2-2854-482d-97ca-414037e49ca5.png align="center")
        
2. Now, let's analyze the JOIN (S02). As part of the JOIN, BigQuery utilizes the CLUSTERED `id` to read only 362K records from the staging table, which contains 10 million records.
    
3. ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707093521527/ed3e790b-f52b-42f3-a836-cd1083da626c.png align="center")
    
    If you observe BigQuery optimally filtered data across both the staging and final table using the `CLUSTER` (index) on id processing only ~37MB of data costing 0.017 cents
    
4. ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707094093894/91417a6e-f39d-45ce-a0dd-6f9739571d43.png align="center")
    
    ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707095440054/ea78d9b2-e7e0-4b5b-8169-12897679d30e.png align="center")
    
    ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1707094185157/5851a167-953e-44fe-96e4-c01dbc9e4133.png align="center")
    

## Results: Unclustered vs Clustered Tables

|  | Unclustered | Clustered | Difference |
| --- | --- | --- | --- |
| **Bytes Processed** | 9.69GB | 37MB | ~260X |
| [**Slot time**](https://cloud.google.com/bigquery/docs/slots) | 40.5min | 2.5min | ~20x |
| [**Costs**](https://cloud.google.com/bigquery/pricing) | 4.7 cents | 0.017 cents | ~268X |

1. The amount of bytes processed by the MERGE on the clustered set of tables is 260 times lower than the unclustered group.
    
2. Slot time is 20 times lower on clustered tables compared to unclustered tables.
    
3. Costs ($$) are 268 times lower for MERGE on clustered tables compared to unclustered tables.
    

This is the impact a simple CLUSTER can have on compute costs for BigQuery. Based on the MERGE commands that you use, observe the join columns and columns in the WHERE clause, and intelligently [CLUSTER](https://cloud.google.com/bigquery/docs/creating-clustered-tables) tables on those columns. This can signficantly reduce costs. With a similar approach.

## Auto Clustering and Partitioning in PeerDB

We at [PeerDB](https://www.peerdb.io/) are building a fast and cost-effective way to replicate data from Postgres to Data Warehouses and Queues. Aligned with the above blog post, we [automatically cluster and partition](https://github.com/PeerDB-io/peerdb/pull/915) the raw and the final tables on BigQuery. The raw table is cluster based on 2 columns that have WHERE clause filters and partition based on timestamp. The final table is clustered based on the primary key. We've seen significant cost reduction (2x-10x) for our customers with this optimization.

## References

Hope you enjoyed reading the blog, here are a few references that you might find interesting:

1. [Shopify reduced their BigQuery spend from ~$1,000,000 to ~$1000](https://shopify.engineering/reducing-bigquery-costs).
    
2. [Five Useful Queries to get BigQuery costs.](https://blog.peerdb.io/five-useful-queries-to-get-bigquery-costs)
    
3. [Try PeerDB Cloud for free.](https://app.peerdb.cloud/)
    
4. Visit PeerDB's [**GitHub**](https://github.com/PeerDB-io/peerdb) repository to Get Started.
