How DBT helped us solve governance around BigQuery Scheduled Query at HelloFresh

Akash Ganguly · Published in HelloTech · 7 min read · Oct 31, 2023


At HelloFresh, we emphasize the importance of governance. Every pipeline we build goes through a set of checks via GitHub Actions, unit tests, data quality checks, etc. When HelloFresh was growing rapidly from early 2016 through 2022, a number of reporting dashboards were created from customer behavioral data on top of the BigQuery data warehouse. Google BigQuery lets you schedule queries to run on a recurrent basis, and data analysts did not shy away from leveraging that.

However, as HelloFresh grew in the data points collected and in data teams around the world, the solution around BigQuery Scheduled Query (BSQ) no longer met our needs. A lot of governance issues were reported: for example, any analyst could update the SQL in production without going through a proper round of mandatory checks, leading to unwanted surprises in the final dashboards. During this time, Data Build Tool (DBT) was on the rise and we started investigating it as a possible solution.

In this blog, we aim to cover two topics around BigQuery and DBT:

  • Migrating from BSQ and moving towards DBT
  • Making Google Analytics Data available outside of GCP

Introduction to the problem

BSQ[1] is very popular amongst data analysts as it allows them to run their SQL on a recurrent basis, cutting out a lot of dependencies on data engineers.

But there were scenarios where we found it rather challenging to keep maintaining the queries via BSQ, for instance:

  1. Queries were maintained by more than one user, so there was no way to track who changed what, leading to ungoverned deployments.
  2. Use cases that required breaking the SQL into multiple steps to improve query performance were not supported.
  3. There was no automatic retry of failed runs.

Hence it was time to move away from BSQ but where to go? What to opt for?

Well, here comes everyone’s current favorite, drum roll, DBT[2]. We will come to that later, but let’s first talk about the challenges around BSQ in a bit more detail.

In a company like HelloFresh, where data is growing in leaps and bounds by the day, the speed at which data insights are generated should be on par with the speed at which data is produced. We also need to ensure the means of producing data for analytical and operational use cases are robust and sustainable.

HelloFresh grew a lot during the pandemic, and access to sensible insights became the need of the hour. During this time, a limited number of analysts across HelloFresh started creating BSQs on the fly (given that all you need is a SQL query, and BSQ will run it recurrently with a few clicks of setup). Eventually, when the economy stabilized and HelloFresh hired more data engineers and analysts to deal with the enormous data more efficiently and cost effectively, we realized that many pipelines running as BSQ were ungoverned, failure prone and worth a king’s ransom.

There weren’t any processes to version control the SQL, review the changes made to it, run unit tests on it, or deploy it via a proper CI/CD framework ensuring uninterrupted delivery.

What is the solution?

Sometime around last year (2022), HelloFresh was in the process of onboarding Snowflake as the data warehouse (DWH) solution, and DBT with Snowflake was under discussion. At that time, DBT was still foreign to me, so I started reading more about it. The fact that DBT can work with almost any DWH caught my eye. I was working closely with BigQuery during that time and was aware of the issues around BSQ (as discussed above).

Going over the DBT documentation several times, I could see it solving the very issues that were constantly being raised to us (data engineers) about the immense effort needed to maintain and manage BSQs.

Getting away from BigQuery Scheduled Query and moving towards DBT

Of course, how DBT works with BigQuery is beyond the scope of this blog; you can find that better documented here[3]. Instead, let’s focus on how DBT with BigQuery helped us solve the pain points HelloFresh was facing.

  1. DBT leverages the data warehouse’s query engine to run the SQL queries, so the GoogleSQL of the existing pipelines could be re-used.
  2. DBT uses GitHub to maintain the SQL-based models. This ensures version control: no more hidden commits, and the possibility to deploy changes via a proper CI/CD framework.
  3. DBT can run unit tests on fixtures via seed files[4], cutting a lot of data warehouse compute cost. It also enables running SQL in parallel on the selected DWH.
  4. Apart from this, DBT also generates documentation and produces lineage graphs, to name a few more features.

With enough evidence in hand, we started running BigQuery jobs via DBT Cloud. By the way, DBT Cloud provides out-of-the-box scheduling, but we already had Airflow, which we leveraged to schedule DBT Cloud jobs via the DBT Cloud Airflow operators[5].
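For illustration, here is a minimal sketch of what such a DAG can look like using the DbtCloudRunJobOperator from the apache-airflow-providers-dbt-cloud package. The DAG name, connection id, job id and schedule below are hypothetical placeholders, not our production setup.

```python
# Minimal sketch: trigger a DBT Cloud job from Airflow and wait for it to finish.
from datetime import datetime

from airflow import DAG
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

with DAG(
    dag_id="dbt_cloud_bigquery_models",     # hypothetical DAG name
    schedule_interval="0 5 * * *",          # run once a day
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    run_dbt_job = DbtCloudRunJobOperator(
        task_id="trigger_dbt_cloud_job",
        dbt_cloud_conn_id="dbt_cloud_default",  # Airflow connection to DBT Cloud
        job_id=12345,                           # hypothetical DBT Cloud job id
        wait_for_termination=True,              # block until the job completes
        check_interval=60,                      # poll the run status every minute
        timeout=3600,                           # give up after an hour
    )
```

Keeping the scheduling in Airflow, which we already operated, lets the DBT Cloud runs sit next to all the other pipelines and reuse the retries and alerting we had in place there.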

Making Google Analytics Data available outside of GCP

HelloFresh uses Google Analytics to track customer behavioral data across the world. We receive close to 100M tracking events daily, which adds up to about 50GB of disk space per day.

Google Analytics rolls daily data up to BigQuery in both intraday (which is almost real time) and daily tables. And like many of you, we also don’t use BigQuery as our primary data warehouse. We have AWS S3 as the data lake, on top of which we also have a Kimball-based DWH exposed via AWS Glue.

(HelloFresh is also moving towards Snowflake (with Data Vault) but more about it in some other blogs.)

All of our operational and analytical assets reside in S3, so having customer behavioral data in a different cloud doesn’t really help much. One of our major initiatives here at HelloFresh is to find a cheaper data transfer solution that provides reliable and scalable customer behavioral data in the DWH.

We have been bringing data to S3 from BigQuery for quite some time now. Here’s what we have done so far:

The pipeline executes the following steps in sequence:

  1. A daily Airflow DAG triggers the job on DBT Cloud (via the DbtCloudRunJobOperator).
  2. DBT submits the SQL to BigQuery (via API) and waits synchronously for the SQL to materialize its output back to a BigQuery table/view.
  3. Once the DBT job has finished, Airflow triggers a Spark job that runs on AWS EMR to sync data from the BigQuery table created above and write it to S3 (you may go for alternatives here).
  4. The Spark job uses the spark-bigquery-with-dependencies_2.12[6] library to connect to BigQuery, retrieve the records and persist them to S3 (see the sketch after this list).
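To make steps 3 and 4 concrete, here is a minimal PySpark sketch of reading the table DBT materialized in BigQuery and persisting it to S3. The project, dataset, table, bucket and column names are hypothetical placeholders; the only real dependency is the spark-bigquery connector[6] on the EMR classpath (e.g. via --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.1).

```python
# Minimal sketch: copy a BigQuery table (produced by the DBT job) to S3 as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ga_bigquery_to_s3").getOrCreate()

events = (
    spark.read.format("bigquery")
    # hypothetical table produced by the DBT model
    .option("table", "my-gcp-project.analytics_marts.daily_events")
    .load()
)

(
    events.write.mode("overwrite")
    .partitionBy("event_date")  # assumes the model exposes an event_date column
    .parquet("s3://my-dwh-bucket/google_analytics/daily_events/")  # hypothetical bucket
)
```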

Expense Involved

In our case we run Airflow on Kubernetes and Spark jobs on EMR. These are used regularly to process/run thousands of DAGs across the company, so I will not dwell on the charges involved there. DBT Cloud comes at a price, but DBT Core, which is open source, is also a viable option (comment below if you want me to write about it). More about when to use DBT Cloud over DBT Core can be found here.

What we have to worry about is the BigQuery data transfer and compute cost:

BigQuery gets very costly when it comes to compute and outbound data transfer (egress) if not used mindfully. The processing cost can vary from $5/TB to $7/TB, and the egress cost can vary from $0.12/GB to $0.19/GB (depending on the data center).

Read the Google BigQuery pricing document[7] for current pricing.

But let’s take a simple example where you need to scan 50GB of data daily, which is ($7/1024GB) x 50GB ≈ $0.34/day.

Tip: Your SQL should always filter on the BigQuery table’s partition key to reduce the total data scanned.

For example, filtering on “date” as the partition key, one of our queries scans 42.04GB of data. Without the partition filter, the same query would scan 41.73TB from the same table.
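One cheap way to sanity check this before scheduling anything is a BigQuery dry run, which reports the bytes a query would scan without actually running it. Below is a minimal sketch using the google-cloud-bigquery client; the project, dataset, table and partition column are hypothetical placeholders.

```python
# Minimal sketch: estimate scanned bytes (and rough cost) with a BigQuery dry run.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

sql = """
    SELECT event_name, COUNT(*) AS events
    FROM `my-gcp-project.analytics_raw.daily_events`
    WHERE event_date = DATE "2023-10-01"  -- filter on the partition column
    GROUP BY event_name
"""

job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
)

scanned_gb = job.total_bytes_processed / 1024**3
print(f"Would scan {scanned_gb:.2f} GB, roughly ${scanned_gb / 1024 * 7:.2f} at $7/TB")
```

Dropping the partition filter from the WHERE clause and re-running the dry run makes the difference between a gigabyte-scale and a terabyte-scale scan immediately visible.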

And let’s say we have to transfer 50GB of data across continents, meaning we pay somewhere around $0.19 x 50GB = $9.5/day.

Total per day cost ~ $10

Yearly ~ $3650

Things to keep in mind:

  1. Depending on your DBT models, they may create temp tables on the fly, which will increase the compute cost, but we found this manageable at roughly $0.35/day per 50GB table created (in this case).
  2. Occasional backfills or pipeline failures and reruns can affect the pricing. This is true for both BSQ and DBT.
  3. This is not the overall pipeline price but only the BigQuery compute and data transfer cost.

Note: BigQuery storage pricing is not taken into account as it is minimal.

Tip: GoogleSQL is very powerful. Functions like UNNEST can save a lot of engineering effort and time; I would highly recommend leveraging them unless your business model says otherwise.
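As an illustration of the kind of flattening UNNEST gives you for free, here is a minimal sketch that explodes Google Analytics event parameters into one row per parameter, assuming a GA4-style events_* export; the project and dataset names are hypothetical placeholders.

```python
# Minimal sketch: flatten GA4-style event_params with UNNEST via the BigQuery client.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

sql = """
    SELECT
      event_date,
      event_name,
      param.key AS param_key,
      param.value.string_value AS param_value
    FROM `my-gcp-project.analytics_raw.events_*`,
         UNNEST(event_params) AS param     -- one row per event parameter
    WHERE _TABLE_SUFFIX = "20231001"       -- limit the wildcard scan to one day
"""

for row in client.query(sql).result():
    print(row.event_name, row.param_key, row.param_value)
```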

What did we learn?

If you have read this far, congrats! I am sure you now know the following:

  1. BigQuery Scheduled Query is not the best tool to manage production use cases for a medium to large sized company.
  2. Start learning DBT ASAP. Especially if you are an analytics engineer or data analyst, DBT is the tool for you.
  3. Getting Google Analytics data out of BigQuery is not a hassle and is affordable if done mindfully.

Closing Note

Thank you all for reading; I really appreciate you putting some time into this. We are doing more exciting stuff around customer behavior data, BigQuery and DBT, which will soon be published. Stay tuned, keep reading, keep learning.

References

[1]: https://cloud.google.com/bigquery/docs/scheduling-queries

[2]: https://docs.getdbt.com/docs/introduction

[3]: https://docs.getdbt.com/docs/core/connect-data-platform/bigquery-setup

[4]: https://docs.getdbt.com/docs/build/seeds

[5]: https://airflow.apache.org/docs/apache-airflow-providers-dbt-cloud/stable/operators.html

[6]: https://mvnrepository.com/artifact/com.google.cloud.spark/spark-bigquery-with-dependencies_2.12/0.18.1

[7]: https://cloud.google.com/bigquery/pricing
