Better Data Operations with Snowflake

A not-so-uncommon morning!

Jane, a BI analyst, anxiously logs on at 7:00 a.m. every morning. At 7:30 a.m., the BI tool emails the daily operations scorecard to the 100 analysts, managers, directors, and vice presidents who rely on it for their morning meetings. Jane's job is to review the scorecard and stop it from going out if something is wrong. Today her anxiety spikes: the report won't open. Unsure what to do next, she makes the dreaded call to her boss.

Do you work with data? Does this horror story sound familiar? If so, then Snowflake might be a good choice for you.

 

How did this happen?

Here are a few potential causes that are all too common in data professionals' lives, any of which could have contributed to the scenario above.

  • Someone accidentally drops a table in prod. Recovering it would mean restoring the entire database into stage and pulling the table out, a restore that takes one to three days.
  • Someone changes a data flow in prod, and a downstream table gets populated with incorrect data. They didn't work in dev or stage because the data there hadn't been refreshed in months.
  • A report writer doesn't want to wait on the DW team to build out a table, so she writes her own SQL and creates the report just before the end of the month. The company's fraud detection team uses this report heavily during the first week of the month. She didn't realize how expensive her query was, and the database slows to a crawl.

 

What Snowflake features could help?

UNDROP

Most data professionals have had an "oops" moment where we dropped a table that we didn't mean to drop. Some poor souls have even dropped entire schemas or databases. Part of Snowflake's magic is that you can use the UNDROP command to recover objects dropped within your Time Travel retention period (24 hours by default). The first time my team needed this command, it took us 15 minutes to restore a few dropped tables. Most of that time was spent reading the documentation for the first time.
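Recovering an object is a one-line statement. The object names below are hypothetical, but the syntax is standard Snowflake SQL:

    -- Hypothetical object names; UNDROP restores the dropped object
    -- from Time Travel, data and metadata included.
    UNDROP TABLE reporting.public.daily_scorecard;

    -- Dropped schemas and entire databases recover the same way:
    UNDROP SCHEMA reporting.public;
    UNDROP DATABASE reporting;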

CLONE

Making copies of your data in lower lanes couldn't be easier. It is not uncommon for people to do development work in prod because the lower lanes are out of date. This inevitably leads to mistakes. Snowflake allows you to clone a database, schema, or table with one simple command.
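Here is what that command looks like in practice; the object names are hypothetical:

    -- Each statement creates a zero-copy clone of the source object.
    CREATE TABLE orders_dev CLONE orders;
    CREATE SCHEMA analytics_dev CLONE analytics;
    CREATE DATABASE stage CLONE prod;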

The best part is that this process is unbelievably fast. I cloned a table with 1.5 billion rows in 2.5 seconds. One of my friends works at an IoT company; once a week, a scheduled job clones prod to stage, and it completes in only 15 minutes. Cloning is this fast because it is a metadata-only operation: Snowflake doesn't physically copy the data, which is why it's often called zero-copy cloning.

Dynamic Scaling

Under the covers, your data sits in S3, and EC2 clusters handle the query processing. This makes it easy to add processing power at the click of a button, or to configure Snowflake to increase the number of EC2 clusters based on need. The Snowflake mechanism that controls this process is called a virtual warehouse, and it allows you to tailor compute to specific use cases.
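That "click of a button" is a single statement. The warehouse name below is hypothetical:

    -- Resize an existing virtual warehouse; new queries immediately
    -- run on the larger size, with no downtime.
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';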

Suppose you have these use cases:

  • Analysts and business users creating/viewing reports on Tableau Server
  • Data scientists performing experiments
  • Data engineers streaming data into a data lake

Each of these use cases has different compute needs. Snowflake allows you to set up a virtual warehouse tailored to each one, as the following three configurations show.

[Image: Create Warehouse screenshot, the COMPUTE_TABLEAU_SERVER configuration]

This configuration only turns on the virtual warehouse when people are using it. If there is no activity for five minutes, it shuts off automatically. It can also increase the number of clusters when queries start to queue up.

The advantage of this setup is that when Tableau users are not using Snowflake, on weekends, holidays, and evenings, we are not paying for it. If a Tableau user needed to run a query at 3:00 a.m. on Christmas Eve, Snowflake would automatically turn on for them and shut off when they were done. It also automatically adds clusters when a group of people hits the server all at once, such as between 8:00 a.m. and 9:00 a.m. when people get into the office, or during monthly audits.
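The screenshot's settings map to DDL roughly like the following sketch; the suspend timeout and cluster counts are assumptions based on the description above (note that multi-cluster warehouses require Snowflake's Enterprise Edition):

    -- Sketch of the Tableau Server warehouse described above.
    -- Timeout and cluster counts are illustrative assumptions.
    CREATE WAREHOUSE COMPUTE_TABLEAU_SERVER
      WAREHOUSE_SIZE      = 'XSMALL' -- most BI queries run fine on X-Small
      AUTO_SUSPEND        = 300      -- shut off after 5 minutes of inactivity
      AUTO_RESUME         = TRUE     -- wake up when a query arrives
      MIN_CLUSTER_COUNT   = 1        -- multi-cluster: add clusters...
      MAX_CLUSTER_COUNT   = 4        -- ...when queries start to queue
      INITIALLY_SUSPENDED = TRUE;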

[Image: Create Warehouse screenshot, the data science configuration]

This configuration also turns on when people are using it and turns off when they aren't. However, it will not add clusters if a lot of queries hit it all at once. Instead, the queries it processes run roughly eight times faster than they would on COMPUTE_TABLEAU_SERVER.

The advantage of this setup is that it allows a small team of data scientists to run massive queries quickly without impacting other groups or processes. Since most queries run just fine on an x-small warehouse, there is no need to give other use cases larger sizes.
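A comparable sketch for the data science warehouse, assuming a Large size (which has roughly eight times the compute of an X-Small, matching the comparison above) and a hypothetical name:

    -- Sketch of the data science warehouse described above.
    CREATE WAREHOUSE COMPUTE_DATA_SCIENCE
      WAREHOUSE_SIZE      = 'LARGE' -- ~8x the compute of an X-Small
      AUTO_SUSPEND        = 300     -- still shuts off when idle
      AUTO_RESUME         = TRUE
      MAX_CLUSTER_COUNT   = 1       -- single cluster: no scale-out under load
      INITIALLY_SUSPENDED = TRUE;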

[Image: Create Warehouse screenshot, the streaming configuration]

This configuration keeps the virtual warehouse turned on at all times with two X-Small clusters. The advantage of this setup is that other processes won't disrupt the stream. If 50 Tableau users suddenly overwhelmed the database with 200 expensive queries, our stream would not be impacted. And because the virtual warehouse never turns off, it is always available to process data quickly.
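And a sketch of the always-on streaming warehouse, again with a hypothetical name; the key settings are disabling auto-suspend and pinning two clusters:

    -- Sketch of the streaming warehouse described above.
    CREATE WAREHOUSE COMPUTE_STREAMING
      WAREHOUSE_SIZE    = 'XSMALL'
      AUTO_SUSPEND      = 0  -- 0 disables auto-suspend; always on
      MIN_CLUSTER_COUNT = 2  -- two X-Small clusters always running
      MAX_CLUSTER_COUNT = 2;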

 

Better Data Operations for Jane!

Had Jane's team been using Snowflake, she would have avoided every one of the scenarios that could have stopped the daily operations scorecard from going out. An accidentally dropped table could have been restored in minutes with UNDROP. People would not be developing in prod, because stage and dev could be refreshed with a quick clone. And when the fraud detection team ran its monthly audits, Snowflake would have scaled out automatically and scaled back down once they were done.

Snowflake is an incredible tool for data warehousing, and it solves many DW problems in ways that other databases simply cannot.

 


Author: Jake Anderson, Data Scientist