If you’re wondering what makes some businesses scale faster and better, the answer often comes down to accurate insights and data integrity. For those playing the long game, my recommendation is to move SQL-specific workloads to SQL warehouses. SQL warehouses offer a centralized, efficient, and reliable way to manage your data, ensuring consistency and performance.
Snowflake and Databricks stand tall among the rest of the warehouses out there, and this blog will help you choose the right SQL warehouse for your data-driven projects. I’ll delve into the functionalities, warehouse types, performance benchmarks, and cost structures of Snowflake and Databricks SQL warehouses, helping you make an informed decision in 2024.
So, let’s get started with the basics!
Difference between Snowflake and Databricks
With the basics covered, let’s explore some of the advanced aspects of Snowflake and Databricks.
Comparing Advanced Features: Snowflake vs. Databricks
Onwards to the variants of Databricks and Snowflake!
Types of Warehouses: Databricks vs. Snowflake
When it comes to Databricks, there are four compute variants to choose from:
- All-Purpose Compute: This is best suited for batch processing using SQL queries. It offers greater control over cluster configurations but comes at a higher cost than job compute.
- Job Compute: This is ideal for running specific jobs. It allows fine-grained control over cluster resources and automatically terminates after job completion, minimizing idle resource costs.
- Compute with Photon: This high-performance option leverages the Photon vectorized query engine to accelerate SQL workloads and DataFrame API calls, leading to faster execution and potentially lower overall costs.
- Interactive Clusters: This is optimized for interactive queries and reading data.
Each of the above variants provides varying levels of performance and scalability based on your requirements.
In contrast, Snowflake offers two types of warehouses:
- Standard warehouses: suitable for general data warehousing and analytics tasks.
- Snowpark-optimized warehouses: cater to memory-intensive workloads, such as machine learning training with Snowpark, for enhanced performance.
Both provide compute resources (CPU, memory, temporary storage) and can be scaled up or down on demand to match your workload needs.
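As a rough illustration of how this elasticity works in practice, here is a minimal Snowflake SQL sketch; the warehouse names and sizes are placeholders, not recommendations:

```sql
-- Illustrative only: warehouse names and sizes are placeholders.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60        -- suspend after 60 seconds of inactivity
  AUTO_RESUME    = TRUE;

-- Scale up on demand for a heavier workload, then back down when done.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL';

-- A Snowpark-optimized warehouse (MEDIUM or larger) for memory-heavy workloads.
CREATE WAREHOUSE IF NOT EXISTS snowpark_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED';
```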
Did you know?
Snowflake and Databricks offer warehouses in various sizes. Here’s a breakdown:
Refer to the hourly costs of the enterprise edition for both platforms:
(I’ll talk about the cost structure in detail in the latter half of the blog. Keep reading!)
Moving ahead, I’ll provide a performance comparison of both platforms based on a recent assessment.
Performance Benchmarking: Unveiling the Powerhouse
To objectively assess performance, my team and I conducted a benchmarking exercise using the TPC-H dataset, a widely recognized decision support benchmark. This dataset includes a suite of business-oriented ad-hoc queries with varying complexity levels.
We utilized a 100GB dataset size for the comparison. You can find more details about the TPC-H benchmark on the TPC website.
A. Here is an overview of the query complexities:
The chosen queries ranged from low to high complexity, encompassing tasks like the following (a sample of the simplest query appears after the list):
- Pricing Summary Reports (Low)
- Identifying Minimum Cost Suppliers (Medium)
- Analysing Shipping Priorities (Medium)
- Order Priority Checks (Medium)
- Local Supplier Volume Analysis (High)
- Volume Shipping Queries (Medium)
- National Market Share Exploration (Low)
- Product Type Profit Measurement (High)
- Returned Item Reporting (High)
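For reference, the pricing summary report at the top of that list corresponds to TPC-H Q1, which in its standard form looks roughly like this (the 90-day date filter is pre-computed here, since interval syntax differs slightly between the two engines):

```sql
-- TPC-H Q1: pricing summary report over the lineitem table.
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity)                                       AS sum_qty,
    SUM(l_extendedprice)                                  AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity)                                       AS avg_qty,
    AVG(l_extendedprice)                                  AS avg_price,
    AVG(l_discount)                                       AS avg_disc,
    COUNT(*)                                              AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-02'   -- 1998-12-01 minus 90 days
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
```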
B. Performance optimization techniques:
We explored the following techniques for both:
| Snowflake | Databricks |
|---|---|
| i. Micro-partitioning: Data is stored in small partitions of roughly 50–500 MB, and metadata is maintained for each micro-partition for faster data retrieval. | i. Delta tables: Leverage Delta Lake and the Parquet file format, which enables further optimization techniques. |
| ii. Data clustering: Snowflake maintains clustering metadata for the micro-partitions within a table, including the number of micro-partitions, how much their value ranges overlap, and the clustering depth. | ii. OPTIMIZE and Z-ORDER: OPTIMIZE compacts small files, while Z-ordering co-locates related values in the same files so queries scan less data; best suited to high-cardinality filter columns. |
| iii. Auto-clustering: Snowflake offers a service that automatically and continuously re-clusters clustered tables as needed, and only when doing so benefits the table. | iii. Partitioning: Stores similar data within the same set of files, facilitating faster data retrieval; typically used for low-cardinality fields. |
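To make these techniques concrete, here is a minimal sketch of how they are typically applied on each platform; the table and column names are illustrative:

```sql
-- Snowflake: define a clustering key so micro-partitions are organized by a commonly
-- filtered column; auto-clustering then maintains it in the background.
ALTER TABLE lineitem CLUSTER BY (l_shipdate);

-- Optional: inspect how well the table is clustered on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('lineitem', '(l_shipdate)');

-- Databricks (Delta Lake): compact small files and co-locate rows by a high-cardinality
-- filter column so the engine can skip files more effectively.
OPTIMIZE lineitem
ZORDER BY (l_shipdate);
```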
Note: Apart from the techniques above, for Databricks you can also tune the following Spark configurations (see the snippet after this list):
- spark.executor.cores
- spark.executor.instances
- spark.executor.memory
- spark.sql.adaptive.enabled
- spark.sql.files.maxPartitionBytes – default 128 MB
- spark.default.parallelism
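As a rough sketch, the SQL-level settings can be adjusted per session from a notebook, while the executor-level ones are cluster-wide; the values below are illustrative, not tuning recommendations:

```sql
-- Session-level settings (illustrative values, not tuning advice):
SET spark.sql.adaptive.enabled = true;              -- enable adaptive query execution
SET spark.sql.files.maxPartitionBytes = 134217728;  -- 128 MB per input split (the default)

-- Executor-level settings (spark.executor.cores / instances / memory and
-- spark.default.parallelism) are cluster-wide, so they belong in the cluster's
-- Spark config rather than a SQL session, for example:
--   spark.executor.memory 16g
--   spark.executor.cores 4
```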
C. Results and observations:
Before optimizations, Databricks execution times for TPC-H queries varied from 30 to 40 seconds, with some complex queries taking even longer.
After implementing the optimization techniques mentioned above, we observed a significant improvement in query execution times. The average execution time dropped to around 20 seconds, with a maximum reduction of 64% achieved for a specific query.
However, it’s important to note that Snowflake, even without extensive optimization, generally delivered faster execution times compared to unoptimized Databricks configurations. This could be attributed to Snowflake’s inherent architecture and query processing engine.
Test results: We compared warehouses ranging from “XS to 4XL” for both platforms. Refer to the images below for clarity:
(Figures: query execution time comparisons for cluster sizes XS, S, M, L, XL, 2XL, 3XL, and 4XL.)
The data above shows that Databricks initially lagged behind Snowflake’s performance up to the medium warehouse size. However, as the warehouse size grew, Databricks started to perform better and eventually surpassed Snowflake in some cases.
Fig: Snowflake vs. Databricks – Cost Comparison per month
Let’s now compare the cost structures across different editions and sizes of warehouses!
Cost Comparison: Finding the Right Balance
Evaluating cost is another crucial factor when choosing an SQL warehouse. Here’s a breakdown of the pricing structures for both platforms:
Databricks Pricing:
Due to the absence of publicly available, like-for-like pricing information for Databricks editions (Standard, Premium, Enterprise), a direct cost comparison with Snowflake is challenging. However, Databricks utilizes a tiered structure based on:
- Warehouse Size: Warehouses are billed in Databricks Units (DBUs), which represent the compute resources consumed. The cost varies depending on the chosen warehouse size (e.g., X-Small, Large), and the price per DBU/hour also depends on factors like cloud platform, instance type, and commitment terms. You can find more information on Databricks pricing on their website.
- Spot Instances: Databricks allows utilizing spot instances for cost optimization. These offer lower costs but come with the potential for resource interruption.
Snowflake Pricing:
Snowflake offers a clearer pricing structure based on the following (a spend-estimation sketch follows the list):
- Edition: Standard, Enterprise, and Business Critical editions are available, each with varying features and support levels. The cost per credit/hour increases with higher editions.
- Warehouse Size: Like Databricks, Snowflake charges based on credit per hour for warehouse sizing, with costs scaling according to the size of the deployed virtual warehouse.
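Since Snowflake bills per credit, one practical way to estimate spend is to query its own usage metadata and multiply the credits consumed by your edition's per-credit price. This is a minimal sketch, assuming your role has access to the SNOWFLAKE.ACCOUNT_USAGE share:

```sql
-- Credits consumed per warehouse over the last 30 days; multiply by your
-- edition's price per credit to estimate spend.
SELECT
    warehouse_name,
    SUM(credits_used) AS credits_last_30_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_30_days DESC;
```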
Choosing the Right SQL Warehouse
To be honest, the choice between Snowflake and Databricks SQL Warehouse hinges on your specific data processing and analytics needs.
a. Choose Databricks if:
- Spark-based analytics and Machine Learning are your core activities: Databricks’ heritage in machine learning development offers a mature environment for these tasks.
- You have a skilled team for optimization: Databricks requires more configuration for optimal performance, but skilled users can unlock its significant potential.
- Handling large and complex workloads is crucial: While our testing showed Snowflake’s advantage for smaller workloads, some resources suggest Databricks might excel with massive datasets.
b. Choose Snowflake if:
- Ease of management and pre-optimized performance are priorities: Snowflake’s simpler setup and out-of-the-box performance are ideal for business intelligence (BI) tools and basic analytics.
- Cost-effectiveness is a major concern: Snowflake’s predictable pricing structure can be advantageous, especially for workloads requiring minimal configuration.
The above analysis is based on a free-tier evaluation, and your results might differ based on workload specifics and data size. However, you can consider these factors and use the free trials to make a well-informed decision for your business.
So, start leveraging big data and boost your analytical capabilities with Nitor Infotech. We’d also love to hear your thoughts on this blog, so feel free to reach out and share your feedback.