Automating Data Pipelines: Deploying Azure Databricks with Terraform


About the author

Aparna Jore
Junior Software Engineer
Aparna Jore is a Junior Software Engineer at Nitor Infotech. She excels in Big Data technologies and Generative AI, emphasizing scalable data pipelines.

Big Data & Analytics | 29 May 2024 | 13 min

Deploying Databricks on Azure is crucial for organizations that want to harness the potential of data analytics and machine learning. Yet the manual, time-consuming deployment process often leads to errors and inconsistencies, hurting productivity and risking deployment failures.
To overcome these challenges, an efficient, automated approach to provisioning and managing Databricks resources on Azure is essential. This is where Terraform comes in: a popular open-source infrastructure-as-code tool that enables safe, predictable, and efficient provisioning and management of infrastructure resources.

Note: Read how we helped an oil & gas company in meeting compliance standards with Azure.

This blog will introduce you to Terraform and its benefits. Additionally, you’ll discover the steps needed to deploy Azure Databricks using Terraform.

What is Terraform and how does it make a difference?

Terraform, developed by HashiCorp, allows you to define infrastructure as code. This means you can write your infrastructure configurations as files and keep them under version control.

This approach offers several advantages, including repeatability, consistency, and the ability to track changes over time.

Here are the advantages for you to understand why Terraform can stand out as the right choice for you:

1. Multi-cloud Support: Terraform is a multi-cloud provisioning tool, supporting various providers like Azure, AWS, and Google Cloud. This allows consistent management across different cloud environments, facilitating multi-cloud or hybrid cloud strategies.

2. Declarative Language: Using a declarative language, it defines infrastructure configurations, abstracting the complexity of managing individual resources and their dependencies. This simplifies infrastructure management.

3. Preview Changes: It provides a plan command to preview infrastructure changes before applying them. This helps identify potential issues, reducing the risk of downtime or misconfigurations.

4. Infrastructure as Code: It allows defining infrastructure as code, providing a clear, version-controlled representation of the desired infrastructure state. This improves reproducibility and collaboration.

5. Automatic Dependency Management: It automatically manages dependencies between resources, ensuring that changes are applied in the correct order, reducing errors and ensuring consistency.

Clear with the basics? Great!


Learn how migrating your digital assets to Azure can secure your business in 2024.

Now, I’ll guide you through the process of deploying Azure Databricks using Terraform so that you can streamline your infrastructure provisioning and management with ease.

Deploying Azure Databricks with Terraform: Step-by-Step Guide

The following steps are necessary for the deployment process:

1. Installation of Hashicorp Terraform:

First, start by installing Hashicorp Terraform for Windows.

2. Authentication to Azure from Terraform (using a Service Principal with a Client Secret):

To authenticate using a Service Principal with a client secret, you need to create a new file named main.tf and include the following information:

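A minimal sketch of such a main.tf, assuming the hashicorp/azurerm and databricks/databricks providers and the variable names used later in this post:

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.0"
    }
  }
}

# Authenticate to Azure with a service principal and client secret
provider "azurerm" {
  features {}
  client_id       = var.my_client_id
  client_secret   = var.client_secret_key
  tenant_id       = var.my_tenant_id
  subscription_id = var.my_subscription_id
}
```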

Next, create another file named variable.tf to declare the variables:

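A variable.tf declaring the four authentication variables might look like this (the names match the databricks.auto.tfvars sample below; descriptions are illustrative):

```hcl
variable "my_client_id" {
  type        = string
  description = "Service principal application (client) ID"
}

variable "client_secret_key" {
  type        = string
  description = "Service principal client secret"
  sensitive   = true
}

variable "my_tenant_id" {
  type        = string
  description = "Azure AD (Microsoft Entra ID) tenant ID"
}

variable "my_subscription_id" {
  type        = string
  description = "Azure subscription ID"
}
```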

Also, create a new file named databricks.auto.tfvars and assign values to these variables in that file.

Note: Replace the values of client_id, client_secret, tenant_id and subscription_id with actual values in the code given below:

my_client_id = "0000-0000-0000-0000-0000-00000"
client_secret_key = "0000-0000-0000-0000-0000-00000"
my_tenant_id = "0000-0000-0000-0000-0000-00000"
my_subscription_id = "0000-0000-0000-0000-0000-00000"

3. Providing the application with permissions to manage resources in your Azure Subscription:

Create an application in Azure Active Directory (Microsoft Entra ID -> App registrations) and assign the contributor role to the application.

4. Create a new Terraform file for deploying a resource group to Microsoft Azure:

Here, you need to create a Terraform file named Resource_group.tf:

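A minimal Resource_group.tf sketch; the resource group name and location here are hypothetical:

```hcl
resource "azurerm_resource_group" "this" {
  name     = "rg-databricks-demo" # hypothetical name
  location = "East US"
}
```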

5. Create a Terraform file for deploying a Databricks workspace to Microsoft Azure:

Use this code to create the file:

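For example, a databricks_workspace.tf along these lines (the workspace name and SKU are assumptions; it references the resource group defined in the previous step):

```hcl
resource "azurerm_databricks_workspace" "this" {
  name                = "dbw-demo" # hypothetical name
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  sku                 = "standard"
}
```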

6. Create a Terraform file for deploying a Databricks cluster to Microsoft Azure:

Use this code to create the file:

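One way to sketch this, using the Databricks provider's data sources to pick a node type and Spark version (cluster name and sizing values are illustrative):

```hcl
# Point the Databricks provider at the workspace created earlier
provider "databricks" {
  host = azurerm_databricks_workspace.this.workspace_url
}

# Smallest node type with a local disk, and the latest LTS Spark version
data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = "demo-cluster" # hypothetical name
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  num_workers             = 1
  autotermination_minutes = 20
}
```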

7. Create a Terraform file for deploying a Databricks notebook to Microsoft Azure:

  • First, create a notebook_blob_storage.py file and add the following code to it.
  • This code snippet extracts data from a file stored in an Azure Blob Storage container. You can also add code to apply various transformations and actions within this notebook.

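A sketch of such a notebook. The storage account, container, secret scope, and file name below are hypothetical placeholders, and spark, dbutils, and display are globals provided by the Databricks runtime:

```python
# notebook_blob_storage.py — runs inside a Databricks notebook
storage_account = "<storage-account-name>"  # hypothetical
container = "<container-name>"              # hypothetical

# Fetch the storage key from a (hypothetical) Databricks secret scope
access_key = dbutils.secrets.get(scope="demo-scope", key="storage-key")

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    access_key,
)

# Extract data from a CSV file in the Blob Storage container
df = spark.read.csv(
    f"wasbs://{container}@{storage_account}.blob.core.windows.net/input.csv",
    header=True,
    inferSchema=True,
)

# Transformations and actions can be added here
display(df)
```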

  • Add the following variables to the variable.tf file you created earlier:

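The exact variables depend on your setup; one plausible set of declarations (the names here are assumptions, not from the original listing):

```hcl
variable "notebook_path" {
  type        = string
  description = "Workspace path where the notebook will be deployed"
}

variable "local_notebook_file" {
  type        = string
  description = "Path to the local notebook source file"
}
```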

  • Also, assign values to these variables in the databricks.auto.tfvars file you created earlier:

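And the matching (hypothetical) assignments in databricks.auto.tfvars:

```hcl
notebook_path       = "/Shared/notebook_blob_storage"
local_notebook_file = "notebook_blob_storage.py"
```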

Now, the last step is to deploy the Databricks notebook. To do so, create a new Terraform file named databricks_notebook.tf:

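A minimal databricks_notebook.tf sketch that uploads the local Python file into the workspace (the variable is assumed to be declared in variable.tf):

```hcl
resource "databricks_notebook" "blob_storage" {
  source   = "${path.module}/notebook_blob_storage.py"
  path     = var.notebook_path # assumed declared in variable.tf
  language = "PYTHON"
}
```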

8. Create a Terraform file for deploying a Databricks job to Microsoft Azure:

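A sketch of a job definition that runs the notebook on the cluster defined earlier (the task key and the job_name variable are assumptions):

```hcl
resource "databricks_job" "this" {
  name = var.job_name # assumed declared in variable.tf

  task {
    task_key            = "run-notebook"
    existing_cluster_id = databricks_cluster.this.id

    notebook_task {
      notebook_path = databricks_notebook.blob_storage.path
    }
  }
}
```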

Then add a new variable in databricks.auto.tfvars:

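For example, assuming a job_name variable (remember to declare it in variable.tf as well):

```hcl
# variable.tf
variable "job_name" {
  type        = string
  description = "Name of the Databricks job"
}

# databricks.auto.tfvars
job_name = "demo-notebook-job" # hypothetical value
```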

Bonus Read:

Keep a note of these Terraform commands:

  • terraform init: This command prepares the working directory for Terraform, including backend initialization, child module installation, and plugin installation.
  • terraform plan: The plan command generates an execution plan outlining the actions Terraform will take without performing those actions.
  • terraform apply: This command creates or updates infrastructure based on the configuration files. It typically generates a plan first, which needs to be approved before applying the changes.
  • terraform destroy: The destroy command removes the infrastructure created by Terraform apply, effectively deleting the resources provisioned.

9. Automate the deployment process using a Python script:

You can use a Python script like the following to automate the complete Azure Databricks deployment process with Terraform:

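A simple sketch of such a script, which shells out to the Terraform CLI; it assumes terraform is on your PATH and that it is run from the directory containing your .tf files:

```python
import subprocess

# Terraform commands to run, in order
TERRAFORM_STEPS = [
    ["terraform", "init"],
    ["terraform", "plan", "-out=tfplan"],
    ["terraform", "apply", "-auto-approve", "tfplan"],
]

def run_step(cmd, cwd="."):
    """Run one command, returning its stdout; raise if it exits non-zero."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{' '.join(cmd)} failed:\n{result.stderr}")
    return result.stdout

def deploy(cwd="."):
    """Run init, plan, and apply back to back."""
    for cmd in TERRAFORM_STEPS:
        print(f"Running: {' '.join(cmd)}")
        print(run_step(cmd, cwd=cwd))

# deploy()  # call from the directory containing your .tf files
```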

By the end of these 9 steps, you’ll be able to deploy Azure Databricks successfully with Terraform.

If you’re now curious about the cost structure of Databricks and Terraform, take a moment to read the following section to familiarize yourself with it!

Pricing: Databricks and Terraform

1. Databricks Standard Tier Pricing:

Databricks pricing varies based on your specific use case and the instance types you choose. In the above case, we ran a proof of concept (POC) and did not incur any costs. For utilization-based rates, see the official pricing page:

https://azure.microsoft.com/en-in/pricing/details/databricks/

2. Terraform Pricing Overview:

  • Deploying Databricks on cloud infrastructure using services like Azure Databricks may involve costs related to virtual machines, storage, data processing, networking, and additional services utilized within the cloud environment.
  • On the other hand, using a managed service like Azure Databricks may involve costs for the service itself, which typically includes a combination of compute resources, storage, data processing, management features, and support from the provider.

Given below are tier-wise specifications:

  • Free Tier: HCP Terraform (formerly Terraform Cloud) offers a free tier that allows users to manage up to 500 resources per month at no cost; the open-source Terraform CLI itself is free to use.
  • Standard Tier: Beyond the free tier, users are charged at a rate of $0.00014 per resource per hour.
  • Plus and Enterprise Tiers: Pricing for the Plus and Enterprise tiers is not published publicly and requires contacting the sales team.

I hope I have provided all the necessary details for you to navigate across the Terraform terrain seamlessly.

Note: While deploying Azure Databricks with Terraform, it’s important to note some limitations. Terraform offers limited resource support for Azure Databricks compared to other Azure services, and certain features or configurations may not be fully supported, requiring alternative approaches.

So, its effectiveness in deploying Azure Databricks can vary depending on specific requirements and use cases. It is advised to assess your needs and then act based on them.

You can reach out to our team of experts at Nitor Infotech who can assess your business and provide the best big data solutions.
