Are you in search of a solution that offers high-performance, column-oriented, real-time analytics? How about a data store that can handle large volumes of data and provide lightning-fast insights? Well, Apache Druid can do it all for you. Before proceeding with this blog, I strongly recommend that you read my previous blog about Apache Druid to get a complete overview of its features, architecture, and comparisons with other open-source database management systems.
Done reading? Great!
Now in this blog, you will dive into the world of Apache Druid and walk through the step-by-step process of installing and setting up this cutting-edge technology. You will also delve into data ingestion, learning how to seamlessly bring data into Apache Druid for analysis.
By the end of this blog, you will have a fully functional Apache Druid cluster ready to handle real-time analytical needs for your business.
Prerequisites before installation
Before we dive into the details, it’s important to ensure you have the necessary prerequisites in place. Here’s what you’ll need:
- Java Development Environment: At the foundation, you’ll require a Java Development Kit (JDK) version 8 or higher installed on your system. The JDK provides essential tools for developing and testing Java applications.
- Operating System Familiarity: A solid understanding of Linux or Unix-based operating systems is crucial. These platforms often form the backbone of server environments, and being comfortable with their command-line interfaces will be highly valuable.
- Big Data Infrastructure: Familiarity with distributed file systems, particularly the Apache Hadoop Distributed File System (HDFS), is important. HDFS is designed to handle large datasets efficiently on commodity hardware, making it a key component in advanced analytics applications.
- Data Formats: A basic understanding of SQL and JSON data formats is required. SQL is the standard language for managing data in relational database management systems, while JSON is a popular format for data interchange, especially in web applications.
- Streaming Platform: A fundamental knowledge of Apache Kafka, a distributed streaming platform, will also be beneficial. Kafka is widely used for real-time data processing, so having some familiarity with it will be advantageous.
Got your basics ready? Awesome! You are now set to embark on this journey of installation and data ingestion with Apache Druid and discover how it can revolutionize your data analytics workflows.
Deploying Apache Druid on a single server and connecting it to Kafka for real-time data ingestion can be achieved in just 14 steps.
Let’s explore these steps in the next section!
14 Steps to Deploy Apache Druid with Kafka for Real-Time Data Ingestion
Step 1: Install Java
Ensure that Java is installed on your system as it is essential for running Apache Druid.
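For example, on a Debian or Ubuntu system you can install OpenJDK via apt. This is a minimal sketch; package names vary across distributions and Java versions:

```bash
# Install OpenJDK 8 (Debian/Ubuntu example; adjust for your distribution)
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
```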
Step 2: Verification
Verify that both Java and Python are installed, as Druid’s startup scripts depend on them.
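A quick sanity check from the terminal:

```bash
# Confirm both runtimes are available on the PATH
java -version
python3 --version
```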
Step 3: Get Apache Druid
Download the Apache Druid tar file from the official website.
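For example, using wget (the version number below is illustrative; pick the latest release from the official downloads page):

```bash
# Download the Druid binary tarball (replace 25.0.0 with the latest version)
wget https://dlcdn.apache.org/druid/25.0.0/apache-druid-25.0.0-bin.tar.gz
```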
Step 4: Extract the downloaded file
Extract the contents of the downloaded tar file to a directory on your system.
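Continuing the example above:

```bash
# Unpack the tarball and switch into the resulting directory
tar -xzf apache-druid-25.0.0-bin.tar.gz
cd apache-druid-25.0.0
```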
Step 5: Set Environment Variables
Set the JAVA_HOME and DRUID_HOME environment variables in your Linux .bashrc file to point to the Java and Druid installation directories, respectively.
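A sketch of what those entries might look like; the paths are examples and must match your actual install locations:

```bash
# Add these lines to ~/.bashrc (example paths; adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export DRUID_HOME=$HOME/apache-druid-25.0.0
export PATH=$PATH:$DRUID_HOME/bin
```

Run `source ~/.bashrc` afterwards so the variables take effect in your current session.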
Step 6: Start Druid
Initiate the Apache Druid service by executing the “start-micro-quickstart” command. This configuration is sized for a machine with 4 CPUs and 16 GB of RAM.
Once started, access the Druid web console by opening the link printed in the startup output (by default, http://localhost:8888) in your browser.
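Putting the two together:

```bash
# Launch the micro-quickstart configuration from the Druid home directory
cd $DRUID_HOME
./bin/start-micro-quickstart
```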
Step 7: Load Data
In the Druid web console, navigate to the “load data” section and choose “start a new streaming spec”.
Step 8: Connect to Kafka
Select Apache Kafka as the data source (Druid will consume the data directly from a Kafka topic) and then click on “Connect data”.
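If you do not yet have a topic with data in it, here is a hedged sketch of creating one and producing a sample JSON event. The broker address, topic name, and field names (including temp_C, which the transform step below relies on) are assumptions for this walkthrough:

```bash
# Create a topic (assumes a local Kafka broker on the default port)
kafka-topics.sh --create --topic sensor-data \
  --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

# Produce a sample JSON event; "temp_C" holds the Celsius reading
echo '{"timestamp":"2023-06-01T10:00:00Z","sensor_id":"s1","temp_C":25.0}' | \
  kafka-console-producer.sh --topic sensor-data --bootstrap-server localhost:9092
```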
Step 9: Configuration
Specify the Bootstrap Servers (for example, localhost:9092) and the Kafka Topic details. Click “Apply” and then “Next” to proceed.
Step 10: Data Parsing
Once the data starts loading, verify the parsing details against your data format, which in this case is JSON.
After disabling the “Parse Kafka metadata” option, click Apply to view the data in a tabular format. Then click Next.
Step 11: Data transformation
After clicking ‘Next’ a few times, you will reach the data transformation options.
In the data transformation phase, you can apply column transforms; here, you will add a new column named “temp_F”. To accomplish this, navigate to the “Add column transform” option, where you’ll be prompted to input details such as the name of the column.
Keep the default type as “expression” and proceed to write an expression that calculates the values for the new column.
In this instance, we are converting Celsius to Fahrenheit. Once the expression is defined, the new column will be seamlessly incorporated into the dataset.
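Behind the scenes, the wizard records this as an expression transform in the ingestion spec. A minimal sketch, assuming the incoming Celsius column is named temp_C:

```bash
# Save a reference copy of the transform entry the wizard generates.
# "temp_C" is an assumed input column name for this walkthrough.
cat <<'EOF' > temp-transform.json
{
  "type": "expression",
  "name": "temp_F",
  "expression": "(temp_C * 9 / 5) + 32"
}
EOF
```

Druid’s expression language supports basic arithmetic, so this evaluates the standard Celsius-to-Fahrenheit formula for each row.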
Step 12: Data segmentation
Now, we need to select the segmentation criteria used to create the data segments, most importantly the segment granularity (for example, hour or day), which determines how Druid partitions the ingested data.
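In the generated spec, these choices map to the granularitySpec. A hedged sketch, assuming hourly segments suit the data volume:

```bash
# Reference copy of a granularitySpec fragment; "hour" is one reasonable choice
cat <<'EOF' > granularity-sketch.json
{
  "segmentGranularity": "hour",
  "queryGranularity": "none",
  "rollup": false
}
EOF
```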
Step 13: Finalize and Submit
After navigating through several screens by clicking ‘Next’, click on the ‘Submit’ button.
Once data ingestion is complete, navigate to the “Datasources” tab in the Druid web console to view details of the ingested data source.
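The console’s ‘Submit’ button is equivalent to posting the assembled JSON spec to Druid’s supervisor API. A minimal sketch, assuming the full spec has been saved as kafka-spec.json:

```bash
# Programmatic equivalent of clicking 'Submit' in the console
curl -X POST -H 'Content-Type: application/json' \
  -d @kafka-spec.json \
  http://localhost:8888/druid/indexer/v1/supervisor
```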
Step 14: Data Exploration
Navigate to the “Query” tab in the Druid web console to explore and query the ingested data.
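You can also run Druid SQL outside the console through the SQL API. A minimal sketch, with the datasource name sensor-data carried over from the earlier assumptions:

```bash
# Query the ingested datasource over Druid's SQL endpoint
curl -X POST -H 'Content-Type: application/json' \
  -d '{"query": "SELECT sensor_id, temp_F FROM \"sensor-data\" LIMIT 10"}' \
  http://localhost:8888/druid/v2/sql
```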
That’s it!
By following the 14 steps above, you will successfully deploy Apache Druid with Kafka for real-time data ingestion.
As a recap, here are a few important things to keep in mind when installing Druid:
- Ensure Python and Java are installed.
- Configure environment variables like DRUID_HOME and JAVA_HOME.
- Launch Druid with the correct command for your computational needs.
- Choose partitioning and segmentation criteria based on your data volume and velocity to avoid producing too many small segments or excessively large ones.
In a nutshell, Apache Druid is a powerful tool that helps businesses make better decisions using real-time data. It’s fast, scalable, and flexible, making it ideal for tasks like interactive analytics, operational monitoring, and personalized recommendations. With its ability to handle both historical and real-time data, Apache Druid is transforming how businesses use data to drive success.
Now, it’s time to unleash the power of Apache Druid and unlock the full potential of your data analytics workflows. Feel free to reach out to Nitor Infotech with your thoughts about this blog.
Till then, happy exploring!