Is Apache Druid the Future of Real-Time Analytics?


About the author

Rushikesh Pawar
Trainee Software Engineer
Rushikesh Pawar is a Trainee Software Engineer at Nitor Infotech. He is a passionate software engineer specializing in data engineering.

Big Data & Analytics | 22 May 2024 | 17 min read

Modern times demand real-time data insights at scale for organizations to remain competitive. Such insights not only enable organizations to make informed decisions swiftly but also empower them to adapt to rapidly changing market conditions. Traditional analytics solutions often struggle to analyse large data sets quickly and flexibly. Enter Apache Druid, a real-time analytics database designed for fast analysis of large data sets.

In this blog, you will learn everything about Apache Druid and how it compares to other analytics software.

Let’s get started!

What is Apache Druid and why should you use it?

Apache Druid is an open-source, column-oriented, distributed data store designed for real-time analytics. It is designed to handle large volumes of high-dimensional data and provide low-latency queries on that data. Druid was originally developed by Metamarkets, Inc. and later became an Apache Software Foundation project.

It is an excellent choice for use cases that demand fast query performance, high availability, and real-time ingestion, such as providing real-time dashboards, monitoring application performance, and analysing consumer behaviour.

Wondering why you should? Well, the answer lies below.

Apache Druid is a strong choice because of these key features:

  1. Columnar storage format: Druid uses column-oriented storage, loading only necessary columns for faster query performance and optimizing storage based on data type.
  2. Scalable distributed system: It can handle high data ingestion rates, retain large records, and maintain fast query response times in clusters with many machines.
  3. Massively parallel processing: It can handle concurrent queries across the entire cluster.
  4. Real-time or batch ingestion: It can ingest data in real time or in batches, making it available for querying almost instantly.
  5. Self-healing, self-balancing, easy to operate: It allows operators to easily scale up or down, automatically balancing and rerouting data in case of server failure. It can operate without interruption during updates or changes.
  6. Indexes for quick filtering: It creates indexes for fast filtering and searching across multiple columns.
  7. Cloud-native, fault-tolerant architecture: It stores a copy of data in deep storage for added security and availability, with replication ensuring queries during system recoveries.
  8. Time-based partitioning: It splits data into time segments, improving performance by accessing only relevant partitions for time-based queries.
  9. Approximate algorithms: It offers algorithms for approximate computations with low memory usage and faster speed, while also providing exact computations when accuracy is crucial.
  10. Automatic summarization at ingest time: It can summarize data during ingestion, resulting in cost savings and performance improvements by pre-aggregating data.
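Several of these features surface directly in Druid's SQL layer. The sketch below, assuming a hypothetical `clickstream` datasource with a `user_id` dimension and a broker or router listening on `localhost:8888`, shows how a time filter on `__time` lets Druid prune time partitions (feature 8) and how `APPROX_COUNT_DISTINCT` trades exactness for speed and low memory (feature 9) via the SQL HTTP API:

```python
import json
import urllib.request

# Hypothetical broker/router address -- adjust for your own cluster.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

def build_sql_request(sql: str) -> urllib.request.Request:
    """Wrap a Druid SQL statement in the JSON envelope the SQL API expects."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        DRUID_SQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )

# The WHERE clause on __time lets Druid read only the relevant time
# partitions; APPROX_COUNT_DISTINCT uses an approximate sketch instead of
# an exact (memory-hungry) distinct count.
sql = (
    "SELECT TIME_FLOOR(__time, 'PT1H') AS hr, "
    "APPROX_COUNT_DISTINCT(user_id) AS unique_users "
    "FROM clickstream "
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY "
    "GROUP BY 1 ORDER BY 1"
)
request = build_sql_request(sql)
# urllib.request.urlopen(request) would return JSON result rows
# from a live cluster.
```

When exactness matters, swapping `APPROX_COUNT_DISTINCT` for `COUNT(DISTINCT ...)` gives the exact computation the feature list mentions, at a higher memory cost.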

With such features, you can give your specific domain the edge it requires. Here are the specific use cases:

Use Cases of Apache Druid

Wish to learn more about these use cases? Stay tuned for my upcoming blog that will dive into these practical depths.


Now that you are well-versed with the basics, let me help you decode the architecture next to get a complete overview of how things happen.

Apache Druid Architecture

Here’s what the architecture looks like:

[Diagram: Apache Druid architecture]

1. Coordinator service: Manages data availability by handling segment management and distribution, including loading new segments, removing unnecessary ones, and balancing segments across nodes. It runs periodically, evaluating the cluster state before acting, and maintains connections to ZooKeeper and the metadata store.

2. Overlord service: Controls data ingestion workloads, receiving tasks, coordinating their distribution, establishing locks, and returning statuses. It processes a task queue by allocating tasks to middle manager nodes and offers a UI for monitoring job queues and accessing task logs.

3. Broker service: Acts as the gateway between external clients and the historical and real-time nodes. It receives queries, routes them based on segment location, merges the results, and optionally caches them.

4. Router service: Routes requests to brokers, coordinators, and overlords based on configuration, ensuring queries for significant data are not influenced by less significant data.

5. Historical service: Stores queryable data. It connects continuously to ZooKeeper and monitors paths for new segment data, without interacting directly with other processes.

6. Middle Manager service: The middle manager services handle data ingestion and execute submitted tasks. These tasks are delegated to peons, which run on separate JVMs (Java Virtual Machines) for resource and log isolation. Each peon can only handle one task at a time, but a middle manager can have multiple peons.
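To make the Overlord-to-peon flow concrete, here is a minimal sketch of a native batch ("index_parallel") ingestion spec. The datasource name, file paths, and column names are illustrative, not from the original post; on a real cluster you would POST this JSON to the Overlord's `/druid/indexer/v1/task` endpoint, which queues the task and hands it to a Middle Manager's peon. Note the `rollup` flag, which enables the ingest-time summarization mentioned in the feature list:

```python
import json

# A trimmed native batch ingestion spec (field values are examples).
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "local", "baseDir": "/data",
                            "filter": "events-*.json"},
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "device"]},
            # rollup=True pre-aggregates rows sharing a timestamp bucket
            # and dimension values -- "automatic summarization at ingest".
            "granularitySpec": {"segmentGranularity": "day",
                                "queryGranularity": "hour",
                                "rollup": True},
            "metricsSpec": [{"type": "count", "name": "events"}],
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

# This JSON body is what gets POSTed to the Overlord's task endpoint.
payload = json.dumps(ingestion_spec)
```

The Overlord's UI then lets you watch the task move through the queue, as described above.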

Next, let’s compare Apache Druid with other analytics platforms/software to help you navigate it more confidently.

Apache Druid vs ClickHouse

Refer to the following table to spot the differences:

| Aspect | Apache Druid | ClickHouse |
| --- | --- | --- |
| Architecture | Separates storage and compute for scalability and durability; flexible in cloud infrastructure; offers dedicated resources for multi-tenant environments; supports complex compute configurations with multiple nodes and node types. | Attempts to decouple storage and compute but does not inherently separate them; primarily on-premises, though it can be installed anywhere and offers cloud versions; inherently single-tenant but allows flexible tenancy in its cloud offerings; requires users to provision compute resources themselves (less abstraction, more control). |
| Scalability | Offers elasticity for scaling to larger data volumes and faster queries; no elasticity for higher concurrency on-premises, which requires manual scaling and migration to bigger/smaller clusters. | Lacks elasticity and requires manual scaling; higher concurrency on-premises likewise requires manual scaling and migration to bigger/smaller clusters. |
| Performance | Exceptional concurrency and rapid ingest; requires custom sizing and cluster tuning to balance compute, memory, and storage at high concurrency; managed Druid cloud services currently lack granular scalability. | Linearly scalable performance for certain types of queries; scaling is a manual, tedious process; manages hardware internally, so users must set up clusters and migrate in order to scale. |
| Storage & indexing | Compressed bitmap indexes for efficient data access and aggregations; columnar storage format with time-based sorting; restrictive time-based partitioning, though it can also partition on secondary columns. | A variety of index types, including primary, skipping, merge-tree, and join indexes; columnar storage in various formats; partitioning and MergeTree indexes. |
| Result & warm caching | Result caching on the broker, set to "off" by default; warm caching at a larger, segment-level granularity. | No result caching; warm caching at an indexed data-range granularity. |
| JSON support | Recommends flattening JSON or translating it to an array before loading; no JSON parsing at query runtime. | Supports JSON functions, including lambda expressions. |
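The JSON-handling difference is worth seeing side by side. Below is a hedged sketch, with illustrative field names: on the Druid side, a `flattenSpec` inside the JSON input format pulls nested fields into top-level columns at ingest time; on the ClickHouse side, the same nested field can be parsed at query time with a built-in JSON function:

```python
# Druid: nested JSON is flattened at ingest time via a flattenSpec
# (field names here are illustrative, not from the original post).
druid_input_format = {
    "type": "json",
    "flattenSpec": {
        "useFieldDiscovery": True,
        "fields": [
            # Pull "city" out of a nested "geo" object into a column.
            {"type": "path", "name": "geo_city", "expr": "$.geo.city"},
        ],
    },
}

# ClickHouse: the same nested field can be extracted at query runtime
# with a JSON function -- no ingest-time flattening required.
clickhouse_query = (
    "SELECT JSONExtractString(payload, 'geo', 'city') AS geo_city, count() "
    "FROM events GROUP BY geo_city"
)
```

In short, Druid pushes the JSON-shaping work to ingestion, while ClickHouse lets you defer it to query time.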

Onwards to another comparison!

Apache Druid vs Apache Pinot

Take a moment to read the descriptive pointers that highlight the major differences between Druid and Pinot:

Based on Architecture:

Apache Druid and Apache Pinot are both real-time analytics databases but differ in deployment models, storage hierarchies, and isolation capabilities.

Druid supports SaaS or self-managed deployments, requiring customers to manage configuration, scaling, and capacity planning. It serves queries from disk and an in-memory cache, using cloud storage or HDFS for deep storage. It runs ingestion and queries on the same node by default, though this can be configured for non-real-time data. Druid does not separate compute and storage, though Imply’s offering does. It also lacks full isolation with replication for multiple applications.

Apache Pinot, by contrast, supports PaaS or self-managed deployments. It uses hot storage plus a deep store for backup and restore operations, offering full isolation with replication for multiple applications. Pinot is a distributed real-time OLAP datastore that handles both batch and streaming input. Its architecture scales both vertically and horizontally, but it does not separate compute from storage as some other OLAP databases do.

Based on Ingestion:

Druid ships with pre-installed connectors that handle ingestion from popular sources. Unlike several of its rivals, Druid does not handle nested data, so data must be flattened upon ingest. Denormalization is also necessary during ingestion, which adds operational burden in some use cases.

Pinot supports high-performance ingestion from streaming data sources. Every table is either real-time or offline. Real-time tables have a short retention period and scale based on ingestion rate, while offline tables have a longer retention period and scale based on data volume. To add deep storage, you modify the controller and server configurations so that the segments that collectively make up a table are stored durably.
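The real-time/offline retention split can be sketched as a pair of table configs. This is a trimmed, illustrative example (table name, column name, and retention values are assumptions, not from the original post) of how a Pinot REALTIME table typically keeps only a short window of data while its OFFLINE counterpart retains history:

```python
# A trimmed Pinot REALTIME table config; values are illustrative.
realtime_table_config = {
    "tableName": "clickstream_REALTIME",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "timeColumnName": "ts",
        # Real-time tables keep a short window, sized to ingestion rate...
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "3",
        "replication": "2",
    },
    "tableIndexConfig": {"loadMode": "MMAP"},
}

# ...while the matching OFFLINE table retains history far longer,
# sized to total data volume.
offline_retention = {"retentionTimeUnit": "DAYS", "retentionTimeValue": "365"}
```

Both configs would be submitted to the Pinot controller, which is also where deep-storage settings live.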

Based on Performance:

Druid is optimized for append-only use cases, with segments becoming immutable once committed and published. It supports a bitmap index and achieves sub-second query latency on large datasets, using a columnar format partitioned by time. However, it lacks support for JOINs, and updates are only possible via batch jobs. Druid's performance is further enhanced by write-time aggregation and data denormalization.

Apache Pinot, on the other hand, treats all data as immutable by default, with upserts supported only for streaming ingest. It offers a variety of indexes, including a star-tree index, and can handle 50-1000ms queries on large datasets. Pinot stores data in a columnar format with additional indexes for fast filtering, aggregation, and group by operations. It supports 1-2 second ingest for streaming data. It can achieve sub-second query latency with high concurrency, but this requires considerable experience, management, and adjustment.

Based on Scalability:

Druid allows for manual adjustments of server sizes for vertical scaling, providing flexibility but requiring careful management. Pinot supports both vertical and horizontal scaling, with the ability to increase CPU and memory for each node and add more nodes to expand capacity.

However, capacity planning for Pinot is a complex, iterative process involving tuning across various parameters such as read and write QPS, streaming partitions, and data size. Druid users, meanwhile, face their own complex decisions about how to scale resources.

I’m confident that with these comparisons, you can make an informed decision based on your needs.

Feel free to reach out to us at Nitor Infotech to learn how big data and transformative analytical insights can help propel your business forward.
