Big Data | 08 Jan 2021 | 8 min

Real Time Analytics Using Spark With Cosmos DB Connector


How can you integrate Spark & Cosmos DB?

This blog explains how Spark and Cosmos DB can be integrated, allowing Spark to take full advantage of Cosmos DB and run real-time analytics directly on petabytes of operational data.

High-Level Architecture

With the Spark Connector for Azure Cosmos DB, data is processed in parallel across the Spark worker nodes and the Azure Cosmos DB data partitions. Whether your data is stored in tables, graphs, or documents, you get the performance, scalability, throughput, and consistency guarantees backed by Azure Cosmos DB.

Accelerated connectivity between Apache Spark and Azure Cosmos DB makes it easier to tackle fast-moving data science challenges, since data can be persisted and retrieved quickly through Azure Cosmos DB.

We can use the Cosmos DB Spark connector to run Spark jobs against data stored in Azure Cosmos DB.

Three kinds of Spark connectors are available for Cosmos DB:

  • SQL API Spark connector
  • MongoDB Spark connector
  • Cassandra Spark connector

Let’s create a Cosmos DB service using the SQL API and query its data from our existing Azure Databricks Spark cluster, using a Scala notebook and the Azure Cosmos DB Spark connector.

Prerequisites:

Before proceeding further, ensure you have the following resources:

  • An active Azure account
  • An Azure Cosmos DB account
  • A pre-configured Azure Databricks cluster

Step 1: Cosmos DB creation

First, we must create a Cosmos DB account. Under that account we can create multiple databases, and each database holds containers (collections) of JSON documents. The hierarchy maps neatly onto an RDBMS: the database is like an RDBMS database, a container is like a table, and an item is like a row.

Create a database named Items and a container named Container1 in the newly created database. We will use these names in our script.
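
If you prefer scripting this step over the Azure portal, the database and container can also be created programmatically. Below is a minimal sketch in Scala using the azure-cosmos Java SDK (v4), where the endpoint, key, and the /id partition key path are placeholders to adapt to your own account:

import com.azure.cosmos.CosmosClientBuilder

// Placeholder endpoint and key -- substitute your own account values.
val client = new CosmosClientBuilder()
  .endpoint("https://<your-account>.documents.azure.com:443/")
  .key("<your-master-key>")
  .buildClient()

// Create the database and container if they do not already exist.
client.createDatabaseIfNotExists("Items")
// "/id" is an assumed partition key path; pick one that suits your data.
client.getDatabase("Items").createContainerIfNotExists("Container1", "/id")
client.close()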

Step 2: Data Preparation

Data set: Please download the NYC Yellow Taxi Trip Data from https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD.

We will use the Azure Cosmos DB Data Migration Tool to load the data into Azure Cosmos DB. Download a pre-compiled binary from https://aka.ms/csdmtool, then run either of the following:

  • Dtui.exe: Graphical interface version of the tool
  • Dt.exe: Command-line version of the tool

We will use the graphical version, where we specify the source (the downloaded CSV) and destination (our Cosmos DB collection) details. Once the import completes, our data is available in Cosmos DB.
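
Alternatively, once the connector library from Step 4 is installed, the same CSV can be loaded by Spark itself and written through the connector. Here is a minimal sketch, assuming the CSV has been uploaded to a hypothetical DBFS path /tmp/yellow_taxi.csv:

import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.config.Config

// Read the taxi CSV into a DataFrame (hypothetical DBFS location).
val taxiDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/yellow_taxi.csv")

// Write configuration; "Upsert" avoids duplicate-id failures on re-runs.
val writeConfig = Config(Map(
  "Endpoint" -> "{URI of the Azure Cosmos DB account}",
  "Masterkey" -> "{Key used to access the account}",
  "Database" -> "Items",
  "Collection" -> "Container1",
  "Upsert" -> "true"))

taxiDF.write.mode("append").cosmosDB(writeConfig)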

Step 3: Azure Databricks Cluster Setup

Follow the steps in the Azure Databricks documentation to set up an Azure Databricks workspace and cluster: https://docs.azuredatabricks.net/getting-started/try-databricks.html

Step 4: Import the Cosmos DB connector library

  • Download the latest azure-cosmosdb-spark library for the Apache Spark version you are running (here, azure-cosmosdb-spark_2.4.0_2.11_1.3.5) from https://search.maven.org/artifact/com.microsoft.azure/azure-cosmosdb-spark_2.4.0_2.11/1.3.5/jar
  • Upload the downloaded JAR file to Databricks.
  • Install the uploaded library on your Databricks cluster.

Step 5: Using Databricks notebooks

We will create the notebook in Scala; Azure Databricks also supports Python, R, and SQL.

1. Import the required libraries into our notebook using the following commands:

import org.joda.time._
import org.joda.time.format._
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.functions._

2. Create the Cosmos DB configuration in our notebook:

val configMap = Map(
  "Endpoint" -> "{URI of the Azure Cosmos DB account}",
  "Masterkey" -> "{Key used to access the account}",
  "Database" -> "{Database name}",
  "Collection" -> "{Collection name}")
val config = Config(configMap)
3. DataFrame creation: let’s define a DataFrame variable df and use the configuration above to read the Cosmos DB data:

val df = spark.sqlContext.read.cosmosDB(config)
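
If only a subset of the collection is needed, the connector can push a query down to Cosmos DB so that only matching documents are transferred. Here is a minimal sketch using the connector’s query_custom configuration key (the PICKUP_LATITUDE property name is an assumption that should match your imported documents):

// Reuse configMap, overriding it with a server-side query.
val filteredConfig = Config(configMap + (
  "query_custom" -> "SELECT * FROM c WHERE c.PICKUP_LATITUDE > 40"))
val filteredDF = spark.sqlContext.read.cosmosDB(filteredConfig)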

4. Create a temporary view from the DataFrame, which we can then query as a table in Spark SQL:

df.createOrReplaceTempView("container1")

5. Once connected, you can work with the DataFrame directly (in the preceding example, this is df, exposed to Spark SQL as container1).

For example, the following command prints the schema of our DataFrame:

df.printSchema()
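
A couple of other quick sanity checks work the same way on the DataFrame:

// Row count and a peek at the first few documents.
println(df.count())
df.show(5)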

6. Run a Spark SQL query against your Azure Cosmos DB table. The query below uses the spherical law of cosines (3959 is the Earth’s radius in miles) to compute each trip’s pickup distance from a fixed point at latitude 42.290763, longitude -71.35368:

val sql1 = spark.sql("""
  SELECT (3959 * ACOS(COS(RADIANS(42.290763)) * COS(RADIANS(PICKUP_LATITUDE)) *
                      COS(RADIANS(PICKUP_LONGITUDE) - RADIANS(-71.35368)) +
                      SIN(RADIANS(42.290763)) * SIN(RADIANS(PICKUP_LATITUDE)))) AS Distance
  FROM container1
""")
sql1.show()

And voila! You’re all set. You have enabled Spark to make the most of Cosmos DB and run real-time analytics using operational data.

Reach out to us at Nitor Infotech to learn more about how easy it is to integrate Spark with Cosmos DB and how it can be extremely advantageous for you to do so.
