Data Processing with Apache Spark | Nitor Infotech
Big Data | 22 Jan 2021 |   8 min

Data Processing with Apache Spark


Spark has emerged as a favorite for analytics because it can handle massive volumes of data while delivering high performance compared to conventional database engines. Spark SQL lets users express complex business requirements to Spark in the familiar language of SQL. In this blog, we will see how to process data with Apache Spark, and what better way to establish Spark's capabilities than to put it through its paces: we will use the Hadoop-DS benchmark to compare performance, throughput, and SQL compatibility against SQL Server.

Before we begin, ensure that the following test environment is available:

  • SQL Server: 32 GB RAM, Windows Server 2012 R2
  • Hadoop cluster: 2 machines, each with 8 GB RAM, running Ubuntu

Sample data: For the purpose of this demo, we will use the AdventureWorks2016DW dataset. The following tables are used in the query, with their record counts:

Table Name              No. of Records
FactInternetSales       60,458,398
DimProduct              606
DimProductSubcategory   37
DimProductCategory      4
DimCustomer             18,484

We will compare the performance of three data processing setups: SQL Server, Spark with CSV files as data files, and Spark with Parquet files as data files.

Query: We will use the following query to process the data:

```sql
select pc.EnglishProductCategoryName,
       ps.EnglishProductSubcategoryName,
       sum(SalesAmount)
from FactInternetSales f
inner join DimProduct p on f.ProductKey = p.ProductKey
inner join DimProductSubcategory ps on p.ProductSubcategoryKey = ps.ProductSubcategoryKey
inner join DimProductCategory pc on pc.ProductCategoryKey = ps.ProductCategoryKey
inner join DimCustomer c on c.CustomerKey = f.CustomerKey
group by pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName
```

Let's measure the performance of each processing engine:

  • SQL Server:

Running the query in SQL Server on the 32 GB RAM Windows Server 2012 R2 machine takes around 2.33 minutes to execute and return the data.

  • Spark with CSV data files:

Now let's export the same dataset to CSV files and move them to HDFS, where they will serve as the input source.

Now that the files for the input tables are in HDFS as CSV files, we can start the Spark shell and create a DataFrame for each source file. Run the following commands to create an SQLContext:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}

val sqlContext = new SQLContext(sc)
```

Run the following command to create the fact schema:
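The original post showed this command as a screenshot, which is missing here. Below is a minimal sketch of what it likely contained, with field names drawn from the AdventureWorksDW fact table; a real schema must list every column of the CSV export, in file order.

```scala
import org.apache.spark.sql.types._

// Sketch: only the columns the benchmark query touches are shown.
val factSchema = StructType(Seq(
  StructField("ProductKey",  IntegerType, nullable = false),
  StructField("CustomerKey", IntegerType, nullable = false),
  StructField("SalesAmount", DoubleType,  nullable = true)
))
```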

Run the following command to create a DataFrame for sales with the fact schema:
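The screenshot for this step is also missing; here is a hedged sketch, assuming the fact schema from the previous step is bound to a value named `factSchema` and the CSV landed at a path like the one below (adjust to your HDFS layout). The name `salesCSV` matches the one used later when writing the Parquet file.

```scala
// Read the fact CSV from HDFS with an explicit schema.
val salesCSV = sqlContext.read
  .schema(factSchema)
  .option("header", "true")
  .csv("/user/nituser/FactInternetSales.csv")  // hypothetical path
```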

Run the following command to create the customer schema:
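Again the original command is missing; a sketch with a hypothetical column subset (the join only needs CustomerKey):

```scala
// Sketch of the customer schema; real export needs every column.
val customerSchema = StructType(Seq(
  StructField("CustomerKey", IntegerType, nullable = false),
  StructField("FirstName",   StringType,  nullable = true),
  StructField("LastName",    StringType,  nullable = true)
))
```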

Run the following command to create a customer DataFrame with the customer schema:
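A sketch of the missing command, assuming the schema from the previous step is named `customerSchema`:

```scala
val customer = sqlContext.read
  .schema(customerSchema)
  .option("header", "true")
  .csv("/user/nituser/DimCustomer.csv")  // hypothetical path
```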

Now create the product schema with the following command:
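The command itself is missing from this page; it likely looked something like this (hypothetical column subset):

```scala
val productSchema = StructType(Seq(
  StructField("ProductKey",            IntegerType, nullable = false),
  StructField("ProductSubcategoryKey", IntegerType, nullable = true),
  StructField("EnglishProductName",    StringType,  nullable = true)
))
```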

Create a product DataFrame with the product schema:
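A sketch of the missing command, assuming the product schema is named `productSchema`:

```scala
val product = sqlContext.read
  .schema(productSchema)
  .option("header", "true")
  .csv("/user/nituser/DimProduct.csv")  // hypothetical path
```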

Now create the product category schema using the following command:
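A sketch of the likely schema; the two columns below are the only ones the query needs:

```scala
val productCategorySchema = StructType(Seq(
  StructField("ProductCategoryKey",         IntegerType, nullable = false),
  StructField("EnglishProductCategoryName", StringType,  nullable = true)
))
```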

Now create a product category DataFrame with the product category schema:
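Assuming the schema from the previous step is named `productCategorySchema`, the missing command was presumably along these lines:

```scala
val productCategory = sqlContext.read
  .schema(productCategorySchema)
  .option("header", "true")
  .csv("/user/nituser/DimProductCategory.csv")  // hypothetical path
```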

Now create the product subcategory schema using the following command:
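A sketch of the likely schema, covering the keys and the name column used by the query:

```scala
val productSubCategorySchema = StructType(Seq(
  StructField("ProductSubcategoryKey",         IntegerType, nullable = false),
  StructField("ProductCategoryKey",            IntegerType, nullable = false),
  StructField("EnglishProductSubcategoryName", StringType,  nullable = true)
))
```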

And create a product subcategory DataFrame using the command below:
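Assuming the schema from the previous step is named `productSubCategorySchema`:

```scala
val productSubCategory = sqlContext.read
  .schema(productSubCategorySchema)
  .option("header", "true")
  .csv("/user/nituser/DimProductSubcategory.csv")  // hypothetical path
```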

Now create temporary views for each DataFrame that we have created so far:
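The view names below are the ones the benchmark query expects; the DataFrame names on the left are hypothetical, so substitute whatever names you used when reading the CSV files:

```scala
salesCSV.createOrReplaceTempView("salesV")
customer.createOrReplaceTempView("customerV")
product.createOrReplaceTempView("productV")
productCategory.createOrReplaceTempView("productCategoryV")
productSubCategory.createOrReplaceTempView("productSubCategoryV")
```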

Running the benchmark query against these CSV-backed views took around 3 minutes to return the result set.

  • Spark with a Parquet file for the fact table:

Now, let's convert the fact table data to Parquet and save it to HDFS using the following command:

```scala
salesCSV.write.format("parquet").save("sales_parquet")
```

Create a DataFrame on top of the Parquet file using the command below:

```scala
val sales = sqlContext.read.parquet("/user/nituser/sales.parquet")
```

And create a temp view from the sales DataFrame:

```scala
sales.createOrReplaceTempView("salesV")
```

Now, we will run the same query that we used in step 2:

```scala
val df_1 = spark.sql("""select pc.EnglishProductCategoryName,
       ps.EnglishProductSubcategoryName,
       sum(SalesAmount)
from salesV f
inner join productV p on f.ProductKey = p.ProductKey
inner join productSubCategoryV ps on p.ProductSubcategoryKey = ps.ProductSubcategoryKey
inner join productCategoryV pc on pc.ProductCategoryKey = ps.ProductCategoryKey
inner join customerV c on c.CustomerKey = f.CustomerKey
group by pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName""")
```

It returns the same result set in less than 20 seconds.

We can conclude that Spark on commodity hardware performs very similarly to the high-end SQL Server machine. Moreover, Spark outshines the other setups when it works with an efficient, compressed, column-oriented storage format such as Parquet. So, decide the specifications of the processing engine and the storage format based on your business requirements, while understanding how to harness the power of such a highly efficient processing engine to get the required performance. Reach out to us at Nitor Infotech to know more about Apache Spark and how you can utilize it to accelerate your business and make advanced analytics more innovative.


Nitor Infotech Blog

Nitor Infotech is a leading software product development firm serving ISVs and enterprise customers globally.

   
