Spark has emerged as a favorite for analytics because it can handle massive volumes of data while delivering high performance compared to conventional database engines. Spark SQL allows users to express their complex business requirements to Spark using the familiar language of SQL. So, in this blog, we will see how you can process data with Apache Spark, and what better way to establish the capabilities of Spark than to put it through its paces and use the Hadoop-DS benchmark to compare performance, throughput, and SQL compatibility against SQL Server.

Before we begin, ensure that the following test environment is available:

- SQL Server: 32 GB RAM, Windows Server 2012 R2
- Hadoop cluster: 2 machines with 8 GB RAM each, running Ubuntu

Sample data: For the purpose of this demo, we will use the AdventureWorks2016DW data. The following tables are used in the query, along with their record counts:
| Table Name | No. of Records |
| --- | --- |
| FactInternetSales | 60458398 |
| DimProduct | 606 |
| DimProductSubcategory | 37 |
| DimProductCategory | 4 |
| DimCustomer | 18484 |
We will compare the performance of three data processing engines: SQL Server, Spark with CSV files as data files, and Spark with Parquet files as data files.

Query: We will use the following query to process the data:

    select pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName, sum(SalesAmount)
    from FactInternetSales f
    inner join DimProduct p on f.ProductKey = p.ProductKey
    inner join DimProductSubcategory ps on p.ProductSubcategoryKey = ps.ProductSubcategoryKey
    inner join DimProductCategory pc on pc.ProductCategoryKey = ps.ProductCategoryKey
    inner join DimCustomer c on c.CustomerKey = f.CustomerKey
    group by pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName

Let’s measure the performance of each processing engine:
- SQL Server:
Running the query in SQL Server on the 32 GB RAM Windows Server 2012 R2 machine takes around 2.33 minutes to execute and return the data. Following is the screenshot for the same:
- Spark with CSV data files:
Now let’s export the same dataset to CSV files and move them to HDFS. Following is the screenshot of HDFS with the CSV files as the input source. A sketch of how these files can be loaded and queried from the Spark shell is shown below.
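The exact Spark shell commands for this step are not shown above, but a minimal sketch of loading the CSV files and registering the temp views that the query expects might look like the following. The HDFS paths and read options are assumptions for illustration; the view names (salesV, productV, productSubCategoryV, productCategoryV, customerV) match the ones referenced by the query in the Parquet step.

```scala
// Sketch only: load the exported CSV files from HDFS into DataFrames.
// The paths and header/schema options are assumptions, not taken from the original post.
val salesCSV = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/user/nituser/FactInternetSales.csv")
val product = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/user/nituser/DimProduct.csv")
val productSubCategory = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/user/nituser/DimProductSubcategory.csv")
val productCategory = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/user/nituser/DimProductCategory.csv")
val customer = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/user/nituser/DimCustomer.csv")

// Register temp views so the same SQL text can be executed through Spark SQL.
salesCSV.createOrReplaceTempView("salesV")
product.createOrReplaceTempView("productV")
productSubCategory.createOrReplaceTempView("productSubCategoryV")
productCategory.createOrReplaceTempView("productCategoryV")
customer.createOrReplaceTempView("customerV")

// Run the same aggregation query against the CSV-backed views.
val df_csv = spark.sql("""
  select pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName, sum(SalesAmount)
  from salesV f
  inner join productV p on f.ProductKey = p.ProductKey
  inner join productSubCategoryV ps on p.ProductSubcategoryKey = ps.ProductSubcategoryKey
  inner join productCategoryV pc on pc.ProductCategoryKey = ps.ProductCategoryKey
  inner join customerV c on c.CustomerKey = f.CustomerKey
  group by pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName
""")
df_csv.show()
```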
- Spark with Parquet file for Fact Table:
Now, let’s convert the FactInternetSaleNew file to a Parquet file and save it to HDFS using the following command:

    salesCSV.write.format("parquet").save("sales_parquet")

Create a dataframe on top of the Parquet file using the below command:

    val sales = sqlContext.read.parquet("/user/nituser/sales.parquet")

And create a temp view using the sales data frame:

    sales.createOrReplaceTempView("salesV")

Now, we will run the same query which we used in step 2:

    val df_1 = spark.sql("""select pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName, sum(SalesAmount)
      from salesV f
      inner join productV p on f.ProductKey = p.ProductKey
      inner join productSubCategoryV ps on p.ProductSubcategoryKey = ps.ProductSubcategoryKey
      inner join productCategoryV pc on pc.ProductCategoryKey = ps.ProductCategoryKey
      inner join customerV c on c.CustomerKey = f.CustomerKey
      group by pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName""")

It will return the same result set in less than 20 seconds.
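To inspect the result and get a rough wall clock measurement in the Spark shell, one option (our addition, not part of the original steps) is to wrap an action in spark.time, which is available on SparkSession from Spark 2.1 onwards:

```scala
// Trigger the lazy query and print the elapsed time; show() forces execution.
spark.time(df_1.show())
```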