Yash Patidar
Senior Software Engineer

Machine learning has revolutionized the way we analyze data and make predictions across various industries over the past few years. However, what if you could harness the power of machine learning directly within your SQL queries? Enter BigQuery ML, a significant innovation that seamlessly integrates machine learning capabilities into Google BigQuery.

In this blog, I’ll help you learn about BigQuery ML, its architecture, and key features. Additionally, you’ll delve into the details of how it works with a real-life example.

So, let’s get started with the basics!

What is BigQuery ML?

BigQuery ML is a capability of Google BigQuery that lets you create, train, evaluate, and use machine learning models with standard SQL, without moving your data out of the data warehouse.

Now let’s get to know its key features and explore how it is reshaping the landscape of data analysis and predictive modeling.

Key Features of BigQuery ML

1. Familiar SQL Interface: BigQuery ML integrates smoothly with SQL, enabling data analysts and engineers to leverage their existing SQL skills for machine learning tasks. With this integration, you can effortlessly develop, train, and evaluate ML models right within BigQuery.

2. Wide Range of Algorithms: It supports several types of machine learning models, including regression, classification, clustering, collaborative filtering, and time-series forecasting. Each model serves a unique purpose depending on your data and the problem you’re trying to solve.

3. Scalability and Performance: Built on Google’s infrastructure, it can handle massive datasets with ease. It automatically scales to accommodate growing data volumes, ensuring fast and efficient model training even with petabytes of data.

4. Automatic Feature Engineering: It simplifies the feature engineering process by automatically generating features from raw data. This includes tasks like normalizing numerical features and encoding categorical variables, which speeds up model development and streamlines data preparation.

5. Easy Model Evaluation and Deployment: From AI (Artificial Intelligence) model training to evaluation and deployment, it provides a seamless end-to-end workflow within the BigQuery environment. Users can easily evaluate model performance and deploy trained models for inference, integrating predictive analytics seamlessly into their existing workflows.
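
To give you a flavor of this SQL-first workflow, here is a minimal, hypothetical sketch (the dataset, table, and column names below are illustrative placeholders, not part of any real project) that trains a model and then uses it for prediction entirely in SQL:

-- Train a logistic regression model directly from a SQL query
-- (`mydataset.my_table` and the column names are placeholders).
CREATE OR REPLACE MODEL `mydataset.sample_model`
  OPTIONS(model_type='logistic_reg') AS
SELECT
  feature_1,
  feature_2,
  outcome AS label
FROM
  `mydataset.my_table`;

-- Use the trained model for prediction.
SELECT
  *
FROM
  ML.PREDICT(MODEL `mydataset.sample_model`,
    (SELECT feature_1, feature_2 FROM `mydataset.new_data`));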


Now that you’re familiar with the basics, let me walk you through BigQuery’s architecture before jumping into the example.

Architecture of BigQuery

BigQuery’s architecture efficiently handles large-scale data processing to meet diverse analytical needs.

  • At its core is Colossus, Google’s distributed file system, which provides robust storage infrastructure capable of managing petabytes of data across thousands of machines, ensuring high availability and fault tolerance.
  • Another key component is Dremel. This is a powerful query engine optimized for interactive ad-hoc queries. Using a columnar storage format and a distributed execution model, Dremel allows for efficient execution of complex SQL-like queries, delivering rapid insights even on vast datasets.

BigQuery integrates with Jupyter notebooks, offering a familiar environment for data exploration and analysis. This promotes collaboration and experimentation.

With its serverless model and automatic scaling capabilities, it abstracts the complexities of infrastructure management, allowing users to focus on extracting valuable insights from their data.

Fig: Architecture of BigQuery

Now, let’s get into the nitty-gritty with an example.

To embark on your machine learning journey with BigQuery ML, all you need is a Google Cloud Platform account and access to BigQuery.

Enabling BigQuery ML in your Google Cloud project is a straightforward process. You can do it through the Google Cloud Console or use the bq command-line tool.

Building a Classification Model for Predicting Survivors of the Titanic Disaster

Picture the Titanic disaster, when the great ship sank on its maiden voyage in 1912.

This approach doesn’t aim to retell the events of the sinking. Instead, it focuses on analyzing historical data about the passengers. By considering factors like age, class, and other relevant information, we can estimate each passenger’s likelihood of survival.

First, we need to create a dataset in BigQuery named Titanic and import the downloaded Titanic CSV file as a table in the Titanic dataset.
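
If you first upload the CSV to a Cloud Storage bucket, the import can also be done in SQL with a LOAD DATA statement. Here is a minimal sketch, in which the bucket path is a hypothetical placeholder:

-- Load the Kaggle training CSV (staged in Cloud Storage) into the Titanic dataset.
-- gs://your-bucket/train.csv is a placeholder; point it at your own bucket.
LOAD DATA OVERWRITE `titanic-421418.Titanic.data`
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://your-bucket/train.csv']);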

To know more about the dataset, you can visit Kaggle’s website.

Note: We have created a new project named ‘titanic-421418’ for this exercise, so don’t get confused when you see this project name referenced in the tables and models. Also, for the exploratory data analysis, we have used the training dataset only. The code for splitting the dataset into train and test sets is included later in this blog.

Step 1: Understanding the Variables

We will start by understanding the variables in our dataset. The Titanic dataset includes both numerical and categorical variables. Here are the variables in each category:

  • Numerical Variables: PassengerId, Survived, Pclass, Age, SibSp, Parch, and Fare
  • Categorical Variables: Name, Sex, Ticket, Cabin, and Embarked

Step 2: Data Analysis

Next, we’ll analyze the data based on three parameters.

a. Follow this code to get insights about the data according to gender:
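
The query below is a minimal sketch of such an analysis, assuming the imported training table is `titanic-421418.Titanic.data` (the table name used later in this blog); it groups passengers by gender and compares their survival rates:

-- Survival statistics grouped by gender
SELECT
  Sex,
  COUNT(*) AS total_passengers,
  SUM(Survived) AS survivors,
  ROUND(AVG(Survived), 2) AS survival_rate
FROM
  `titanic-421418.Titanic.data`
GROUP BY
  Sex
ORDER BY
  survival_rate DESC;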

Fig: Data according to gender

Conclusion from the analysis – Female passengers have a higher likelihood of survival compared to males.

b. Follow this code to get insights about the data according to class:
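
Similarly, a sketch of a per-class query (under the same assumption about the source table) could look like this:

-- Survival statistics grouped by passenger class
SELECT
  Pclass,
  COUNT(*) AS total_passengers,
  SUM(Survived) AS survivors,
  ROUND(AVG(Survived), 2) AS survival_rate
FROM
  `titanic-421418.Titanic.data`
GROUP BY
  Pclass
ORDER BY
  Pclass;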

Fig: Data according to class

Conclusion from the analysis – Passengers of higher class (PClass 1) are more likely to survive.

c. Follow this code to get insights about the data according to age:
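
A sketch of an age-oriented query (again assuming the same source table) that relates average age to passenger class and survival could look like this:

-- Average age and survival rate per passenger class
SELECT
  Pclass,
  ROUND(AVG(Age), 1) AS avg_age,
  ROUND(AVG(Survived), 2) AS survival_rate
FROM
  `titanic-421418.Titanic.data`
WHERE
  Age IS NOT NULL
GROUP BY
  Pclass
ORDER BY
  Pclass;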

Fig: Data according to age

Conclusion from the analysis – Elderly individuals are more likely to travel in higher classes.

Step 3: Data Cleaning

It’s time to clean the data to ensure we have the most relevant and accurate information. We need to follow these steps to clean the data:

  • Dropping Irrelevant Columns: We’ll drop high cardinality columns like PassengerId and irrelevant columns like Name and Ticket as they add no value to our prediction efforts.
  • Combining Related Features: SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard) will be combined into a single column named Number_of_Relatives, which gives a more meaningful representation of family size.
  • Dropping Fare and Cabin Columns: Since the Fare information is largely captured by the Pclass column, and the Cabin column has many null values with no logical way of handling them, we’ll drop both of these columns.

Step 4: Encoding Categorical Variables

Now, we’ll encode the remaining categorical variables, like “Sex” and “Embarked”, to numerical values to make them suitable for modeling.

Step 5: Handling Missing Values

In this step, we’ll replace null values in the Age column with the average age of the Pclass that the passenger belongs to. This will ensure that we keep the integrity of the data while handling missing values effectively.

Step 6: Model Training and Evaluation

Here’s a breakdown of the code, from table creation through model evaluation, for both the training and test data:

Creating a Temporary Table

First, we will create a temporary table with a random “is_training” flag to split the data into train and test sets. We will follow this code:

DECLARE split_ratio FLOAT64 DEFAULT 0.8;

-- Flag roughly 80% of the rows as training data
CREATE OR REPLACE TABLE `titanic-421418.Titanic.training_or_holdout` AS (
  SELECT
    *,
    RAND() < split_ratio AS is_training
  FROM
    `titanic-421418.Titanic.data`);

-- Training split
CREATE OR REPLACE TABLE `titanic-421418.Titanic.train` AS
SELECT
  *
FROM
  Titanic.training_or_holdout
WHERE
  is_training;

-- Test (holdout) split
CREATE OR REPLACE TABLE `titanic-421418.Titanic.test` AS
SELECT
  *
FROM
  Titanic.training_or_holdout
WHERE
  NOT is_training;

Note: The Titanic dataset initially contained 891 records, split into training and test sets in roughly an 80:20 ratio (the RAND()-based split is approximate rather than exact).

a. Cleaning the Training Data

Now, we need to clean the training data. To do so, we will use this code:

CREATE OR REPLACE TABLE `titanic-421418.Titanic.train_clean` AS
SELECT
  PassengerId,
  -- Replace missing ages with the average age of the passenger's class
  ROUND(IFNULL(Age, AVG(Age) OVER (PARTITION BY Pclass)), 2) AS Age,
  -- Encode Sex as 0 (male) / 1 (female)
  CASE
    WHEN Sex = 'male' THEN 0
    WHEN Sex = 'female' THEN 1
    ELSE NULL
  END AS Sex,
  -- Encode port of embarkation as 0 (C) / 1 (Q) / 2 (S)
  CASE
    WHEN Embarked = 'C' THEN 0
    WHEN Embarked = 'Q' THEN 1
    WHEN Embarked = 'S' THEN 2
    ELSE NULL
  END AS Embarked,
  Pclass,
  (SibSp + Parch) AS Number_of_Relatives,
  Survived
FROM
  `titanic-421418.Titanic.train`;

b. Creating the Model

To create the model, we will follow this code:

CREATE OR REPLACE MODEL `titanic-421418.Titanic.log_reg`
  OPTIONS(model_type='logistic_reg') AS
SELECT
  Age,
  Sex,
  Embarked,
  Pclass,
  Number_of_Relatives,
  Survived AS label
FROM
  `titanic-421418.Titanic.train_clean`;

c. Evaluation Stage

After creating the model, we will evaluate it using this code:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `titanic-421418.Titanic.log_reg`);

In the preceding steps, we successfully trained the model. Following the training, BigQuery automatically computes the evaluation metrics for the model.

Here are the results:

Fig: Evaluation metrics of the trained model

Next, we’ll use new data reserved for testing the model to make predictions. To do this, we’ll apply the same preprocessing steps used for the training data.

a. Cleaning the Test Data

We first need to clean the test data. To do so, we will use this code:

CREATE OR REPLACE TABLE `titanic-421418.Titanic.test_clean` AS
SELECT
  PassengerId,
  ROUND(IFNULL(Age, AVG(Age) OVER (PARTITION BY Pclass)), 2) AS Age,
  CASE
    WHEN Sex = 'male' THEN 0
    WHEN Sex = 'female' THEN 1
    ELSE NULL
  END AS Sex,
  CASE
    WHEN Embarked = 'C' THEN 0
    WHEN Embarked = 'Q' THEN 1
    WHEN Embarked = 'S' THEN 2
    ELSE NULL
  END AS Embarked,
  Pclass,
  (SibSp + Parch) AS Number_of_Relatives
FROM
  `titanic-421418.Titanic.test`;

b. Prediction

Next, we’ll use this code to generate predictions from the model:

SELECT
  *
FROM
  ML.PREDICT(MODEL `titanic-421418.Titanic.log_reg`,
    (
    SELECT
      PassengerId,
      Age,
      Sex,
      Embarked,
      Pclass,
      Number_of_Relatives
    FROM
      `titanic-421418.Titanic.test_clean`));

c. Confusion Matrix

To compare the predicted labels against the actual outcomes and build a confusion matrix, use this code:

-- Join predictions with the actual outcomes from the test table
CREATE OR REPLACE TABLE Titanic.test_predicted AS
SELECT
  A.PassengerId,
  B.Survived AS actual,
  A.predicted_label AS predicted
FROM (
  SELECT
    *
  FROM
    ML.PREDICT(MODEL `titanic-421418.Titanic.log_reg`,
      (
      SELECT
        PassengerId,
        Age,
        Sex,
        Embarked,
        Pclass,
        Number_of_Relatives
      FROM
        `titanic-421418.Titanic.test_clean`))) A
INNER JOIN
  `titanic-421418.Titanic.test` B
ON
  A.PassengerId = B.PassengerId;

-- Confusion matrix: counts per (actual, predicted) combination
SELECT
  actual,
  predicted,
  COUNT(*) AS count
FROM
  Titanic.test_predicted
GROUP BY
  actual,
  predicted
ORDER BY
  actual,
  predicted;

Fig: Confusion matrix for the test data

This is how we can calculate the accuracy of the trained model using predictions on the test dataset.

Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives) = (51 + 97) / (51 + 16 + 97 + 17) = 148 / 181 ≈ 0.82
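
The same accuracy can also be computed directly in SQL from the predictions table created above:

-- Overall accuracy on the test set
SELECT
  ROUND(COUNTIF(actual = predicted) / COUNT(*), 4) AS accuracy
FROM
  Titanic.test_predicted;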

So, it can be concluded that the model’s accuracy on the test set is very close to the accuracy reported during the evaluation stage.

In a nutshell, whether it’s predicting survival rates or tackling complex business problems, BigQuery ML makes machine learning accessible and intuitive. So, dive in and unlock the potential of BigQuery ML for your data analysis needs.

Feel free to reach out to us at Nitor Infotech for your diverse analytical requirements and to scale your business faster.

