Just in case the term is new to you, ETL comes from data warehousing and stands for Extract-Transform-Load. It covers the process of loading data from multiple source systems into the data warehouse.
In this blog, I’d like to share what I know about:
- The importance of ETL testing
- Its approach
- Types
- Methodologies
- Bugs
- Best practices
- Key challenges
- Tools
Let’s start!
Importance of ETL testing
ETL testing ensures that data extracted from multiple systems is complete, transformed as per business rules, and loaded accurately in a data warehouse.
- Intelligent business decisions
Verification and validation of business rules ensure that accurate data is populated on reports, which helps a business make intelligent decisions.
- Timely availability of data
The timely availability of accurate and recent data helps in making the right decisions on time. Historical as well as the latest data should be available as per requirements.
- Performance of data processing
Data is stored in denormalized form using a star or snowflake schema at a granular level, which ensures that large amounts of data can be processed quickly.
- Importance of data security
Row-level security verification ensures that only the required information is exposed to an authorized user.
Let’s now turn to the approach.
Approach of ETL testing
ETL testing mainly focuses on the quality of data which is getting loaded into multiple data layers from the source. The process is like this:
Source -> Staging -> ODS -> Data Mart
Here are the main categories of ETL testing:
- Data accuracy: Ensure that the data is loaded accurately as per transformation business rules and is as expected.
- Data completeness: Verify that all the expected data is loaded.
- Data integrity: Ensure that the data relations are maintained as expected.
Types of ETL testing
1. Metadata testing
Ensure that the table names, column names, data types, data lengths, constraints, indexes, and relationships for all source and target tables are as per the mapping document.
Example: In MS SQL, we can verify this by using INFORMATION_SCHEMA which contains TABLES, COLUMNS, and TABLE_CONSTRAINTS.
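For instance, here is a minimal sketch of such a check; DimCustomer is a hypothetical target table name:

```sql
-- Pull column metadata for a target table and compare the result
-- set against the mapping document (DimCustomer is hypothetical).
SELECT COLUMN_NAME,
       DATA_TYPE,
       CHARACTER_MAXIMUM_LENGTH,
       IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'DimCustomer'
ORDER BY ORDINAL_POSITION;
```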
2. Integrity testing
Ensure that the relationship between parent and child tables is maintained as expected after every data refresh: data present in the child table as a foreign key should map to the parent table's primary key.
Example: Check the count of unmatched rows in the foreign key table by running queries on source and target data.
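As a rough sketch, an orphan row check could look like this (FactOrders and DimCustomer are hypothetical table names):

```sql
-- Count child rows whose foreign key has no matching parent row.
-- A result greater than zero indicates a referential integrity bug.
SELECT COUNT(*) AS orphan_rows
FROM FactOrders AS f
LEFT JOIN DimCustomer AS d
       ON f.CustomerKey = d.CustomerKey
WHERE d.CustomerKey IS NULL;
```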
3. Data comparison testing
Design complex SQL queries to compare data from source to target tables. As the source data may be present in heterogeneous tables or CSV files, we need to design an ETL package to fetch all the source data and load it as-is into combined source tables. These can then be compared effectively with the target tables.
For a large amount of data, we can compare unique or NULL values between key columns and use aggregate functions like MIN, MAX, and AVG to compare aggregation logic.
Example: We can verify this by using EXCEPT queries. In the case of a huge amount of data, we can focus on comparing the most important columns.
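Here is a minimal sketch of both techniques, assuming hypothetical source_customers, DimCustomer, and source_orders tables:

```sql
-- Rows present in the source but missing (or different) in the target.
SELECT CustomerID, FirstName, LastName FROM source_customers
EXCEPT
SELECT CustomerID, FirstName, LastName FROM DimCustomer;

-- For very large tables, compare aggregates on key columns instead;
-- run the same query against source and target and diff the results.
SELECT COUNT(*)    AS row_count,
       MIN(Amount) AS min_amount,
       MAX(Amount) AS max_amount,
       AVG(Amount) AS avg_amount
FROM source_orders;
```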
4. Data transformation testing
Ensure that the data loaded into the target system follows the business transformation rules. For each transformation rule, develop SQL queries on the source tables and compare the results with the data mart table measures. For effective test coverage, we can add permutations and combinations of source data to cover all scenarios.
Example: Verify business rules/data validation rules by developing SQL queries.
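For instance, a sketch for a hypothetical rule that derives TotalPrice as Quantity * UnitPrice (source_orders and FactSales are assumed names):

```sql
-- Any row returned here is a transformation defect:
-- the target measure does not match the rule applied to source data.
SELECT s.OrderID,
       s.Quantity * s.UnitPrice AS expected_total,
       t.TotalPrice             AS actual_total
FROM source_orders AS s
JOIN FactSales AS t
  ON t.OrderID = s.OrderID
WHERE t.TotalPrice <> s.Quantity * s.UnitPrice;
```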
5. Data quality testing
Ensure that only the expected data is loaded into target tables, and that unwanted data is eliminated or replaced with valid data as per business requirements.
Example: NULL key values, duplicate records in source data, and invalid data types in fields (e.g., alphabetic characters in a decimal field).
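Here are a few sketches of such probes against a hypothetical source_customers table (TRY_CAST is SQL Server syntax):

```sql
-- NULL key values.
SELECT COUNT(*) AS null_keys
FROM source_customers
WHERE CustomerID IS NULL;

-- Duplicate records on the business key.
SELECT CustomerID, COUNT(*) AS dup_count
FROM source_customers
GROUP BY CustomerID
HAVING COUNT(*) > 1;

-- Invalid data types, e.g. alphabetic characters in a decimal field
-- (TRY_CAST returns NULL when the cast fails).
SELECT *
FROM source_customers
WHERE Balance IS NOT NULL
  AND TRY_CAST(Balance AS DECIMAL(18, 2)) IS NULL;
```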
Now that you know about the types of ETL testing, let’s examine its methodologies.
ETL Testing Methodologies
Unit Testing:
The goal of unit testing is to verify mapping fields and business rules at an early stage of the development cycle, in the feed or staging layer.
Test Scenarios:
- Verify the mapping of fields present in all tables.
- Verify the data type constraints of all fields.
- Verify the correctness of surrogate keys.
- Check that error messages are logged into the target table.
- Check the data flow between the source, transformation, and target components using the data profiler in ETL tools.
- Check that string columns are left- and right-trimmed (see the sketch after this list).
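As an example of the last check, a simple trim probe might look like this (stg_customers and FirstName are hypothetical names):

```sql
-- Flag string values that were not trimmed during the staging load.
SELECT CustomerID, FirstName
FROM stg_customers
WHERE FirstName <> LTRIM(RTRIM(FirstName));
```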
Integration testing:
The goal of integration testing is to ensure that end-to-end workflows or ETL packages execute with the expected outcomes.
Test Scenarios:
- Verify data loading across the Feed -> Stage -> ODS -> Datamart layers.
- Verify the initial and incremental data loads as per the data load methodologies (see the sketch after this list).
- Validate that all target tables are loaded with the correct data.
- Monitor ETL packages’ execution time and performance.
- Validate job configuration and schedules.
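For the incremental load scenario, here is a rough sketch of a delta check; source_orders, FactSales, ModifiedDate, LoadDate, and the watermark date are all assumed names and values:

```sql
-- Compare the rows changed in the source since the last watermark
-- with the rows the incremental run actually loaded into the target.
SELECT
    (SELECT COUNT(*) FROM source_orders
     WHERE ModifiedDate > '2024-06-01') AS expected_new_rows,
    (SELECT COUNT(*) FROM FactSales
     WHERE LoadDate > '2024-06-01')     AS loaded_rows;
```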
System Testing:
The goal of system testing is to ensure that all transformation logic is implemented as per business requirements. It majorly focuses on data validation.
Test Scenarios:
- Compare the number of records in source tables with target tables (see the sketch after this list).
- Verify duplicate checks and NULL values.
- Verify that data is loaded in the expected format as per the transformation rules.
- Verify all aggregated measures in DataMart tables.
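For example, a minimal record count comparison could look like this (source_orders and FactSales are hypothetical):

```sql
-- Source and target row counts should match after a full load
-- (allowing for rows intentionally rejected by quality rules).
SELECT
    (SELECT COUNT(*) FROM source_orders) AS source_count,
    (SELECT COUNT(*) FROM FactSales)     AS target_count;
```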
If you would like to read about more methodologies, read our blog.
Allow me to tell you about bugs in ETL testing.
ETL Testing Bugs
Here is a list of possible bugs you may come across:
- The data type and length of an attribute may differ from the mapping requirement.
- Misuse of referential integrity constraints
- NULLs are getting populated for not null fields.
- Data is missing in tables because of a lack of integrity constraints.
- Accurate data is not loaded in target tables.
- Views are not created correctly to process data.
- Duplicate data is loaded where it’s not expected.
- Data is not getting matched as per business rules.
- Truncation of characters due to column size and data type mismatch
- Wrong mapping of columns in the ETL which may result in wrong data or nulls getting populated in the target.
- Errors in implementing transformation logic may result in bad data in the target table.
Bugs like these can be avoided if best practices are followed. Let’s take a look at some…
Best Practices for ETL Testing
- Understanding and analysis of source data
It is necessary to understand the volume and type of data you will be handling. You need to be familiar with the data types, schema, and other details to test database metadata.
- Data flow
Understand the data flow model between the Feed, Stage, ODS, and Datamart layers. Get detailed information about the stored procedures or workflows that load data into the various layers, along with the important tables, columns, and relationships between tables, for effective data load testing.
- Load pattern
Understand the data load pattern, which runs on a daily basis and performs incremental and full data loads. Validation of the data loading process is important.
- Business rules
Get a clear understanding of the transformation logic defined by the business, and implement queries on feed tables to compare source data with BI report data.
- Test data generation
For accurate business rule testing, you need to create test data as per requirements. Generate data for the defined scenarios to validate the data.
- ETL performance
Execute ETL jobs with a huge amount of data and ensure that the ETL package completes within an acceptable processing time.
Let’s now look at key challenges in ETL testing.
Key Challenges in ETL Testing
Here are some of the major obstacles:
- Testing on different platforms
One must ensure that data fetched from multiple platforms is validated consistently.
- Lack of experienced ETL testers
Experienced test analysts with sound knowledge of BI and profound experience in developing complex SQL queries are difficult to find.
- Test data generation
One must understand the business requirements and create test data that covers all transformation scenarios.
- Test automation of the ETL solution
ETL testing is data-driven and requires exhaustive testing; automating the end-to-end process, right from data extraction to loading, is a challenge.
Now it’s time to get acquainted with tools used in ETL testing.
ETL Testing Tools
QuerySurge
It is a smart data testing solution that leverages analytics to automate data validation and ETL testing.
Informatica Data Quality
It is an analyst tool. As soon as data is loaded, it highlights duplicate and NULL values.
iCEDQ
It automates end-to-end ETL testing with complete accuracy and increased coverage. iCEDQ is an ETL testing platform designed to identify data issues in and across structured and semi-structured data.
Well, there you have it! Allow me to wrap up my ideas…
ETL testing ensures data accuracy, correctness, and quality on BI reports, which helps the end user make accurate decisions at the right time. An end-to-end testing process plays a crucial role in ensuring that extracted data is loaded into the data warehouse as per the expected transformation business rules.
Reach out to us with your thoughts about this blog. Visit us at Nitor Infotech to learn about our quality engineering offerings.