Big Data represents all kinds of opportunities for all kinds of businesses, but collecting it, cleaning it up, and storing it can be a logistical nightmare. According to the latest IDC research, the big data technology and services market is growing at a compound annual rate of 27% and will be worth $32.4 billion by 2017.

Data warehouses have a vital role to play in Big Data as companies collect information on their business operations, markets, and client behavior to spot patterns, and aggregate the results to identify business opportunities and operational improvements. As Gartner’s Magic Quadrant for Data Warehouse Database Management Systems points out, “A data warehouse is simply a warehouse of data, not a specific class or type of technology.”

That’s one of the reasons data warehouse testing is becoming increasingly important for many organizations as they try to leverage the opportunities hidden in the data.

Data Warehouse “ETL” Processing

The main component (the heart and brain) of a data warehouse system is data Extraction, Transformation, and Loading (ETL). The challenge is to extract the data, often from a variety of disparate systems; to transform that data so that it is uniform in format and content; and finally to load it into the warehouse, where it can serve as the basis for future business intelligence needs.
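
To make those three stages concrete, here is a minimal SQL sketch of the flow, assuming hypothetical tables named src_transactions (the source), stg_transactions (the staging area), and dw_transactions (the warehouse). A real ETL implementation would normally use a dedicated tool rather than hand-written SQL, so treat this only as an illustration of the idea.

    -- 1. Extract: copy raw records from the source system into the staging area.
    INSERT INTO stg_transactions (txn_id, txn_date_raw, amount_raw, customer_code)
    SELECT txn_id, txn_date, amount, customer_code
    FROM   src_transactions;

    -- 2. Transform: make format and content uniform while the data sits in staging.
    UPDATE stg_transactions
    SET    txn_date_std = CAST(txn_date_raw AS DATE),
           amount_std   = CAST(amount_raw   AS DECIMAL(12, 2));

    -- 3. Load: move the cleaned, uniform records into the warehouse.
    INSERT INTO dw_transactions (txn_id, txn_date, amount, customer_code)
    SELECT txn_id, txn_date_std, amount_std, customer_code
    FROM   stg_transactions;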

It’s very important that the integrity of the data is maintained. It must be stored cleanly and without loss, and it must be accessible to analysts. In some industries, such as healthcare, there are strict governance rules and potentially severe consequences for error. If data warehouses are to deliver value, they require careful ETL testing to ensure that processes are working as intended.

Unique Challenges of ETL Testing

Testing the data warehouse and verifying that ETL processes are working correctly is very different from traditional application testing. Here are some of the main barriers to overcome:

  • There is no user interface – In traditional application testing, testers can validate
    functionality via the input and output of data through the user interface, which is expected
    to behave in a certain way based on the data entered and the actions taken. In data warehouse
    testing, instead of a user interface that a tester can generally grasp quite quickly, you have
    data and its relationships. The key testing skills here are the ability to look at data,
    validate data processing rules, and analyze the final data output. Knowledge of database query
    languages, such as SQL, is essential for testers to do this accurately; traditional manual
    testing skills alone are not enough (a simple example of such a query appears after this list).
  • The volume of data is huge – We could be looking at millions of transactions per day, and we
    must verify that the extraction, transformation, and loading of that data is working as
    designed, in real time as code is updated.
  • Data is from multiple sources – This is not a single system; it could be a long list of
    different systems feeding daily transactional data into the warehouse. Some of the data may
    even come from systems running in the cloud or hosted by a third party. The systems and their
    data won’t necessarily agree with each other on format and content. How do you collate that
    data and make sure everything gets processed consistently and linked together correctly?
  • Incomplete data and exceptions – The information collected by the various source systems may
    not be complete and in many cases may be full of exceptions. Some systems collect more data
    than others, and what they do collect may be incomplete or imperfect.
  • Rules are not static – The source systems can change over time due to release upgrades or a
    switch to another third-party vendor’s solution. You must be able to cope with these changes
    without having to redesign the data warehouse.
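
As a small illustration of the SQL skills mentioned in the first point above, here is the kind of query a data warehouse tester might write. The table and column names (dw_transactions, dw_customers, customer_code) are hypothetical; the query checks one simple processing rule, namely that every transaction loaded into the warehouse refers to a customer that exists in the customer dimension.

    -- Find warehouse transactions whose customer code has no matching row in the
    -- customer dimension (orphaned records). Names are illustrative only.
    SELECT t.txn_id,
           t.customer_code
    FROM   dw_transactions t
    LEFT JOIN dw_customers c
           ON c.customer_code = t.customer_code
    WHERE  c.customer_code IS NULL;
    -- If the ETL process is working as intended, this query should return no rows.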

High-Level Approaches for ETL Testing

The entire process is about taking data from a source file, cleaning it up in a temporary staging area, and then pushing it into the data warehouse. It’s vital to ensure that no data goes missing or gets incorrectly transformed. There are two high-level approaches to ETL data validation that can be considered:

  • Approach I: Validate the data from the data sources directly in the data warehouse. This
    approach validates that all the data in the source appears in the data warehouse according to
    the business rules. It does not validate the intermediate staging area or the transformation
    processes between the source and the data warehouse.
  • Approach II: Validate the data from the data sources through each step of the extract and
    transformation process, including the final load into the data warehouse. Validate the data at
    each transformation. For example, in stage one you would take the data from the file and verify
    it’s all there; in stage two, you would remove any junk information that you don’t want to
    process; in stage three, you would translate the data to clean it up and make it uniform; and so on.

The second approach, where testers verify each step in the process, is more time-consuming, but it makes it much easier to track down any problems that occur. The first approach takes less time, but makes it far more difficult to isolate a bug when one is found.
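
As a rough sketch of the first approach, the query below compares record counts between a hypothetical source table and its warehouse target. A real validation would go much further (field-level comparisons, business-rule checks), but a simple count reconciliation is usually the first test a tester runs.

    -- Approach I, simplest form: does the warehouse hold the same number of
    -- records as the source? (src_transactions and dw_transactions are
    -- illustrative names.)
    SELECT 'source'    AS side, COUNT(*) AS record_count FROM src_transactions
    UNION ALL
    SELECT 'warehouse' AS side, COUNT(*) AS record_count FROM dw_transactions;
    -- The two counts should match; a mismatch means records were dropped,
    -- duplicated, or rejected somewhere between the source and the warehouse.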

Key Areas of ETL Testing

There is a lot to test in order to verify the ETL process. Here are some of the major considerations that should form part of your planning. Ask yourself these questions:

  • Was all the expected data loaded from the source file? Bugs that result in dropped records
    are common. You must do a full inventory count and ensure that the right records made it into
    the right tables across the source landing area, staging area, and data warehouse. Were some
    records rejected because of missing data? Find the problems and work out why they occurred
    (the SQL sketch after this list shows a few checks of this kind).
  • Were all the records loaded into the right place? If the transaction order gets out of sync,
    you can end up with all the records being processed but some data overwritten by older data or
    landed in the wrong place. It’s important to verify that each data field from each source loads
    into the correct field and table, and in the correct order.
  • Do you have duplicate data? If you don’t remove duplicate data, the analysis and aggregate
    reports could be skewed. Duplicates can arise because the source system feeds them in or
    because of defects in the ETL processes. In either case, the duplicate check must happen as
    data moves through landing and staging and finally into the data warehouse.
  • Do the data transformation rules work? You may receive data with different date formats, or
    with different codes or descriptions for the same thing. The transformation process has to
    homogenize that data correctly so that it is stored consistently. Most data warehouses have
    many such transformation rules, which must all be verified with sample data that represents
    all the various permutations and possibilities.
  • Have you lost any part of the data? Truncation can cause problems when a source value is too
    long for its target field or the source file format changes. It is important to make sure data
    from the source file does not get truncated as it is processed from landing to staging to the
    data warehouse.
  • Is data being corrected and rejected correctly? Sometimes the data may be incomplete or
    unclear. It may be necessary to hold this data aside and move on with the rest. Such data
    must be rejected and logged for further clarification before it can be reprocessed.
  • Is the system generating error logs? You need to know when data is not being processed and
    why. You also need a system that can gracefully handle events such as a crash or power outage
    without, as far as possible, creating duplicates, losing records, or forcing the user to
    reprocess from the start.
  • Can the system handle the data load? Performance might not be an issue at first, but you have
    to think ahead. How much data will the system have to handle two years from now? What happens
    when the system already holds five or more years of data and new data has to be processed? It
    must be able to cope with the expected load, not just now but in the future too.
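
To make a few of these checks concrete, the sketch below shows them in rough SQL: a record-count comparison across the landing, staging, and warehouse layers; a duplicate check on a business key; and a truncation check on a text field. The table and column names (lnd_transactions, stg_transactions, dw_transactions, txn_id, customer_name) are hypothetical, and functions such as LENGTH vary slightly between databases, so adapt the queries to your own schema and platform.

    -- 1. Completeness: record counts across landing, staging, and warehouse.
    --    The three counts should reconcile, allowing for legitimately rejected rows.
    SELECT 'landing'   AS layer, COUNT(*) AS record_count FROM lnd_transactions
    UNION ALL
    SELECT 'staging'   AS layer, COUNT(*) AS record_count FROM stg_transactions
    UNION ALL
    SELECT 'warehouse' AS layer, COUNT(*) AS record_count FROM dw_transactions;

    -- 2. Duplicates: business keys that appear more than once in the warehouse.
    SELECT txn_id, COUNT(*) AS occurrences
    FROM   dw_transactions
    GROUP BY txn_id
    HAVING COUNT(*) > 1;

    -- 3. Truncation: warehouse values that are shorter than their staging source,
    --    which may mean data was cut off on the way in.
    SELECT s.txn_id
    FROM   stg_transactions s
    JOIN   dw_transactions  d ON d.txn_id = s.txn_id
    WHERE  LENGTH(d.customer_name) < LENGTH(s.customer_name);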

There’s a lot to consider for a smooth-running ETL process in your data warehouse. Proper testing is a prerequisite to reaping the benefits and insights hidden in the big data that’s being collected. This article should give you an idea of the challenges that lie ahead and what needs to be tested.


About the Author

Kaushal Amin is Chief Technology Officer for KMS Technology, a software development and IT services firm based in Atlanta, GA, and Ho Chi Minh City, Vietnam. He was previously VP of Technology at LexisNexis and a software engineer at Intel and IBM. You may reach him at kaushalamin@kms-technology.com.