Context: Data Warehouse / Business Intelligence (ETL Execution)
Background:
If you have ever been involved in DW/BI testing, waiting for ETL execution to complete won't be an alien subject to you. It is not ONLY frustrating to discover that the ETL fails at step X after running for, say, 8 long hours, but it is ALSO a show stopper for continuing with your functional testing, and it directly impacts your efficiency and effectiveness.
If there are performance lags (low-configuration hardware, limited network bandwidth, etc.), the ETL job will only take longer than expected, and often this adds to higher WAIT time (a.k.a. non-productive time).
Root Cause:
If the dev/test team wants to run the ETL with production data, which can be really huge (GBs or sometimes even TBs), the sheer volume means it obviously takes a long time.
The trade-off is REAL vs. SIMULATED data. Testing with real data comes at a cost.
Is testing team the only victim?
The answer is NO. Even the dev team has to wait for the entire ETL to run to see the effect of code changes during integration testing.
Why don't we ALWAYS create smaller test data?
-> Due to the data integration and transformation complexities, the team needs very deep domain knowledge of the business rules to create test data that satisfies the ETL conditions (lack of this often results in ETL failure or ZERO output after the run).
-> If the complexity is high (consider data coming from multiple disparate sources), creating data while maintaining the logical relationships can turn out to be quite tedious and time consuming.
Why don't we SELECT a REDUCED SUBSET of test data from production?
This is the heart of this blog, and we are going to cover it in detail. Though the above sounds simple and like the obvious solution, it has been observed that teams don't find it very popular, for the following important reasons:
1. Random sampling of data doesn't work. Many times it will result in ORPHAN RECORDS, where you end up pulling child records without their parent records from the source (see the sketch after this list).
2. ETL might break if the keys are not found.
3. ETL might succeed but can result in ZERO output records, as none of the business conditions are satisfied by the sampled data.
4. Business / Domain knowledge is required
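To make the orphan-record problem concrete, here is a minimal sketch in Python/pandas. The table and column names (customers, orders, customer_id) are purely illustrative assumptions, not from any specific project: sampling each table independently almost always breaks referential integrity, whereas sampling the parent table first and then pulling only its children does not.

```python
import pandas as pd

# Hypothetical parent/child tables: 'customers' (parent) and 'orders'
# (child, with customer_id as the foreign key). Names are illustrative only.
customers = pd.DataFrame({"customer_id": range(1, 1001)})
orders = pd.DataFrame({"order_id": range(1, 5001),
                       "customer_id": [(i % 1000) + 1 for i in range(5000)]})

# Naive random sampling of each table independently: the sampled orders
# will mostly reference customers that were NOT sampled -> orphan records.
cust_sample = customers.sample(frac=0.01, random_state=42)
order_sample = orders.sample(frac=0.01, random_state=42)
orphans = ~order_sample["customer_id"].isin(cust_sample["customer_id"])
print("orphan child records (naive sampling):", orphans.sum())

# Parent-driven sampling: sample the parent (master) table first, then pull
# only the child records belonging to the sampled parents -> zero orphans.
order_subset = orders[orders["customer_id"].isin(cust_sample["customer_id"])]
still_orphan = ~order_subset["customer_id"].isin(cust_sample["customer_id"])
print("orphan child records (parent-driven):", still_orphan.sum())
```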
What do we recommend to select SUBSET test data successfully? (A sketch of how the steps fit together follows the list.)
1. Select a pattern:
a) Master Data Driven
b) Transaction Data Driven
2. Identify Business Conditions
3. Select Sampling Method
4. Copy the SELECTED test data to a test db
5. Modify Test Data to cover Boundary Value conditions
6. Run the ETL
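Below is a minimal sketch of steps 1 through 5, assuming hypothetical tables (customer as the master/driver table, sales_txn as the transaction table), a hypothetical business condition (active customers in one region), and SQLite stand-ins for the source and test databases. Steps 4 and 5 are swapped in the code purely for convenience: the boundary-value tweaks are applied in memory before the copy. None of this is prescriptive; it only shows one way the steps can be wired together.

```python
import sqlite3
import pandas as pd

src = sqlite3.connect("source.db")   # stand-in for the production copy
tst = sqlite3.connect("test.db")     # the reduced test database

# Step 1 (master-data-driven pattern) + Step 2 (business conditions):
# pick master records that actually satisfy the ETL's business rules.
masters = pd.read_sql(
    "SELECT * FROM customer WHERE status = 'ACTIVE' AND region = 'EMEA'", src)

# Step 3 (sampling method): a simple fraction here; stratified sampling by
# region/segment is an alternative when category coverage matters.
masters = masters.sample(frac=0.05, random_state=1)

# Pull only the transactions belonging to the sampled master records,
# so the subset stays referentially intact (no orphan records).
ids = ",".join(str(i) for i in masters["customer_id"])
txns = pd.read_sql(
    f"SELECT * FROM sales_txn WHERE customer_id IN ({ids})", src)

# Step 5 (boundary values): nudge a few rows to the edges the business
# rules care about, e.g. a zero-amount and a very large transaction.
txns.loc[txns.index[:1], "amount"] = 0
txns.loc[txns.index[1:2], "amount"] = 999999.99

# Step 4 (copy the selected test data to a test db); step 6 is then to
# point the ETL at test.db and run it.
masters.to_sql("customer", tst, index=False, if_exists="replace")
txns.to_sql("sales_txn", tst, index=False, if_exists="replace")
```

A transaction-data-driven variant simply reverses the direction: select the transactions that satisfy the business conditions first, then pull the master records they reference.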
What are the advantages?
1. Save Time
2. Save Effort
3. Save Hardware Cost (Less Data, Less Space and computational resources required)
4. Better Test Coverage (High Requirement Coverage)
5. Better Code Coverage
The subsequent post will focus on the above methods in detail. Please share your feedback and suggestions.