When the data volume is manageable, the entire population can be validated as part of the test phases (unit testing / integration testing / system testing). Such an exhaustive study gives excellent confidence in quality and coverage, but it can be a very expensive method: validating all the data involves significant hardware cost, resource dependency, and time to complete the activity. Remember, too, that the objectives of a project do not always require an absolutely exact account of the entire population.
Sampling is one approach that can be adopted when the data is voluminous. Note that sampling does not mean that you are not equally interested in all the items in the population. On the contrary, you would like to study all of them, but you pick a sample for practical reasons. Perhaps you have a population of millions of objects and it is impossible to reach even a major part of them. Even in cases (with populations of, say, up to 10,000) where you might choose to study every object, a sampling study may be the prudent choice, because it saves time that you can then use to study the sampled items more carefully.
A few months back, we were in the middle of testing our application, which processes more than 1.5 TB of data and takes approximately 36 hours to run, when we received the great news that Microsoft had been voted the Best Employer in the BT TNS Mercer's Best Companies to Work For in India survey. It was no doubt a moment of pride, but I wasn't interviewed, and neither were any of my friends. What struck us was how the survey was still able to echo our feelings so well, even though most of the people I spoke to were never surveyed.
After thinking it over for a couple of days, we realized that this very same concept applies to our IT solution as well. Why should we process 1.5 TB of data and wait 36 hours for a run? And why should I make up data by generating it randomly, risking a compromise on coverage, when I have access to customer/production-equivalent (non-PII) data?
The Big Question:
“Why haven't we done intelligent sampling yet?”
The Big Challenge:
“Choosing sample test data from the population/customer data is not the challenge; ensuring that the sample is a GOOD sample is.”
I agree, “GOOD” is quite a subjective term. Let me rephrase:
“How do you ensure that the sample chosen is a GOOD representative of your entire data population?”
Prospective Hot Candidates:
The bigger the font size in the image below, the greater the importance and applicability.
Selecting the best sample for your requirement is the true objective. Multiple siloed solutions exist to cater to specific scenarios.
What better place to host this solution than Visual Studio DbPro?
Process: Intelligent Sampling Process Flow & Engine
- Selection Criteria
i) Population Selection
The data set to be sampled should be chosen first, based on the following parameters:
- Size – the dataset for sampling should be of reasonable size (>500 MB), to make the process of data sampling worth the effort.
- Complexity – if the dataset to be sampled is of low complexity, data generation approaches may be more suitable. Subset data selection helps with complex datasets that involve business rules and data integrity constraints.
- Test cases to be run – subset data selection suits scenarios where the number of test cases to run is high and a considerable amount of data regression is required.
- Variety in data – the dataset to be sampled should contain a variety of data. For example, the databases of an enterprise system would be ideal; on the other hand, one containing only taxonomy-related data or domain values may not be suitable for sampling.
ii) Business Rules / Requirements or user input / Sampling Frame
Whenever sample data is selected from an existing database, one needs to ensure that the underlying business rules stay intact. To cater to negative test case scenarios, a few samples breaking the rules are fine; on the whole, however, business rules have to be maintained.
Most datasets where sampling is required also have some 'business rules' attached to them. For example, expense-related calculations may need checks on exchange rates; billing-related calculations may need checks on status; and so on. If business rules are critical to the transformation of the system under test, the user can supply them as input to ensure that sampling happens accordingly.
iii) Get the Expected Cost of the Sample from the User
The user should also be able to specify an expected cost for the sample to be created. This cost is based on the size of, and time taken for, sample generation, plus the coverage determined by the data variety included in the sample. The expected cost entered by the user is used to prioritize which sampling algorithms to run, namely those known to deliver that cost. Running these algorithms with the expected cost as an input helps reach the required end result and propose different data sampling options to the user.
- Sampling Method
i) Data Profiling Required - Domain/Transactional Data
To sample effectively, we also recommend categorizing the existing dataset into tables containing 'master'* data and tables containing transactional data. This could be an outcome of the data integrity relationships detected, or a step within the sampling engine while applying the algorithm. The categorization helps in selecting the required number of samples from each of these tables as well as an appropriate sampling method. For instance, a table containing date/calendar information falls under the master category and need not be sampled or reduced, whereas one containing the daily transactions of an account is a candidate for sampling. Detecting the data that impacts coverage is critical before drawing up the sampling plan. This would be done by:
1. Profiling the data and checking for columns that contain distinctively unique values, which are sure to impact test case/data coverage.
2. Getting input from the user based on prior knowledge.
Once this input is provided, it is saved for use when generating the sampling plans.
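As a concrete sketch of the profiling step above, the distinct-value ratio per column is a simple signal: a ratio near 1.0 marks columns (IDs, keys) that drive coverage, while a low ratio marks columns (status codes, domain values) that are natural master-data or stratification candidates. The function name and row representation here are illustrative, not part of any product API:

```python
def profile_columns(rows):
    """Compute the distinct-value ratio for every column in a row set.

    A ratio near 1.0 marks columns whose values are distinctively unique
    (likely to impact test/data coverage); a low ratio marks columns with
    few repeating values (candidates for master data or stratification).
    """
    if not rows:
        return {}
    n = len(rows)
    return {col: len({r[col] for r in rows}) / n for col in rows[0]}
```

For four rows where `id` is unique and `status` takes two values, this returns `{"id": 1.0, "status": 0.5}`.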
ii) Generate Plans with associated cost
Generate the different plans possible for data selection and provide the 'cost' associated with each, based on the rules applied prior to sampling. This cost is calculated from the following factors:
- Expected size of the sample
- Expected data coverage of the sample.
The plans, along with an indicator of the cost associated with each, would be presented to the user. Thereafter, the user can either choose the best recommended plan or pick another at his or her discretion.
iii) User to choose plan or go for regeneration
If the user is not satisfied with the plans and their costs, he or she can ask for a plan regeneration, which involves trying a different set of inputs or providing different columns to drive data coverage.
- Sampling Engine
The sampling engine is expected to perform all of the above steps after scanning the target population and consuming the specific inputs provided by the user. The next steps are:
i. Implementing the sampling plan
ii. Reviewing the sampling process
The sampling engine is the core of the data selection process. Stages involved within the sampling engine:
i) Building the Sampling Rules
- The sampling rules are built from the data integrity checks run on the database, and from profiling the data for exceptions to business rules and categorizing datasets as master or transactional.
- This could also include setting a delta rule condition (in other words, reducing the scope of data to be sampled): when a data volume has already been sampled, the next time the sampling engine runs on the same dataset, a sample is generated only for the newly added rows. This saves time, effort, and money, and gives more ROI to the proposed feature inclusion in VSTF DB Pro.
- These sampling rules form the underlying conditions that the selected sample must satisfy.
- This ensures that when we run our ETL process on a sample selected from the source system, the ETL does not break due to an incorrect sample set.
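A minimal sketch of such a rule check, assuming rows are represented as dictionaries and a foreign-key column must reference a known set of parent keys (all names here are illustrative):

```python
def violates_integrity(sample, parent_keys, fk_column):
    """Return the sampled rows whose foreign key has no matching parent.

    Any row returned here would break a downstream ETL run, so the
    sampling engine must either drop it or pull in the missing parent.
    """
    return [row for row in sample if row[fk_column] not in parent_keys]
```

An empty result means the sample satisfies this particular integrity rule.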
ii) Apply algorithm
- The algorithms chosen for the sampling engine are covered in the next section (Section 7). Each algorithm would be applied to the source system data and its cost calculated.
iii) Calculate Sampling Cost:
We define sampling cost based on two parameters
- Sample Size
- Data coverage
The sampling cost is calculated from the size (either the row count of each table in the sample, or the actual size in MB/GB); the percentage reduction in size acts as one parameter of the cost. Data coverage in the sample is detected differently by different algorithms: if the algorithm is probabilistic, data frequency is key in determining coverage; if a statistical algorithm is used, distinct column values form the key criteria; and in t-way, the requirements are primary. This is covered in detail in the subsequent section.
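One way to sketch this two-parameter cost in Python: blend the size fraction and the coverage loss into a single score and rank plans against the user's expected cost. The 50/50 weighting and the helper names are assumptions for illustration, not part of the proposal:

```python
def sampling_cost(sample_size, population_size, coverage, size_weight=0.5):
    """Blend the two cost parameters into one score (lower is better).

    sample_size / population_size -- row counts or MB/GB; smaller is cheaper.
    coverage -- fraction of data variety retained, in [0, 1].
    """
    size_fraction = sample_size / population_size
    coverage_loss = 1.0 - coverage
    return size_weight * size_fraction + (1.0 - size_weight) * coverage_loss


def choose_plan(plans, expected_cost):
    """Pick the cheapest plan within the user's expected cost.

    Returns None when no plan qualifies, signalling that regeneration
    with different inputs is needed.
    """
    viable = [p for p in plans if p["cost"] <= expected_cost]
    return min(viable, key=lambda p: p["cost"]) if viable else None
```

A 10x size reduction at 80% coverage scores 0.5 * 0.1 + 0.5 * 0.2 = 0.15 under these weights.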
iv) Check for correctness and update the Sampling Rules
Identify whether the sample satisfies the basic rules; if the user wishes, he or she can opt for re-sampling, this time fine-tuning the rules as well as the inputs provided.
v) Log Errors
Any errors encountered during profiling or sampling-method generation are logged and reported.
The diagram below shows an end-to-end mock of the screens that could show up during this process of sampling method generation and selection and the various inputs that need to be provided by the user and the actions that take place.
Sampling Algorithms selected for Database Sampling
Within any of the frame types identified above, a variety of sampling methods can be employed, individually or in combination. Factors commonly influencing the choice among these designs include:
- Nature and quality of the frame
- Availability of auxiliary information about units on the frame
- Accuracy requirements, and the need to measure accuracy
- Whether detailed analysis of the sample is expected
- Cost/operational concerns
a) Probabilistic Sampling Methods - Multi-Stage Sampling
In most real applied work, we use sampling methods considerably more complex than the simple ones (random, cluster, stratified, systematic, etc.). The most important principle here is that we can combine the simple methods in a variety of useful ways to address our sampling needs as efficiently and effectively as possible. When we combine different probabilistic sampling methods, we call this multi-stage sampling. By combining different sampling methods we achieve a rich variety of probabilistic designs usable in a wide range of contexts. This reduces sampling error by relying on multiple randomizations: choosing random samples of preceding random samples.
3-Stage Multi-Stage Sample Process
- Select the clusters from the population
- Select all strata from each cluster
- In the next stages, select an additional random sample from the selected strata units, and so on
- Finally, combine all the ultimately selected samples
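The three stages above can be sketched as follows. Assumptions for illustration only: rows are dictionaries, half the clusters are kept at the first stage, and a fixed-size random draw is taken per (cluster, stratum) cell at the last stage:

```python
import random


def multi_stage_sample(population, cluster_key, stratum_key, per_stratum, seed=None):
    """3-stage multi-stage sample: random clusters -> all strata -> random rows."""
    rng = random.Random(seed)
    # Stage 1: select a random half of the clusters (at least one).
    clusters = sorted({row[cluster_key] for row in population})
    chosen = rng.sample(clusters, max(1, len(clusters) // 2))
    sample = []
    for c in chosen:
        in_cluster = [row for row in population if row[cluster_key] == c]
        # Stage 2: keep every stratum within each selected cluster.
        for s in sorted({row[stratum_key] for row in in_cluster}):
            cell = [row for row in in_cluster if row[stratum_key] == s]
            # Stage 3: random draw from each (cluster, stratum) cell.
            sample.extend(rng.sample(cell, min(per_stratum, len(cell))))
    return sample
```

Each stage is itself a simple method (cluster, stratified, random), which is exactly the combination idea described above.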
b) Statistical Sampling – Purposive Sampling
The difference between non-probability and probability sampling is that non-probability sampling does not involve random selection, while probability sampling does. Most sampling methods are purposive in nature because we usually approach the sampling problem with a specific plan in mind. In purposive sampling, we sample with a purpose: we usually have one or more specific predefined groups we are seeking. It is useful when you need to reach a targeted sample quickly and sampling for proportionality is not the primary concern.
For example, have you ever run into people in a mall asking if they could interview you? Most likely they are conducting a purposive sample. They might be looking for, say, females between 30 and 40 years old, because that is the group they need, and they select candidates by estimating.
One caveat is that you are also likely to overweight subgroups in your population that are more readily accessible.
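A purposive draw is essentially a filtered take, which also makes the accessibility bias easy to see: the first k matching rows win. A minimal sketch (names illustrative):

```python
def purposive_sample(rows, predicate, k):
    """Take the first k rows matching a predefined target-group predicate.

    Fast and targeted, but rows that appear earlier (i.e. are more
    accessible) are over-represented, which is the caveat with this method.
    """
    picked = []
    for row in rows:
        if predicate(row):
            picked.append(row)
            if len(picked) == k:
                break
    return picked
```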
c) T-Way Sampling Method
T-way Testing is a type of interaction testing which requires that for each t-way combination of input parameters of a system, every combination of valid values of these t parameters must be covered by at least one test case. It involves selecting test samples in such a manner that it covers all the t-wise interactions between the parameters and the possible values of a given system.
For example, consider an Employee entity with attributes such as Employee Discipline and Employee Role, and a Country entity with attributes such as CountryName and IsActive. The possible values of each entity are shown below.
The total number of combinations possible for values between Employee and Country = 3 × 2 × 3 × 2 = 36. We could also argue that the code will work with just 1 combination, but neither extreme strikes an optimal balance between quality assurance and delivery!
Applying the t-way testing methodology with t = 2, we get a dataset with only 9 combinations.
Similarly, with t = 3, the number of combinations increases; data coverage increases, and hence so does the sample size, striking a balance between the size of the dataset and the code coverage. A sample run generates 14 records for the current example.
We propose setting t = 3 to find the optimal data combination in terms of both coverage and time.
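A simple greedy sketch of t = 2 (pairwise) selection for the example above; greedy covering is one common way to build such sets and is not necessarily the algorithm a shipping tool would use:

```python
from itertools import combinations, product


def pairwise_sample(parameters):
    """Greedy t=2 covering: pick full rows until every pair of values
    across any two parameters appears in at least one selected row."""
    names = list(parameters)
    all_rows = [dict(zip(names, vals)) for vals in product(*parameters.values())]
    # Every (parameter-pair, value-pair) that still needs covering.
    needed = {(a, va, b, vb)
              for a, b in combinations(names, 2)
              for va in parameters[a] for vb in parameters[b]}
    chosen = []
    while needed:
        # Pick the full row that covers the most still-uncovered pairs.
        best = max(all_rows, key=lambda r: sum(
            (a, r[a], b, r[b]) in needed for a, b in combinations(names, 2)))
        chosen.append(best)
        for a, b in combinations(names, 2):
            needed.discard((a, best[a], b, best[b]))
    return chosen
```

For the 3 × 2 × 3 × 2 example this returns far fewer than the full 36 rows while still covering every pair of values.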
Cost/Benefit Analysis for the selected algorithms
This section is used internally by the system to determine the best algorithm for a selected population by performing a statistical cost/benefit analysis of each algorithm.
a) Cost Benefit for Probabilistic Algorithms
b) Cost Benefit for Statistical Algorithms
c) Cost Benefit for T-Way Algorithm
7.1 Avoid Re-sampling of the Sampled Data to increase data coverage
Sampling schemes should default to Without Replacement ('WOR'). No element should be selected again in subsequent sampling attempts if it was already sampled in a past attempt (unless RESAMPLING = 1 is explicitly configured by the user). By default, the re-sampling flag is set to 0 to ensure that already-sampled data does not get picked in every sample run and the rest of the data gets a fair chance of being selected.
For example, if we catch fish, measure them, and immediately return them to the water before continuing with the sample, this is a WR design, because we might end up catching and measuring the same fish more than once. However, if we do not return the fish to the water (e.g. if we eat the fish), this becomes a WOR design.
Marking already-sampled data helps ensure that it is not picked in future runs unless we run out of data. This follows the assumption that the sampling method chosen by the user or system picks the best available sample on the first run; on the second run we do not want to pick the already-sampled data (tracked via the run #), and so on.
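This run-to-run bookkeeping can be sketched as follows, assuming the set of already-sampled items is persisted between runs; the function name is illustrative, and the fallback branch corresponds to running out of fresh data:

```python
import random


def sample_without_resampling(population, k, already_sampled, seed=None):
    """Draw k items, avoiding anything sampled in earlier runs (WOR default).

    Previously sampled items are reused only when the fresh data runs out,
    mirroring the RESAMPLING = 0 behaviour described above.
    """
    rng = random.Random(seed)
    fresh = [x for x in population if x not in already_sampled]
    if len(fresh) >= k:
        picked = rng.sample(fresh, k)
    else:
        stale = [x for x in population if x in already_sampled]
        picked = fresh + rng.sample(stale, k - len(fresh))
    already_sampled.update(picked)
    return picked
```

Successive runs over the same population therefore yield disjoint samples until every element has been seen once.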
7.2 Weighted Sampling
Weights can be provided by the user in the proposed configuration settings to give more value and importance to a specific column that matters most to the business need.
More generally, data should usually be weighted if the sample design does not give each individual an equal chance of being selected. In many situations the sample fraction may be varied by stratum and data will have to be weighted to correctly represent the population.
As a real-world example, a simple random sample of individuals in the United Kingdom might include some in remote Scottish islands who would be inordinately expensive to sample. A cheaper method would be a stratified sample with urban and rural strata. The rural population could be under-represented in the sample but weighted up appropriately in the analysis to compensate.
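A without-replacement weighted draw can be sketched like this; `weight_fn` stands in for the user-supplied column weights from the configuration settings (all names illustrative):

```python
import random


def weighted_sample(rows, weight_fn, k, seed=None):
    """Draw up to k rows without replacement, favouring higher-weight rows.

    weight_fn maps a row to a positive weight, e.g. boosting rows from
    an under-represented stratum so the sample compensates for it.
    """
    rng = random.Random(seed)
    pool = list(rows)
    picked = []
    for _ in range(min(k, len(pool))):
        weights = [weight_fn(r) for r in pool]
        x = rng.uniform(0.0, sum(weights))
        cum = 0.0
        for i, w in enumerate(weights):
            cum += w
            if x <= cum:
                picked.append(pool.pop(i))
                break
        else:
            # guard against floating-point rounding at the upper edge
            picked.append(pool.pop())
    return picked
```

With equal weights this degenerates to simple random sampling; raising a stratum's weight raises its expected share of the sample.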
The following are direct and indirect benefits to the various stakeholders: developers, testers, PMs, and the business community.
- Overall Test Execution Time Reduction
- Higher Requirement & Test Coverage
- Higher Code Coverage
- Effective Unit Testing & Functional Testing – Boundary Values, Equivalence Partitioning
- Reduces Hardware Cost for IT
- High confidence while delivering to Test, UAT & Production
- Increased domain knowledge and business understanding gained while choosing and profiling the sample data
- Earlier detection of potential Data Quality issues by analyzing the data (while profiling) for sampling
Raj Kamal (rajkamal)