Why Test Data Quality Matters?

March 9, 2020

In the Agile/DevOps era, testers are under constant pressure to deliver software faster so that the business stays competitive. Quality is not negotiable, no matter how quickly or frequently you release. End users are growing impatient and less tolerant of defects. Production issues increase customer dissatisfaction and lead to loss of revenue and reputation for your business.

Testers often fail to discover the problems that customers encounter. Why?
Sometimes the same functionality works for one user but not for another. Why?

There can be many causes, but often the culprit is the test data being used.

Quality Testing Demands Quality Data

Testing is an activity that consumes and generates huge amounts of data, so data is a critical component of testing. Low test coverage allows defects to slip through the QA phase, and test coverage largely depends on the quality of the test data. No matter how good your test cases are, without the right test data your testing is never complete. I have observed that many testers do not use realistic test data when they test. The test cases they design may all rely on the same unvaried values, or on poor data that is incapable of finding bugs.

Testing is effective when testers mirror the conditions found in production. Testers, as customer advocates, must mimic the way the customer uses the product, both in terms of user behavior/actions and user data. Data in production is varied and will include characters that may not work well with your code, such as Unicode, special characters and whitespace.
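
As a rough illustration of what "varied" input can mean, the sketch below expands one clean value into several messier variants. The function name `edge_case_variants` and the specific variants are illustrative assumptions, not an exhaustive list.

```python
import unicodedata


def edge_case_variants(base: str) -> list[str]:
    """Return variations of `base` that resemble messy production input."""
    return [
        base,                                      # the clean value
        f"  {base}  ",                             # leading/trailing spaces
        f"{base}\t\n",                             # tab and newline
        base.upper(),                              # different casing
        f"{base}'; --",                            # quote and SQL comment characters
        f"<b>{base}</b>",                          # markup characters
        f"{base} Ünïcödé 日本語",                    # accented and non-Latin characters
        unicodedata.normalize("NFD", f"{base}é"),  # accent stored as a combining mark
        "",                                        # empty value
    ]
```

Feeding each variant into the same test case (for example a sign-up form or a search field) is one inexpensive way to move beyond a single happy-path value.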
Testing with realistic data will make your product robust because you'll find bugs that are likely to occur in production. However, it can be challenging to create the right test data and simulate real-world conditions in a test environment. Gathering test data can be a real pain for testers. They often lack permission to access the data sources, or they depend on developers, who are busy with feature development, for the data they require.

Test data management (TDM) is a critical part of your overall testing strategy. TDM is the administration of the data required by test processes, ensuring that quality data is available on demand. It is often overlooked and has remained mostly unchanged in spite of the advancements in test automation and the shift towards Agile/DevOps.

Production Data

In software testing, production data is king! Production data can be a great source of data diversity, and testing with it reflects real usage.

Can you use production data for testing?

Yes, you can, but you need to mask (obfuscate or anonymize) and subset the data first. Avoid testing with raw production data, as using customer data raises legal, privacy and compliance concerns. Regulations like GDPR were designed to protect the personal information of individuals and define how organizations should process personal data. Data masking, also called data obfuscation, is therefore necessary: it replaces the original values with modified content so that the real data stays hidden.
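
A minimal masking sketch, assuming customer rows are plain dictionaries with `name`, `email` and `phone` fields (hypothetical field names). It uses a keyed hash so the same input always maps to the same masked value, which keeps joins between tables consistent. Whether this level of masking is sufficient for a given regulation is a separate question that the sketch does not answer.

```python
import hashlib
import hmac
import re

# Hypothetical key: in practice this would come from a secret store, not source code.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a value to a stable, non-reversible token using HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def mask_customer(row: dict) -> dict:
    """Return a copy of `row` with personal fields replaced by masked values."""
    masked = dict(row)
    token = pseudonymize(row["email"])
    masked["name"] = f"Customer {token[:6]}"
    masked["email"] = f"user_{token}@example.com"
    # Keep the phone number's format (length, separators) but drop the digits.
    masked["phone"] = re.sub(r"\d", "0", row["phone"])
    return masked
```

Non-sensitive columns pass through untouched, so the masked rows keep the variety that made the production data worth using in the first place.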

Another challenge with production data is its sheer volume. It is rarely practical to copy all production data into your test environments, because they typically lack the storage to hold it. Data subsetting creates a copy of a database that contains only a portion of the data yet still reflects the variety of production data. Subsetting helps you improve security and reduce the storage costs of using production data for testing.
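
Subsetting can be sketched with Python's built-in `sqlite3` module. The two-table schema (`customers`, `orders`) and the rule "one customer per country" are assumptions chosen for illustration; the point is that child rows are copied only for the selected parents, so the subset keeps its referential integrity.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL
);
"""


def build_demo_source() -> sqlite3.Connection:
    """Create a tiny stand-in for a production database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "DE"), (2, "DE"), (3, "IN"), (4, "IN"), (5, "US")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 1, 10.0), (2, 2, 20.0), (3, 3, 5.5), (4, 5, 99.0), (5, 5, 1.0)],
    )
    conn.commit()
    return conn


def pick_representatives(source: sqlite3.Connection) -> list[int]:
    """Pick one customer per country so the subset keeps that variety."""
    rows = source.execute(
        "SELECT MIN(id) FROM customers GROUP BY country ORDER BY 1"
    ).fetchall()
    return [row[0] for row in rows]


def build_subset(source: sqlite3.Connection, customer_ids: list[int]) -> sqlite3.Connection:
    """Copy the chosen customers and only their orders into a new database."""
    target = sqlite3.connect(":memory:")
    target.executescript(SCHEMA)
    marks = ",".join("?" for _ in customer_ids)
    customers = source.execute(
        f"SELECT id, country FROM customers WHERE id IN ({marks})", customer_ids
    ).fetchall()
    orders = source.execute(
        f"SELECT id, customer_id, total FROM orders WHERE customer_id IN ({marks})",
        customer_ids,
    ).fetchall()
    target.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    target.commit()
    return target
```

Real schemas have many more tables and deeper relationships, so the selection has to follow every foreign key, which is where simple subsetting tends to break down.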

How to load production data in test environments?

  • Test data management tools

Commercial TDM (test data management) tools can manage data from several data sources and provide capabilities like subsetting, data masking, cloning, provisioning, sensitive data discovery and classification, test data generation and so on. An ideal TDM tool should give teams control over the provisioning of test data so that test cycles become shorter. Beware, enterprise TDM tools can be really expensive!

  • Build in-house tools and automated scripts

Take the initiative to build in-house test data management tools that fulfill your use case. Automate the process of test data management within your organization.

Challenges with production data

  • Testers have limited control over the quality of data obtained from data subsetting. The subset might not contain all the permutations the test cases require, such as negative data, boundary values or missing data, which cover edge cases.
  • Masking data can be expensive and slow.
  • Data refreshes (replacing stale data with new data) can be slow and error-prone.
  • TDM with simple subsetting techniques can fail to maintain interrelationships in complex data.
  • It is challenging to preserve the characteristics of production data, such as data types, data complexity and referential integrity, after masking, obfuscation or anonymization.
  • Production data might lack the data needed to test new functionality.
  • Production data can be highly repetitive and focus on a happy path with fewer outlier results.
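
When a subset lacks negative, boundary or missing values, a tester can top it up by hand. The sketch below derives such values for a bounded integer field; the function name and the example range of 18 to 90 (say, an age field) are assumptions.

```python
def boundary_values(minimum: int, maximum: int) -> dict:
    """Derive boundary, negative and missing test values for an integer field."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    candidates = {minimum, minimum + 1, maximum - 1, maximum}
    return {
        # Values on and just inside the limits, restricted to the allowed range.
        "valid": sorted(v for v in candidates if minimum <= v <= maximum),
        # Values just outside the limits, which the application should reject.
        "invalid": [minimum - 1, maximum + 1],
        # Missing data, represented here as None and an empty string.
        "missing": [None, ""],
    }


age_cases = boundary_values(18, 90)
```

Here `age_cases["valid"]` is `[18, 19, 89, 90]` and `age_cases["invalid"]` is `[17, 91]`.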

Synthetic Test Data

Synthetic test data does not use any data from the production DBs. It is programmatically generated and does not contain sensitive data. Synthetic test data is often the better choice when you do not have enough data or don't want to wait for real data to be produced. You can use synthetic data without worrying about privacy and compliance related to personal data. Modern TDM tools offer traditional TDM capabilities like masking and subsetting and can also generate synthetic data that looks like real-world data. Quality synthetic test data must reflect the application's complex business logic and rules.

Challenges with Synthetic Data

  • Synthetic data might not replicate your production data's complexity and referential integrity (data accuracy and consistency when linked between two or more tables).
  • Synthetic data may not be rich enough to cover edge cases realistically.

How to create synthetic data as a tester?

You can write your own script, or use one of the several tools and libraries that generate huge amounts of data with ease. Well-designed software provides the capability to generate and manage its own data. As a tester, you should leverage the different ways available to load data into your databases, according to your context.

  1. Create data directly in your DBs using scripts.
  2. From the GUI/frontend:
    • You might generate a huge CSV and upload it through the UI during your automated UI tests to create the data you need.
    • You can have an in-house application that feeds data through the UI, so exploratory testers can create data on the fly.
  3. Using APIs:
    • You can leverage your application's CRUD APIs to easily create the data you need for your exploratory and automated tests. Automated UI and API tests can use the CRUD APIs to generate prerequisite test data and to clean it up in teardown.
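
The first two options above can be sketched with the standard library alone. The `customers` layout, the name list and the age range are made-up assumptions; a fixed seed keeps the generated data reproducible between test runs.

```python
import csv
import io
import random
import sqlite3

# Hypothetical sample values; a real generator would use richer sources.
FIRST_NAMES = ["Ana", "Björn", "Chen", "Dmitri", "Zoë"]
DOMAINS = ["example.com", "example.org"]
FIELDS = ["id", "name", "email", "age"]


def generate_customers(count: int, seed: int = 42) -> list[dict]:
    """Generate `count` synthetic customer rows, reproducibly."""
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "name": rng.choice(FIRST_NAMES),
            "email": f"user{i}@{rng.choice(DOMAINS)}",
            "age": rng.randint(18, 90),
        }
        for i in range(1, count + 1)
    ]


def to_csv(rows: list[dict]) -> str:
    """Serialize rows as CSV text, e.g. for a bulk upload through the UI."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def load_into_db(rows: list[dict], conn: sqlite3.Connection) -> None:
    """Insert rows directly into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT, email TEXT, age INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (:id, :name, :email, :age)", rows
    )
    conn.commit()
```

Calling `generate_customers(100_000)` and passing the result to either `to_csv` or `load_into_db` produces a large dataset in seconds, which is usually the point of scripting it.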

Whether you provision production data or generate synthetic data, the approach should be automated. Testers should employ automated test data generation to reduce time, effort and costs.
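
For API-driven setup and teardown, one common pattern is a context manager that creates the record before the test and deletes it afterwards, even if the test fails. `InMemoryUserApi` below is a hypothetical stand-in for a client of your application's CRUD API; its method names are assumptions.

```python
from contextlib import contextmanager


class InMemoryUserApi:
    """Hypothetical stand-in for a client of the application's CRUD API."""

    def __init__(self):
        self._users = {}
        self._next_id = 1

    def create_user(self, payload: dict) -> int:
        user_id = self._next_id
        self._next_id += 1
        self._users[user_id] = dict(payload)
        return user_id

    def get_user(self, user_id: int):
        return self._users.get(user_id)

    def delete_user(self, user_id: int) -> None:
        self._users.pop(user_id, None)


@contextmanager
def temporary_user(api, payload: dict):
    """Create a user for the duration of a test, then clean it up."""
    user_id = api.create_user(payload)
    try:
        yield user_id
    finally:
        # Runs even when the test body raises, so no stale data is left behind.
        api.delete_user(user_id)
```

A test would then wrap its steps in `with temporary_user(api, {"name": "Zoë"}) as user_id:` and never need a separate cleanup step.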

Ideas to reduce data related production issues

  • Shortlist customers who experience issues more frequently than others and use their datasets (after anonymization, of course) to test.
  • Real user monitoring can be helpful to understand real-user behavior and actions. Testers can then create test ideas or update test cases reflecting real-user behavior and realistic test data for effective testing.
  • Synthetic data can be used in combination with production data. This is called partial synthetic data, where we replace sensitive sections of production data with synthetic data. Generate partial synthetic data to cover edge cases that are not found in production.
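
The partial synthetic approach from the last point might look like the sketch below: sensitive fields are overwritten with generated values, the remaining production values are kept, and a few hand-written edge cases are appended. The field names (`name`, `email`, `order_total`) and the edge-case rows are assumptions for illustration.

```python
import random


def partially_synthesize(rows: list[dict], seed: int = 7) -> list[dict]:
    """Replace sensitive fields with synthetic values; keep everything else."""
    rng = random.Random(seed)
    mixed_rows = []
    for index, row in enumerate(rows, start=1):
        mixed = dict(row)
        mixed["name"] = f"Test User {rng.randint(1000, 9999)}"
        mixed["email"] = f"synthetic{index}@example.com"
        mixed_rows.append(mixed)
    return mixed_rows


def with_edge_cases(rows: list[dict]) -> list[dict]:
    """Append synthetic rows for cases that production data may not contain."""
    extras = [
        {"name": "X" * 255, "email": "edge1@example.com", "order_total": 0.0},
        {"name": "O'Brien-Łukasz", "email": "edge2@example.com", "order_total": -1.0},
    ]
    return list(rows) + extras
```

The non-sensitive `order_total` values still come from production, so the realistic distribution survives while the personal details do not.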


Testing teams cannot afford to ignore test data management. Implementing it will greatly increase testing effectiveness by enabling teams to quickly create large volumes of data on demand. This lets teams focus on creative exploratory testing rather than worrying about generating realistic test data. Problems arise when QA teams depend on production data alone or on synthetic data alone. A wise approach to test data management is to strike the right balance between production data and synthetic data instead of choosing one over the other.