Masking is not a complete test data solution: Moving from GDPR compliance to full-blown “Test Data Automation”

I’ve written fairly frequently on the impact of the GDPR on testing, often responding to the news and research that continues to flow in. The infographic below summarises my thinking, drawing on news and research from 2019-2020 to show why test data privacy needs addressing today.*

*As always, just my two cents on this. Hopefully an interesting opinion, but I’m no qualified legal expert and this is not legal advice.
Figure 1 – The cost of non-compliance: test data privacy matters today.

The stats tell a pretty consistent story: the risk of a data breach continues to rise, as do the associated fines and brand damage. Meanwhile, the most effective way to mitigate this risk in testing is to limit the sharing of sensitive information to test environments.

I’ve listed the sources for the infographic below this article and I recommend that you take a look around to see just how high the stakes are for GDPR compliance. But first, let’s consider some ways in which GDPR compliance might impact testing and development, before considering what a good and bad response to this challenge looks like.

If the proposed approach sounds interesting, be sure to join Curiosity on May 12th for our next webinar – Test Data Automation: Delivering Quality Data at Speed.

The significance of the GDPR for testing and development

The impact of the GDPR on testing and development can be far-reaching. The particular issues that might need addressing in QA include:

  1. The logistics of fulfilling the “Right to Erasure” and “Right to Portability”: Organisations’ IT estates are often not set up to find and either copy or erase every instance of a person’s data “without delay”. Test environments are often sprawling and poorly understood, including legacy components and uncontrolled spreadsheets. Few organisations today therefore reliably know where sensitive information resides, and fewer still have the techniques to identify, copy or erase every instance of one person’s data “without delay”.
  2. Demonstrating legitimate grounds of processing data in QA: Structures are often lacking for demonstrating that each use of personal data in testing and development fulfils one of the grounds for its legitimate processing. These grounds include active and informed consent, legitimate business interest, and national security.
  3. Compliance with data minimisation and limitation: Organisations must furthermore share only as much data, with only as many people, as is required to fulfil the legitimate grounds for data processing, and that data must be kept only for as long as is required to fulfil that purpose. Organisations should be able to demonstrate this, but that is again a challenge when many organisations simply do not know what data resides where, nor how it is being used.

The challenge of test data compliance

It’s also worth emphasising that the maximum fines for non-compliance have increased substantially, and that national agencies have already shown a willingness to impose them for data breaches.

Ultimately, the best defence against sensitive information leakage is to limit the number of people and places to which sensitive information is shared. QA environments are an avoidable place to which PII is routinely copied, and testing is therefore a good place to begin in the battle to ensure GDPR compliance.

Test data masking is a common technique deployed to reduce the risk of exposing sensitive information to less secure test environments. Anonymizing production data can soften the impact of compliance requirements in testing; however, it often also introduces significant bottlenecks in testing and development.

Masking data consistently from numerous sources is slow and complex, as it must retain the complex relationships that exist within and across those data sources. Often, a central Ops team must work overtime to fulfil constant requests for data refreshes, but they can never deliver data fast enough for parallel test teams:

Figure 2 – Logistical and linear approaches to test data provisioning undercut parallelism, speed, and agility.
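To illustrate why consistency is the hard part of masking, the sketch below (hypothetical names and tables, not any particular masking tool) uses deterministic hashing so that the same real value always maps to the same pseudonym, keeping relationships across data sources intact:

```python
import hashlib

# Hypothetical deterministic masking: the same input always maps to the
# same pseudonym, so values that link rows across tables stay consistent.
FIRST_NAMES = ["Alex", "Sam", "Jo", "Chris", "Pat", "Morgan"]

def mask_name(real_name: str, seed: str = "tdm-demo") -> str:
    """Pick a replacement name deterministically from the real value."""
    digest = hashlib.sha256((seed + real_name).encode()).hexdigest()
    return FIRST_NAMES[int(digest, 16) % len(FIRST_NAMES)]

# Two "systems" holding the same customer mask to the same pseudonym,
# so joins between the masked data sets still work.
crm_row = {"customer": "Alice Smith", "segment": "retail"}
billing_row = {"customer": "Alice Smith", "balance": 42.0}

masked_crm = {**crm_row, "customer": mask_name(crm_row["customer"])}
masked_billing = {**billing_row, "customer": mask_name(billing_row["customer"])}

assert masked_crm["customer"] == masked_billing["customer"]
```

Real masking tools apply the same principle at scale, across millions of rows, many data types, and referential constraints spanning multiple databases and file formats.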

Precious QA time is then wasted waiting for out-of-date copies of production, while tests fail due to misaligned or outdated data. Testers and automated tests also compete for the same data, causing further delays as data is locked or used up.

Copying production data furthermore does nothing to improve the quality of the test data itself.

The historical data sets by definition lack the data needed for testing unreleased functionality, while production activity focuses almost exclusively on “happy path” journeys through the system. Production data therefore lacks the outliers and negative scenarios needed for sufficient test coverage, leaving systems exposed to damaging defects in production:

Figure 3 – Production Data lacks the combinations needed for rigorous testing.

Moving to a complete solution

A complete test data solution must look beyond the removal of PII from test environments – it must also ensure that the data testers need is readily available when and where they need it. That means data combinations for each and every test, available in the right test environments at the speed demanded by automated testing and iterative delivery.

The good news is that this approach, which Curiosity calls “Test Data Automation”, builds on many of the skills and technologies currently used in Test Data Management. The following techniques can be added readily to existing masking and subsetting processes.

1. Synthetic test data generation

Generating combinations of data not found in production plugs the gaps in test coverage that stem from low-variety test data. Today, data generation can furthermore be integrated readily into data masking, adding any missing combinations automatically as anonymized data moves to test environments.

This is a quick win for increasing testing quality while retaining test data compliance. It produces test data sets that do not contain sensitive production information, but do contain the outliers, edge cases and negative scenarios needed to test new or updated functionality.
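As a minimal illustration, combinatorial generation over hypothetical boundary and negative values produces exactly the kind of rows that production data rarely contains:

```python
import itertools

# Hypothetical boundary and negative values for a payment "amount" and
# "country" field - the outliers production data rarely contains.
amounts = [-1, 0, 0.01, 999_999.99]   # negative, zero, and boundary values
countries = ["GB", "DE", "", "ZZ"]    # valid, empty, and unknown codes

# The Cartesian product yields every pairwise combination as test input,
# none of it copied from production.
synthetic_rows = [
    {"amount": a, "country": c}
    for a, c in itertools.product(amounts, countries)
]

print(len(synthetic_rows))  # 16 combinations
```

Real generators add realistic formats, referential links, and coverage techniques such as pairwise reduction, but the principle is the same: enumerate the edge cases and negative scenarios systematically rather than hoping production happened to produce them.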

2. On-the-fly test data allocation

Test data “allocation” can today replace the provisioning of full-size data copies, allocating exact data combinations based on test case definitions. This allocation is furthermore possible “just in time”, finding and making data combinations as tests are generated or executed.

In this approach, each test comes equipped with an up-to-date and unique data combination, cloned and allocated as it runs. This eliminates the delays caused by manual data provisioning and invalid test data, as well as the frustration of competing for shared data sources. Each test and each tester instead has their own isolated data set, available on demand and in parallel.
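A minimal sketch of the idea (a hypothetical in-memory allocator, not a specific product API) shows how each test can claim a unique, isolated data set on demand:

```python
import itertools
import threading

# Hypothetical just-in-time allocator: each test claims a unique, unused
# data row on demand, so parallel tests never compete for the same data.
class TestDataAllocator:
    def __init__(self, rows):
        self._rows = iter(rows)
        self._lock = threading.Lock()

    def allocate(self, test_name: str) -> dict:
        with self._lock:  # safe for parallel test runners
            row = next(self._rows)
        return {**row, "allocated_to": test_name}

# An endless supply of unique account rows stands in for real lookups.
accounts = ({"account_id": f"ACC-{n:04d}"} for n in itertools.count(1))
allocator = TestDataAllocator(accounts)

a = allocator.allocate("test_login")
b = allocator.allocate("test_checkout")
assert a["account_id"] != b["account_id"]  # every test gets isolated data
```

In practice, allocation resolves test case definitions against real back-end systems, finding matching rows or generating them when none exist, but the contract is the same: one fresh, exclusive data combination per test run.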

3. Standardized, re-usable TDM processes

To enable true QA agility, testers’ dependency on a data provisioning team must be removed. The silo between test data provisioning and test execution must be eliminated, providing on-tap access to rich test data.

One way to achieve this is to make the processes used in data provisioning re-usable by test teams themselves.

Fortunately, these processes can be readily automated. They are by definition rule-based, as they must reflect the relationships that exist in complex data. They can therefore be standardized and exposed to test teams on demand.

In this approach, test teams can parameterise and embed re-usable test data processes within their automated testing and CI/CD pipelines. Data is then found, made and prepared as the tests run, removing the dependency on an overworked Ops team.
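As a sketch of the pattern (hypothetical process names and registry, not a specific tool’s API), a data process can be defined once and then invoked with parameters from any test or pipeline stage:

```python
# Hypothetical registry of reusable, parameterised test data processes,
# exposed to test teams instead of being run manually by an Ops team.
PROCESSES = {}

def register(name):
    """Decorator that publishes a data process under a stable name."""
    def wrap(fn):
        PROCESSES[name] = fn
        return fn
    return wrap

@register("mask_and_subset")
def mask_and_subset(source: str, rows: int) -> dict:
    # Stand-in for the real masking/subsetting job against a database.
    return {"source": source, "rows": rows, "status": "complete"}

def run_process(name: str, **params) -> dict:
    """What a self-service portal or CI/CD step would call."""
    return PROCESSES[name](**params)

# A pipeline stage requests exactly the data it needs, on demand:
result = run_process("mask_and_subset", source="crm_db", rows=500)
assert result["status"] == "complete"
```

Whether the entry point is a web portal, an API, or a pipeline plugin, the design choice is the same: the Ops team’s expertise is captured once as a parameterised, repeatable process, rather than re-executed by hand for every request.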

In principle, any technique currently deployed by that Ops team can be exposed to test teams on demand. One approach creates self-service web portals to enable testers to re-use these data processes in their automated testing:

Figure 4 – Self-service web portals remove the upstream dependency on a central team.

See Test Data Automation in practice

Test Data Automation accordingly builds on the techniques used today to ensure test data compliance. It incorporates these processes in a holistic test data solution that also facilitates testing rigour and agility.

Sound too good to be true?  Come and see Test Data Automation in action. Join Curiosity on May 12th for our next webinar – Test Data Automation: Delivering Quality Data at Speed.

References

[1] Danny Palmer (ZDNet: 20/01/2020), GDPR: 160,000 data breaches reported already, so expect the big fines to follow, retrieved on 15/04/2020.

[2] Luke Irwin (IT Governance: 09/03/2020), Infographic: Cyber Attacks and Data Breaches of 2019, retrieved on 15/04/2020.

[3] BBC (08/07/2019), British Airways faces record £183m fine for data breach, retrieved on 14/04/2020.

[4] British Airways faces record £183m fine for data breach

[5] British Airways faces record £183m fine for data breach

[6] Greg Sterling (Marketing Land: 04/12/2019), Nearly all consumers are concerned about personal data privacy, survey finds, retrieved on 15/04/2020.

[7] Thomas C. Redman and Robert M. Waitman (Harvard Business Review: 28/01/2020), Do You Care About Privacy as Much as Your Customers Do?, retrieved on 15/04/2020.

[8] Infographic: Cyber Attacks and Data Breaches of 2019

[9] Infographic: Cyber Attacks and Data Breaches of 2019

[10] Rob Davies and Dominic Rushe (The Guardian: 14/07/2019), Facebook to pay $5bn fine as regulator settles Cambridge Analytica complaint, retrieved on 15/04/2020.