GDPR and testing: A few questions to ask yourself

I’ve been harping on about GDPR and other recent developments in compliance for years now, and it’s good to see that QA organisations are seriously grappling with compliance as a pressing issue. Each new data breach, and each new study on consumer concern for data privacy, reaffirms the need to take data privacy seriously. Yet practices that I consider higher risk remain common in testing: the latest World Quality Report, for example, finds that 60% of organisations still use raw production data in test environments.

Below, I’ve gathered together some research and news articles that have come out within the last year or so, each related to GDPR and compliance in some way. The intention is to use fresh data to reiterate a point already well made by others: the practice of using raw production data in less secure test environments should be examined seriously. It should be scrutinised in terms of security, data breach prevention, and compliance, and only then should it be judged to be “okay”.

I’m no legal expert, and what follows represents only my personal interpretation of what recent legislation means for testing best practices. However, I hope some of these questions give pause for thought. Please feel free to leave your comments on the impact of legislation for QA below, or drop me a direct message.

Questions to ask yourself:

  1. Do you have informed and actively given consent, or another legitimate ground for using that data? Can you show that you have permission from the EU ‘data subject’ to use their information in the way it’s being applied in test environments? You might have a lot of test cases, and a lot of data; what measures are in place to ensure that consent, or another legitimate ground for data processing, is in place for testing?
  2. Are you abiding by the rules around Purpose Limitation and Data Minimisation? Do you know that the data is being used by only enough people, and kept for only long enough, to fulfil the service for which that person consented to the use of their data? Can you prove it if audited, or do you have another legitimate purpose for processing the data? Can you be sure that your test teams are not holding on to data indefinitely, perhaps unaware they still have it, even after consent has expired or been withdrawn?
  3. What about Purpose Limitation and the Right to Erasure? How reliably can you remove every instance of that person’s data in test environments if they request its deletion, or if you no longer need it to fulfil the service for which they provided it? Finding every instance of data quickly and reliably can be difficult with large IT estates, especially with a mixed bag of new and legacy components. Storing sensitive data in test environments can make this worse, and tools and techniques will be needed for performing rapid and reliable data profiling and lookups. What if testers and automation engineers are keeping useful data in a handy spreadsheet on their local machine?
  4. What about citizens’ Right to Data Portability and Right to Erasure? Again, can you find every instance of data if someone asks for it to be deleted, or if they ask for a copy of it in a format readable by them? This must occur “without undue delay”. How good is your current infrastructure for finding every instance of data, copying it, and provisioning it in a readable format like an Excel spreadsheet?
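
To make the lookup problem in questions 3 and 4 concrete, here is a minimal sketch of what “find every instance” might look like for a single test database. It assumes a SQLite database in which the data subject is identified by the same column name (here `email`) in every table; the table names, column names and helper functions are all hypothetical, and a real estate with several databases, files and local spreadsheets would need far more than this.

```python
import json
import sqlite3


def find_subject_rows(conn, column_name, value):
    """Return {table: [row dicts]} for every table holding `value` in `column_name`."""
    hits = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = [r[1] for r in conn.execute(f'PRAGMA table_info("{table}")')]
        if column_name not in columns:
            continue
        cursor = conn.execute(
            f'SELECT * FROM "{table}" WHERE "{column_name}" = ?', (value,))
        rows = [dict(zip(columns, row)) for row in cursor]
        if rows:
            hits[table] = rows
    return hits


def erase_subject(conn, column_name, value):
    """Delete the subject's rows from every table and return how many were removed."""
    removed = 0
    for table in find_subject_rows(conn, column_name, value):
        cursor = conn.execute(
            f'DELETE FROM "{table}" WHERE "{column_name}" = ?', (value,))
        removed += cursor.rowcount
    conn.commit()
    return removed


# Hypothetical test database: two tables hold the person, one does not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (email TEXT, name TEXT);
    CREATE TABLE orders (order_id INTEGER, email TEXT, total REAL);
    CREATE TABLE products (sku TEXT, title TEXT);
    INSERT INTO customers VALUES ('ana@example.com', 'Ana'), ('bo@example.com', 'Bo');
    INSERT INTO orders VALUES (1, 'ana@example.com', 9.5), (2, 'bo@example.com', 20.0),
                              (3, 'ana@example.com', 4.0);
""")

found = find_subject_rows(conn, "email", "ana@example.com")
portable_copy = json.dumps(found, indent=2)  # readable export for a portability request
removed_count = erase_subject(conn, "email", "ana@example.com")
remaining = find_subject_rows(conn, "email", "ana@example.com")
```

The sketch only matches exact values in a known column. It would not catch the same person stored under a differently named column, in free text, or in that spreadsheet on a tester’s laptop, which is exactly why the questions above are hard to answer with confidence.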

The stakes are high:

You might answer ‘yes’ to some or all of the above questions; some of the most advanced tech organisations, for example, can evidently find and provision user data rapidly upon request. However, in my view, these questions deserve careful, honest, and ongoing consideration, for the following reasons:

  1. Since the implementation of GDPR in 2018, there have been a whopping 278 data breach notifications per day. In time, we will also learn of the impact of the California Consumer Privacy Act, introduced this past New Year’s Day.
  2. The UK’s ICO and other national agencies are showing their willingness to serve unprecedented fines for data breaches. In July, for instance, the ICO announced planned fines of £183 million and £99.2 million.
  3. Consumers and the general public today care about data privacy, and are prepared to act on it. 97% of US adults are “somewhat or very concerned about protecting their personal data.” 32% globally are “privacy actives”, who have already acted by switching companies or providers over data or data-sharing policies.

In my experience, several organisations lack the infrastructure or understanding of their complex data to be able to guarantee that they have located every instance of sensitive information in test environments. Extracting and provisioning that data rapidly can likewise be tricky, especially when working with a mixed bag of homegrown techniques. If that sounds familiar, the above questions around Erasure, Portability and Data Minimisation might be particularly pertinent.
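
As a rough illustration of the kind of data profiling that can help here, the sketch below flags columns whose values look like email addresses or phone numbers, regardless of what the column is called. The two regular expressions, the 50% threshold and the sample rows are my own assumptions for the example; they are deliberately crude and would miss names, addresses and most other PII.

```python
import re

# Deliberately simple, assumed patterns; real profiling needs far richer rules.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,20}$"),
}


def profile_columns(rows, threshold=0.5):
    """Flag columns where at least `threshold` of non-empty values match a PII pattern."""
    flagged = {}
    columns = {key for row in rows for key in row}
    for column in sorted(columns):
        values = [str(row[column]) for row in rows
                  if row.get(column) not in (None, "")]
        if not values:
            continue
        for label, pattern in PII_PATTERNS.items():
            share = sum(1 for v in values if pattern.match(v)) / len(values)
            if share >= threshold:
                flagged[column] = label
                break
    return flagged


# Hypothetical extract from a test table with unhelpful column names.
sample_rows = [
    {"id": 1, "field_a": "ana@example.com", "field_b": "+44 20 7946 0000", "notes": "renewal due"},
    {"id": 2, "field_a": "bo@example.com", "field_b": "", "notes": "called twice"},
    {"id": 3, "field_a": "n/a", "field_b": "020 7946 0001", "notes": ""},
]

flagged_columns = profile_columns(sample_rows)
```

Profiling by content rather than by column name matters in legacy systems, where a field called `field_a` can quietly hold customer email addresses. A loose phone pattern like this one will also flag long numeric identifiers, so anything it reports needs a human review.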

If I decide that I cannot use production data, should I mask or generate? Or both?

The latest World Quality Report also finds that 65% of organisations anonymize at least some of the production data they use in testing, and over half generate synthetic test data. Masking can offer a way to address many compliance requirements when testing, and to reduce the risk of a data breach. However, a few things should be considered when deciding how to create data to provision to test environments:

  1. Test environments are typically less secure than production and, from a security standpoint, would therefore ideally contain no personally identifiable information (PII). Ask yourself: How sure are you that no sensitive information can be garnered from masked data? What about when the information left visible in masked data sets is combined with other sources, for instance readily available information online or in other data sources available at your organisation?
  2. Masking is complex and can damage the integrity of data. This is particularly true when reckoning with complex data trends, for example temporal patterns in historical data. If you can mask and retain all the data relationships, you can most likely synthetically generate data from scratch using the same data model. While some great technologies exist for masking, you might consider generation in some cases as a way to create wholly fictitious data. This will also unlock the other benefits of synthetic data generation.
  3. Masking existing data does nothing to improve the quality of the test data, or the speed with which it is allocated to tests. Why not turn compliance into an opportunity for faster, higher coverage, and potentially more accurate testing? Generating missing data needed to test applications rigorously offers a method for doing just this, especially when the data “Find and Makes” are performed automatically as a standard step within test execution.
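
The first two points can be shown in a few lines. The sketch below masks email addresses with a keyed hash so that the same input always produces the same token, which keeps joins between tables intact, and then counts how many records share each combination of the fields left visible. The key, the field names and the sample records are hypothetical, and this is only one of many possible masking techniques.

```python
import hashlib
import hmac
from collections import Counter

SECRET_KEY = b"demo-key-not-for-real-use"  # hypothetical; keep real keys out of test environments


def pseudonymise(value):
    """Deterministic keyed hash: equal inputs give equal tokens, so joins still work."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]


def smallest_group(rows, quasi_identifiers):
    """Size of the rarest combination of quasi-identifier values (the k in k-anonymity)."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


customers = [
    {"email": "ana@example.com", "postcode": "AB1", "birth_year": 1980},
    {"email": "bo@example.com", "postcode": "AB1", "birth_year": 1980},
    {"email": "cy@example.com", "postcode": "ZZ9", "birth_year": 1951},
]
orders = [
    {"email": "ana@example.com", "total": 9.5},
    {"email": "cy@example.com", "total": 4.0},
]

masked_customers = [{**c, "email": pseudonymise(c["email"])} for c in customers]
masked_orders = [{**o, "email": pseudonymise(o["email"])} for o in orders]

# Referential integrity survives: every masked order still points at a masked customer.
customer_tokens = {c["email"] for c in masked_customers}
orphans = [o for o in masked_orders if o["email"] not in customer_tokens]

# The masked table still holds a (postcode, birth_year) pair that only one person has.
k = smallest_group(masked_customers, ["postcode", "birth_year"])
```

Here the relationships are preserved, but one customer remains unique on postcode and birth year alone, so anyone who can combine the masked set with another source might still single that person out. It is also worth remembering that pseudonymised data of this kind is still treated as personal data under GDPR.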

In other words, synthetic test data generation is a technology that can enable greater security, while also facilitating more rigorous, faster testing. The reality is that few organisations will be able to wholly replace their data with comprehensive synthetic data overnight. However, a hybrid approach is possible, gradually replacing production data sources with synthetic or virtualized data streams. This in turn feeds accurate and rigorous testing, often with less likelihood of sensitive data making it to test environments.
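
A data “Find and Make” of the kind mentioned above might be sketched as follows: look in the existing test data pool for a record that satisfies the test case, and generate a wholly fictitious one only if nothing matches. The function name, the fields and the value ranges are assumptions made for the example, not a description of any particular tool.

```python
import random


def find_or_make(pool, criteria, rng):
    """Return a pooled row matching `criteria`, or append a synthetic one that does."""
    for row in pool:
        if all(row.get(key) == wanted for key, wanted in criteria.items()):
            return row, "found"
    # Nothing suitable exists, so build a fictitious record from scratch.
    synthetic = {
        "customer_id": f"SYN-{rng.randrange(10**6):06d}",
        "country": rng.choice(["GB", "FR", "DE"]),
        "account_status": rng.choice(["active", "suspended", "closed"]),
        "balance": round(rng.uniform(0, 5000), 2),
    }
    synthetic.update(criteria)  # force the attributes this test case needs
    pool.append(synthetic)
    return synthetic, "made"


rng = random.Random(42)  # seeded so that test runs are repeatable
pool = [
    {"customer_id": "SYN-000001", "country": "GB",
     "account_status": "active", "balance": 120.0},
]

first, how_first = find_or_make(pool, {"country": "GB", "account_status": "active"}, rng)
second, how_second = find_or_make(pool, {"country": "DE", "account_status": "suspended"}, rng)
third, how_third = find_or_make(pool, {"country": "DE", "account_status": "suspended"}, rng)
```

The second request has no match in the pool, so a record is made and added; the third request then finds it. Run as a standard step before each test, this kind of lookup lets the pool fill with the combinations the tests actually need, without anyone reaching for production data.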

What do you think? Does this align with your interpretation of current legislation and its relation to testing, and what are the main challenges we face as a community in responding to consumer concern over how we use their data? Please feel free to drop me an email with your thoughts.