Today, good marketing relies on having detailed and accurate customer data. And companies, not surprisingly, are eager to collect vast troves of it. For instance, Amazon continuously tracks the behaviors of its 100 million Prime members, an example of “first-party” data. And many companies have found that sharing their own customer information with other companies creates synergies for both parties, especially with the increasing availability of “internet of things” data (GPS sensors, smart utility meters, fitness devices, etc.). These are examples of “second-party” data. Finally, many companies supplement their first-party data with “third-party” data from companies like Acxiom, which collects up to 1,500 data points on 700 million consumers worldwide.
The potential to conduct effective data-driven marketing with these augmented databases is enormous. At the same time, concerns about customer privacy have never been higher because of numerous widely publicized privacy breaches, such as the recent Facebook-Cambridge Analytica scandal. Consumer responses to these privacy breaches range from increasing reluctance to share their data to massive erosion of trust in the brand. For instance, when Yahoo’s three billion user accounts were hacked, Verizon lowered its purchase price for the company by $350 million.
Studies have shown that consumers are willing to share information with a brand that they trust will protect their information. Greater regulation is being enacted to ensure that businesses are accountable, and that consumers have the right to delete, transfer, or obtain a copy of their data. For instance, the General Data Protection Regulation (GDPR) took effect in the European Union on May 25, 2018, and is being closely watched in the U.S.
The trillion-dollar question is whether it is possible for businesses to reap the promised benefits of data-driven marketing while maintaining the privacy of customers’ data.
Current approaches to protecting data
The most common data protection approach currently being followed by businesses is to control access to the data after it’s been gathered. This access control approach is woefully inadequate for multiple reasons. First, as soon as a company shares data either internally or externally, its ability to control access deteriorates rapidly. Further, practices like pseudonymization, which GDPR promotes as a safeguard and defines as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information,” are not sufficient, as we explain below.
Consider the example shown in the exhibit below, where two retailers enter a second-party data sharing partnership. Although Retailer B’s data was pseudonymized by removing all personally identifiable information, it is not really anonymous because the combination of age range, timestamp, gender, and zip code creates a unique population record which can be linked to the additional information from Retailer A. Although these retailers may comply with the law, there is a significant privacy risk to consumers.
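To make the linkage risk concrete, here is a minimal Python sketch of how such a match could work. All records, field names, and the `link` helper are hypothetical and invented for illustration; they are not taken from the exhibit. The sketch assumes both retailers store the same four quasi-identifiers in the same format.

```python
from collections import defaultdict

# Fields that are not names or IDs, yet together can single out a person.
QUASI_IDENTIFIERS = ("age_range", "gender", "zip_code", "timestamp")

# Retailer A: identified records (hypothetical values).
retailer_a = [
    {"name": "Alice Example", "age_range": "30-39", "gender": "F",
     "zip_code": "19104", "timestamp": "2018-03-02 10:15"},
    {"name": "Bob Example", "age_range": "40-49", "gender": "M",
     "zip_code": "19104", "timestamp": "2018-03-02 11:40"},
    {"name": "Carol Example", "age_range": "20-29", "gender": "F",
     "zip_code": "08540", "timestamp": "2018-03-03 09:05"},
    {"name": "Dana Example", "age_range": "20-29", "gender": "F",
     "zip_code": "08540", "timestamp": "2018-03-03 09:05"},
]

# Retailer B: pseudonymized records with names and IDs removed.
retailer_b = [
    {"age_range": "30-39", "gender": "F", "zip_code": "19104",
     "timestamp": "2018-03-02 10:15", "purchase": "running shoes"},
    {"age_range": "40-49", "gender": "M", "zip_code": "19104",
     "timestamp": "2018-03-02 11:40", "purchase": "blood pressure monitor"},
    {"age_range": "20-29", "gender": "F", "zip_code": "08540",
     "timestamp": "2018-03-03 09:05", "purchase": "yoga mat"},
    {"age_range": "50-59", "gender": "M", "zip_code": "10001",
     "timestamp": "2018-03-04 14:20", "purchase": "headphones"},
]


def quasi_key(record):
    """Return the combination of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(identified, pseudonymized):
    """Attach a name to each pseudonymized record whose key matches exactly one person."""
    index = defaultdict(list)
    for record in identified:
        index[quasi_key(record)].append(record)
    matches = []
    for record in pseudonymized:
        candidates = index.get(quasi_key(record), [])
        if len(candidates) == 1:  # a unique combination means re-identification
            matches.append((candidates[0]["name"], record["purchase"]))
    return matches


reidentified = link(retailer_a, retailer_b)
for name, purchase in reidentified:
    print(f"{name} bought: {purchase}")
```

In this toy data, two of the four pseudonymized records are re-identified. The third is shielded only because two people happen to share the same combination of values, and the fourth has no counterpart in Retailer A's data.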
Synthetic data as protection
Public agencies like the U.S. Census Bureau and the Department of Agriculture that collect sensitive data (e.g., typical purchases by Supplemental Nutrition Assistance Program beneficiaries) are required by law to share the data publicly. These agencies follow an approach of transforming the original data to protected data, which are then released. In this approach, the sensitive variables that need to be protected in the original data are systematically perturbed using methods like the following (to illustrate, we use the example of protecting weekly sales of retail stores in point-of-sale data):
- Adding random noise. For example, observations are grouped into deciles based on sales, and a random number is added to the sales in each decile.
- Rounding. For example, sales are rounded to the nearest hundred.
- Top coding. For example, all sales above a threshold value, such as 100, are set equal to 100.
- Swapping. For example, observations are divided into groups and their sales data are exchanged.
- Aggregating. For example, weekly sales are summed and prices and promotions are averaged across stores within a market.
- Creating synthetic data. For example, sales are simulated from a probability distribution.
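The methods in this list can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the procedure any agency actually uses: the function names, the sample sales figures, the uniform noise scaled to each decile's mean, and the lognormal distribution used for simulation are all choices made here for demonstration.

```python
import math
import random
import statistics
from collections import defaultdict


def add_noise_by_decile(sales, fraction=0.1, rng=None):
    """Add uniform noise whose width is a fraction of each sales decile's mean."""
    rng = rng or random.Random()
    n = len(sales)
    order = sorted(range(n), key=lambda i: sales[i])
    deciles = defaultdict(list)
    for rank, i in enumerate(order):
        deciles[rank * 10 // n].append(i)
    noisy = list(sales)
    for members in deciles.values():
        width = fraction * statistics.mean(sales[i] for i in members)
        for i in members:
            noisy[i] = sales[i] + rng.uniform(-width, width)
    return noisy


def round_to_nearest(sales, base=100):
    """Round each value to the nearest multiple of base (halves round up)."""
    return [int(math.floor(s / base + 0.5)) * base for s in sales]


def top_code(sales, threshold):
    """Replace every value above the threshold with the threshold."""
    return [min(s, threshold) for s in sales]


def swap_within_groups(sales, groups, rng=None):
    """Shuffle sales values among the observations that share a group label."""
    rng = rng or random.Random()
    by_group = defaultdict(list)
    for i, label in enumerate(groups):
        by_group[label].append(i)
    swapped = list(sales)
    for members in by_group.values():
        values = [sales[i] for i in members]
        rng.shuffle(values)
        for i, value in zip(members, values):
            swapped[i] = value
    return swapped


def aggregate_by_market(rows):
    """Sum sales over stores; rows are (market, week, sales) tuples."""
    totals = defaultdict(float)
    for market, week, sales in rows:
        totals[(market, week)] += sales
    return dict(totals)


def synthesize_lognormal(sales, rng=None):
    """Simulate new sales from a lognormal distribution fitted to the originals."""
    rng = rng or random.Random()
    logs = [math.log(s) for s in sales]
    mu, sigma = statistics.mean(logs), statistics.stdev(logs)
    return [rng.lognormvariate(mu, sigma) for _ in sales]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weekly_sales = [120, 340, 95, 1250, 780, 410, 2300, 60, 515, 990]  # hypothetical

noisy = add_noise_by_decile(weekly_sales, fraction=0.1, rng=rng)
rounded = round_to_nearest(weekly_sales)
capped = top_code(weekly_sales, 1000)
swapped = swap_within_groups(weekly_sales, ["A"] * 5 + ["B"] * 5, rng=rng)
totals = aggregate_by_market([("East", 1, 120), ("East", 1, 340), ("West", 1, 95)])
synthetic = synthesize_lognormal(weekly_sales, rng=rng)
```

Each function trades information for protection in a different way. Rounding and top coding blur individual values, swapping keeps the overall distribution but breaks the link between a value and its store, aggregation hides stores inside market totals, and synthesis replaces the observed values entirely.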
These agencies use the process of perturbation to manage the trade-off between preserving the useful information in the original data and reducing the opportunity for an intruder to violate privacy. The original data are kept in secure access environments unless deletion is required. We believe that businesses should consider taking a page out of the playbook of these agencies to strengthen their own data protection practices.
We have shown in two published articles how a statistical model can be used to convert original marketing data to synthetic data for the protection of consumers. A key idea in this approach is that the marketing goals for which the data are being gathered are taken into account in the process of synthesizing, thereby carefully trading off the loss of information against the gain in protection.
For instance, consider a very widely used form of data, retail point-of-sale data, which are gathered by marketing research companies like ACNielsen and SymphonyIRI from retail stores. The data are then aggregated across the retail stores within a market in order to prevent the stores from being identified, and are purchased by almost all major consumer packaged goods companies, like Procter & Gamble and Unilever. Brand managers use the data to monitor how their brands are performing, as well as to compute marketing metrics like price elasticities and promotion lift factors. However, the aggregation can severely distort the metrics that brand managers use to make important decisions, like how much to spend on trade promotions. An alternative approach to protect the stores’ identities is to convert the original data to synthetic data using a statistical model. Our research has demonstrated that this approach provides dramatically more accurate metrics than aggregate data, yet protects the stores’ identities very well.
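The general idea of model-based synthesis can be illustrated with a deliberately simple Python sketch. It is not the model from the published research: it assumes a single store, a log-log demand equation fitted by ordinary least squares, and made-up prices and sales with a true price elasticity of -2. A fuller treatment would also account for uncertainty in the fitted parameters.

```python
import math
import random
import statistics


def fit_log_log(prices, sales):
    """OLS fit of log(sales) = a + b * log(price); b is the price elasticity."""
    x = [math.log(p) for p in prices]
    y = [math.log(s) for s in sales]
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    return a, b, sigma


def synthesize_sales(prices, a, b, sigma, rng):
    """Simulate sales from the fitted model, drawing fresh random noise."""
    return [math.exp(a + b * math.log(p) + rng.gauss(0, sigma)) for p in prices]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical "original" data for one store: 200 weeks of prices and sales.
prices = [rng.uniform(2.0, 4.0) for _ in range(200)]
sales = [math.exp(6.0 - 2.0 * math.log(p) + rng.gauss(0, 0.1)) for p in prices]

# Fit the model to the original data, then release simulated sales instead.
a, b, sigma = fit_log_log(prices, sales)
synthetic_sales = synthesize_sales(prices, a, b, sigma, rng)

# A brand manager re-estimating elasticity from the synthetic data.
_, b_synthetic, _ = fit_log_log(prices, synthetic_sales)
print(f"elasticity from original data:  {b:.2f}")
print(f"elasticity from synthetic data: {b_synthetic:.2f}")
```

Because the synthetic sales are generated from a model built around the metric of interest, the elasticity estimated from them should stay close to the one estimated from the original data, even though no released sales figure is an observed value.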
The promised benefits of data-driven marketing are at grave risk unless businesses can do a better job of protecting against unwanted data disclosures. The current approach of controlling access to the data or removing personally identifiable information does not control the risk of disclosure adequately. Other approaches, such as aggregation, lead to severe degradation of information. It’s time for businesses to consider using statistical approaches to convert the original data to synthetic data so they remain valuable for data-driven marketing, yet adequately protected.