SQL 匿名客户数据以进行开发或测试
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/260307/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Anonymizing customer data for development or testing
提问by BradC
I need to take production data with real customer info (names, address, phone numbers, etc) and move it into a dev environment, but I'd like to remove any semblance of realcustomer info.
我需要使用真实客户信息(姓名、地址、电话号码等)获取生产数据并将其移动到开发环境中,但我想删除任何真实客户信息的表象。
Some of the answers to this questioncan help me generating NEW test data, but then how do I replace those columns in my production data, but keep the other relevant columns?
这个问题的一些答案可以帮助我生成新的测试数据,但是如何替换生产数据中的这些列,但保留其他相关列?
Let's say I had a table with 10000 fake names. Should I do a cross-join with a SQL update? Or do something like
假设我有一张包含 10000 个假名的表。我应该对 SQL 更新进行交叉联接吗?或者做类似的事情
UPDATE table
SET lastname = (SELECT TOP 1 name FROM samplenames ORDER By NEWID())
采纳答案by John Lemp
Anonymizing data can be tricky and if not done correctly can lead you to trouble, like what happened to AOL when they released search data a while back. I would attempt to create test data from scratch at all costs before I tried to convert existing customer data. Things may lead you to be able to figure out who the data belonged to using things such as behavioral analysis and other data points that you might not consider sensitive. I would rather be safe than sorry.
匿名数据可能很棘手,如果处理不当可能会给您带来麻烦,就像AOL 不久前发布搜索数据时发生的情况一样。在尝试转换现有客户数据之前,我会不惜一切代价尝试从头开始创建测试数据。事情可能会导致您能够使用行为分析和其他您可能不认为敏感的数据点来确定数据属于谁。我宁愿安全也不愿后悔。
回答by tomjedrz
This is easier than it sounds if you understand the database. One thing that is necessary is to understand the places where personal info is not normalized. For instance, the customer master file will have a name and address, but the order file will also have a name and address that might be different.
如果您了解数据库,这比听起来容易。需要做的一件事是了解个人信息未规范化的地方。例如,客户主文件将具有名称和地址,但订单文件也将具有可能不同的名称和地址。
My basic process:
我的基本流程:
- ID the data (i.e. the columns), and the tables which contain those columns.
- ID the "master" tables for those columns, and also the non-normailzed instances of those columns.
- Adjust the master files. Rather than trying to randomize them, (or make them phony), connect them to the key of the file. For customer 123, set the name to name123, the address to 123 123rd St, 123town, CA, USA, phone 1231231231. This has the added bonus of making debugging very easy!
- Change the non-normal instances by either updating from the master file orby doing the same kind of de-personalization
- ID 数据(即列),以及包含这些列的表。
- 标识这些列的“主”表,以及这些列的非规范化实例。
- 调整主文件。与其尝试将它们随机化(或使它们伪装),不如将它们连接到文件的密钥。对于客户 123,将名称设置为 name123,将地址设置为 123 123rd St, 123town, CA, USA,电话 1231231231。这样可以使调试变得非常容易!
- 通过从主文件更新或通过执行相同类型的去个性化来更改非正常实例
It doesn't look pretty, but it works.
它看起来不漂亮,但它有效。
回答by Sam
There are a couple of tools out there to remove sensitive data from databases that I've found. Note that I haven't tried any of them myself:
有一些工具可以从我发现的数据库中删除敏感数据。请注意,我自己没有尝试过其中任何一个:
- Data::Anonymization, a RubyGem.
- Mysql Anonymous, a Python script.
- Data::Anonymization,一个 RubyGem。
- Mysql Anonymous,一个 Python 脚本。
There's also a collection of sanitisation DB scripts here which might be helpful: https://gist.github.com/Tyriar/d3635c6b6e32ac406623
这里还有一些可能有用的消毒数据库脚本:https: //gist.github.com/Tyriar/d3635c6b6e32ac406623