C#: What does "data massage" mean?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/577892/

Time: 2020-08-04 09:00:47  Source: igfitidea

What does "Data Massage" mean?

Tags: c#, .net, asp.net, sql, database-design

Asked by MrM

I am doing some reading, and came across advice to avoid an internalStore if my application does not need to massage the data before it is sent to SQL. What is a data massage?

Accepted answer by Adam Davis

Manipulate, process, alter, recalculate. In short, if you are just moving the data in raw then no need to use internalStore, but if you're doing anything to it prior to storage, then you might want an internalStore.

Answer by tvanfosson

Clean up, normalization, filtering, ... Just changing the data somehow from the original input into a form that is better suited to your use.

Answer by Anthony

Sometimes the whole process of moving data is referred to as "ETL" meaning "Extract, Transform, Load". Massaging the data is the "transform" step, but it implies ad-hoc fixes that you have to do to smooth out problems that you have encountered (like a massage does to your muscles) rather than transformations between well-known formats.

Things that you might do to "massage" data include:

  • Change formats from what the source system emits to what the target system expects, e.g. change date format from d/m/y to m/d/y.
  • Replace missing values with defaults, e.g. supply "0" when a quantity is not given.
  • Filter out records that are not needed in the target system.
  • Check validity of records, and ignore or report on rows that would cause an error if you tried to insert them.
  • Normalise data to remove variations between values that should be the same, e.g. replace upper case with lower case, replace "01" with "1".
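
As a rough illustration of the list above, here is a minimal sketch in Python (chosen only for brevity; nothing in the original question depends on it). The function name `massage`, the field names `date`, `quantity`, `sku` and `status`, and the rule that cancelled records are not wanted are all assumptions made up for this example.

```python
from datetime import datetime

def massage(row):
    """Return a cleaned copy of a source row, or None if it should be skipped.

    Field names and the "cancelled" rule are hypothetical.
    """
    # Filter: drop records the target system does not need.
    if (row.get("status") or "").strip().lower() == "cancelled":
        return None

    try:
        # Validate: strptime rejects impossible dates such as 30/02/2009.
        parsed = datetime.strptime((row.get("date") or "").strip(), "%d/%m/%Y")
        # Default: supply 0 when no quantity is given; int() also turns "01" into 1.
        quantity = int((row.get("quantity") or "").strip() or "0")
    except ValueError:
        # Ignore the row here; a real import might report it instead.
        return None

    return {
        "date": parsed.strftime("%m/%d/%Y"),               # d/m/y -> m/d/y
        "quantity": quantity,
        "sku": (row.get("sku") or "").strip().lower(),     # normalise case
    }
```

A row such as `{"date": "04/08/2020", "quantity": "01", "sku": "ABC"}` would come back with the date in m/d/y order, the quantity as the number 1, and the sku in lower case, while a row dated `30/02/2009` would be skipped.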

Answer by HLGEM

And finally there is the less savory practice of massaging the data by throwing out data (or adjusting the numbers) when they don't give you the answer you want. Unfortunately, people doing statistical analysis often massage the data to get rid of those pesky outliers which disprove their theory. Because of this practice, referring to data cleaning as massaging the data is inappropriate. Cleaning the data to make it something that can go into your system (getting rid of meaningless dates like 02/30/2009 because someone else stored them in varchar instead of as dates, separating first and last names into separate fields, fixing all-uppercase data, adding default values for fields that require data when none is supplied, etc.) is one thing; massaging the data implies a practice of adjusting the data inappropriately.

Also, to comment on the idea that it is bad to have an internal store if you are not changing any data: I strongly disagree with this (and I have loaded thousands of files from hundreds of sources through the years). In the first place, there is virtually no data that doesn't need to at least be examined for cleaning. And the fact that a file was OK in the first run doesn't guarantee that a year later it won't be putting garbage into your system. Loading any file without first putting it into a staging table and cleaning it is simply irresponsible.

Also, we find it easier to research issues with data if we can easily see the contents of the file we loaded in a staging table. Then we can pinpoint exactly which file/source gave us the data in question, and that resolves many issues where the customer thinks we are loading bad information that they actually sent us to load. In fact we always use two staging tables, one for the raw data as it came in from the file and one for the data after cleaning but before loading to the production tables. As a result I can resolve issues in seconds or minutes that would take hours if I had to go back and search through the original files. Because one thing you can guarantee is that if you are importing data, there will be times when the content of that data will be questioned.

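
The two-staging-table approach described in this answer could look roughly like the sketch below. It uses Python with an in-memory SQLite database purely for illustration; the table names (`staging_raw`, `staging_clean`, `orders`), the columns, the file name `feed_a.csv` and the cleaning rules are all assumptions for this example, not the answerer's actual schema.

```python
import sqlite3
from datetime import datetime

# Hypothetical schema: raw staging, cleaned staging, and a production table.
SCHEMA = """
CREATE TABLE staging_raw   (id INTEGER PRIMARY KEY, source_file TEXT, customer TEXT, order_date TEXT);
CREATE TABLE staging_clean (raw_id INTEGER, source_file TEXT, customer TEXT, order_date TEXT);
CREATE TABLE orders        (customer TEXT, order_date TEXT, source_file TEXT);
"""

def is_valid_date(text):
    """True if text is a real calendar date in m/d/y form."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False

def import_file(conn, source_file, rows):
    # 1. Keep the data exactly as it arrived, tagged with its source file.
    conn.executemany(
        "INSERT INTO staging_raw (source_file, customer, order_date) VALUES (?, ?, ?)",
        [(source_file, customer, order_date) for customer, order_date in rows],
    )
    # 2. Clean into the second staging table: fix all-uppercase names, drop bad dates.
    raw = conn.execute(
        "SELECT id, customer, order_date FROM staging_raw WHERE source_file = ?",
        (source_file,),
    ).fetchall()
    clean = [
        (raw_id, source_file, customer.strip().title(), order_date.strip())
        for raw_id, customer, order_date in raw
        if is_valid_date(order_date.strip())
    ]
    conn.executemany("INSERT INTO staging_clean VALUES (?, ?, ?, ?)", clean)
    # 3. Only cleaned rows reach production.
    conn.execute(
        "INSERT INTO orders (customer, order_date, source_file) "
        "SELECT customer, order_date, source_file FROM staging_clean WHERE source_file = ?",
        (source_file,),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
import_file(conn, "feed_a.csv", [("ALICE SMITH", "01/15/2009"), ("BOB JONES", "02/30/2009")])
```

When the content of a row is later questioned, a query against `staging_raw` filtered by `source_file` shows what was actually received, including the rejected 02/30/2009 row, without reopening the original file.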