Note: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/38413579/
What is the difference between sklearn LabelEncoder and pd.get_dummies?
Asked by Sam
I wanted to know the difference between sklearn's LabelEncoder and pandas' get_dummies. Why would one choose LabelEncoder over get_dummies? What are the advantages and disadvantages of each?
As far as I understand, if I have a class A
ClassA = ["Apple", "Ball", "Cat"]
encoder = [1, 2, 3]
and
dummy = [001, 010, 100]
Am I understanding this incorrectly?
Accepted answer by Ami Tavory
These are just convenience functions that fall naturally out of the way these two libraries tend to do things, respectively. The first one "condenses" the information by changing things to integers, and the second one "expands" the dimensions, allowing (possibly) more convenient access.
sklearn.preprocessing.LabelEncoder
simply transforms data, from whatever domain, so that its domain is 0, ..., k - 1, where k is the number of classes.
So, for example
所以,例如
["paris", "paris", "tokyo", "amsterdam"]
could become
可以成为
[0, 0, 1, 2]
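As a concrete sketch of this (note that the exact integers depend on the sort order of the classes, since LabelEncoder stores its classes sorted):

```python
from sklearn.preprocessing import LabelEncoder

cities = ["paris", "paris", "tokyo", "amsterdam"]

le = LabelEncoder()
codes = le.fit_transform(cities)

# LabelEncoder sorts the unique values, so the mapping here is
# amsterdam -> 0, paris -> 1, tokyo -> 2
print(le.classes_.tolist())  # ['amsterdam', 'paris', 'tokyo']
print(codes.tolist())        # [1, 1, 2, 0]

# inverse_transform recovers the original string labels
print(le.inverse_transform(codes).tolist())
```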
pandas.get_dummies
also takes a Series with elements from some domain, but expands it into a DataFrame whose columns correspond to the entries in the series, and the values are 0 or 1 depending on what they originally were. So, for example, the same
["paris", "paris", "tokyo", "amsterdam"]
would become a DataFrame with labels
将成为带有标签的 DataFrame
["paris", "tokyo", "amsterdam"]
and whose "paris" entry would be the series
[1, 1, 0, 0]
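A minimal sketch of the same data with get_dummies (recent pandas versions return boolean columns by default, so dtype=int is passed here to get the 0/1 values shown above):

```python
import pandas as pd

s = pd.Series(["paris", "paris", "tokyo", "amsterdam"])

# One column per distinct value in the series
dummies = pd.get_dummies(s, dtype=int)

print(dummies.columns.tolist())   # ['amsterdam', 'paris', 'tokyo']
print(dummies["paris"].tolist())  # [1, 1, 0, 0]
```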
The main advantage of the first method is that it conserves space. Conversely, encoding things as integers might give the impression (to you or to some machine learning algorithm) that the order means something. Is "amsterdam" closer to "tokyo" than to "paris" just because of the integer encoding? Probably not. The second representation is a bit clearer on that.
Answered by Yuchao Jiang
pandas.get_dummies is one-hot encoding, but sklearn.preprocessing.LabelEncoder is incremental encoding, such as 0, 1, 2, 3, 4, ...
One-hot encoding is more suitable for machine learning, because the labels are independent of each other; e.g. 2 doesn't mean twice the value of 1.
If the training set and the test set have a different number of classes for the same feature, please refer to Keep same dummy variable in training and testing data for two solutions.
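One common fix for that mismatch (a sketch of the general idea, not quoted from the linked answer) is to reindex the test-set dummies to the training columns, so that categories unseen in training are dropped and categories missing from the test set become all-zero columns:

```python
import pandas as pd

train = pd.get_dummies(pd.Series(["paris", "tokyo", "paris"]), dtype=int)
test = pd.get_dummies(pd.Series(["paris", "amsterdam"]), dtype=int)

# Force the test frame to have exactly the training columns:
# 'amsterdam' (unseen in training) is dropped, and 'tokyo'
# (absent from the test set) is filled with zeros.
test_aligned = test.reindex(columns=train.columns, fill_value=0)

print(train.columns.tolist())         # ['paris', 'tokyo']
print(test_aligned.columns.tolist())  # ['paris', 'tokyo']
```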