Note: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/38413579/
What is the difference between sklearn LabelEncoder and pd.get_dummies?
Asked by Sam
I wanted to know the difference between sklearn's LabelEncoder and pandas' get_dummies. Why would one choose LabelEncoder over get_dummies? What are the advantages and disadvantages of each?
As far as I understand, if I have a class A
ClassA = ["Apple", "Ball", "Cat"]
encoder = [1, 2, 3]
and
dummy = [001, 010, 100]
Am I understanding this incorrectly?
Accepted answer by Ami Tavory
These are just convenience functions that fall naturally out of the way these two libraries tend to do things, respectively. The first one "condenses" the information by changing things to integers, and the second one "expands" the dimensions, allowing (possibly) more convenient access.
sklearn.preprocessing.LabelEncoder
simply transforms data, from whatever domain, so that its domain is 0, ..., k - 1, where k is the number of classes.
So, for example
所以,例如
["paris", "paris", "tokyo", "amsterdam"]
could become
可以成为
[0, 0, 1, 2]
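As a concrete sketch of this (note that the exact integers depend on the sort order of the classes, since LabelEncoder stores its classes sorted):

```python
from sklearn.preprocessing import LabelEncoder

cities = ["paris", "paris", "tokyo", "amsterdam"]

le = LabelEncoder()
codes = le.fit_transform(cities)

# LabelEncoder sorts the unique values, so the mapping here is
# amsterdam -> 0, paris -> 1, tokyo -> 2
print(le.classes_.tolist())  # ['amsterdam', 'paris', 'tokyo']
print(codes.tolist())        # [1, 1, 2, 0]

# inverse_transform recovers the original string labels
print(le.inverse_transform(codes).tolist())
```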
pandas.get_dummies
also takes a Series with elements from some domain, but expands it into a DataFrame whose columns correspond to the entries in the series, and the values are 0 or 1 depending on what they originally were. So, for example, the same
["paris", "paris", "tokyo", "amsterdam"]
would become a DataFrame with labels
将成为带有标签的 DataFrame
["paris", "tokyo", "amsterdam"]
and whose "paris" entry would be the series
[1, 1, 0, 0]
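A minimal sketch of the same data with get_dummies (recent pandas versions return boolean columns by default, so dtype=int is passed here to get the 0/1 values shown above):

```python
import pandas as pd

s = pd.Series(["paris", "paris", "tokyo", "amsterdam"])

# One column per distinct value in the series
dummies = pd.get_dummies(s, dtype=int)

print(dummies.columns.tolist())   # ['amsterdam', 'paris', 'tokyo']
print(dummies["paris"].tolist())  # [1, 1, 0, 0]
```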
The main advantage of the first method is that it conserves space. Conversely, encoding things as integers might give the impression (to you or to some machine learning algorithm) that the order means something. Is "amsterdam" closer to "tokyo" than to "paris" just because of the integer encoding? Probably not. The second representation is a bit clearer on that.
Answered by Yuchao Jiang
pandas.get_dummies is one-hot encoding, but sklearn.preprocessing.LabelEncoder is incremental encoding, such as 0, 1, 2, 3, 4, ...
One-hot encoding is more suitable for machine learning, because the labels are independent of each other; e.g. 2 doesn't mean twice the value of 1.
If the training set and the test set have a different number of classes for the same feature, please refer to Keep same dummy variable in training and testing data for two solutions.
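One common fix for that mismatch (a sketch of the general idea, not quoted from the linked answer) is to reindex the test-set dummies to the training columns, so that categories unseen in training are dropped and categories missing from the test set become all-zero columns:

```python
import pandas as pd

train = pd.get_dummies(pd.Series(["paris", "tokyo", "paris"]), dtype=int)
test = pd.get_dummies(pd.Series(["paris", "amsterdam"]), dtype=int)

# Force the test frame to have exactly the training columns:
# 'amsterdam' (unseen in training) is dropped, and 'tokyo'
# (absent from the test set) is filled with zeros.
test_aligned = test.reindex(columns=train.columns, fill_value=0)

print(train.columns.tolist())         # ['paris', 'tokyo']
print(test_aligned.columns.tolist())  # ['paris', 'tokyo']
```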