Convert multiple columns of a Pandas DataFrame to dummy variables - Python

Question

Asked by CreamStat

I have this dataframe:

[image: the asker's dataframe]

As far as I know, to use the scikit-learn package in Python for machine learning tasks, categorical variables should be converted to dummy variables. So, for example, I tried to use scikit-learn to convert the values of the third column to dummy values, but my code didn't work:

from sklearn.preprocessing import LabelEncoder

x[:, 2] = LabelEncoder().fit_transform(x[:,2])

So what's wrong with my code? And how can I convert all the categorical variables in my data frame to dummy variables?

Edit: The full traceback is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-73-c0d726db979e> in <module>()
      1 from sklearn.preprocessing import LabelEncoder
      2 
----> 3 x[:, 2] = LabelEncoder().fit_transform(x[:,2])

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   2001             # get column
   2002             if self.columns.is_unique:
-> 2003                 return self._get_item_cache(key)
   2004 
   2005             # duplicate columns

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
    665             return cache[item]
    666         except Exception:
--> 667             values = self._data.get(item)
    668             res = self._box_item_values(item, values)
    669             cache[item] = res

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\internals.pyc in get(self, item)
   1653     def get(self, item):
   1654         if self.items.is_unique:
-> 1655             _, block = self._find_block(item)
   1656             return block.get(item)
   1657         else:

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\internals.pyc in _find_block(self, item)
   1933 
   1934     def _find_block(self, item):
-> 1935         self._check_have(item)
   1936         for i, block in enumerate(self.blocks):
   1937             if item in block:

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\internals.pyc in _check_have(self, item)
   1939 
   1940     def _check_have(self, item):
-> 1941         if item not in self.items:
   1942             raise KeyError('no item named %s' % com.pprint_thing(item))
   1943 

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\index.pyc in __contains__(self, key)
    317 
    318     def __contains__(self, key):
--> 319         hash(key)
    320         # work around some kind of odd cython bug
    321         try:

TypeError: unhashable type

Answer 1

Answered by omun

I don't think LabelEncoder transforms your data to dummy variables (see scikit-learn.org/LabelEncoder). Instead, it creates new numerical labels for the variable.
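As a side note on the traceback in the question: a DataFrame's square brackets look up column labels, so x[:, 2] is passed as one key, the tuple (slice(None), 2), to that lookup, which is where hash(key) fails. Below is a minimal sketch of positional selection with .iloc instead. The frame, its column names, and its values are made up, since the original image is not available. It also uses pandas' own category codes as a stand-in for the integer labels LabelEncoder would produce, to show how such labels differ from dummy columns.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame; names and values are made up.
x = pd.DataFrame({"age": [25, 32, 47],
                  "income": [30.0, 52.5, 41.0],
                  "city": ["Lima", "Cusco", "Lima"]})

# x[:, 2] fails on a DataFrame; .iloc selects by integer position instead.
third = x.iloc[:, 2]

# Integer labels: one number per category, still a single column.
codes = third.astype("category").cat.codes

# Dummy variables: one 0/1 indicator column per category.
dummies = pd.get_dummies(third, dtype=int)

print(codes.tolist())          # [1, 0, 1]
print(list(dummies.columns))   # ['Cusco', 'Lima']
```

Integer labels imply an ordering between categories that usually does not exist, which is one reason dummy columns tend to be preferred for model inputs.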

I use the get_dummies function from pandas to do this (see pandas.pydata.org/dummies). Below is a simple example.

Create a simple DataFrame with categorical and numerical data

import pandas as pd

# Build every column as categorical, then turn the numeric column back into integers
X = pd.DataFrame({"Var1": ["a", "a", "b"],
                  "Var2": ["a", "b", "c"],
                  "Var3": [1, 2, 3]},
                 dtype="category")
X["Var3"] = X["Var3"].astype(int)

Transform data to dummy variables

pd.get_dummies(X)

Out[4]:

   Var3  Var1_a  Var1_b  Var2_a  Var2_b  Var2_c
0     1       1       0       1       0       0
1     2       1       0       0       1       0
2     3       0       1       0       0       1
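If only some columns should be expanded, get_dummies accepts a columns argument, and its dtype argument controls the type of the indicator values. The sketch below rebuilds the same data with plain string columns (an assumption made to keep it self-contained). pandas 2.0 and later return boolean indicators by default, so dtype=int is passed to get 0/1 values like those in the output above.

```python
import pandas as pd

# Same data as X above, rebuilt with plain string columns for this sketch.
X = pd.DataFrame({"Var1": ["a", "a", "b"],
                  "Var2": ["a", "b", "c"],
                  "Var3": [1, 2, 3]})

# Expand only Var1; Var2 and Var3 pass through unchanged.
only_var1 = pd.get_dummies(X, columns=["Var1"], dtype=int)
print(only_var1.columns.tolist())

# With no columns argument, every string or categorical column is expanded.
all_dummies = pd.get_dummies(X, dtype=int)
print(all_dummies.columns.tolist())
```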

Notice that Var1 was transformed to only two dummy variables, because only a and b occur in it. If you want all three categories [a, b, c] for Var1 as well, you need to add the missing category.

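# add_categories returns a new Series (the inplace argument was removed in pandas 2.0), so assign it back
X["Var1"] = X["Var1"].cat.add_categories("c")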

And the result:

pd.get_dummies(X)

Out[6]:

   Var3  Var1_a  Var1_b  Var1_c  Var2_a  Var2_b  Var2_c
0     1       1       0       0       1       0       0
1     2       1       0       0       0       1       0
2     3       0       1       0       0       0       1
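An alternative to adding categories after the fact is to declare the full set of levels up front with CategoricalDtype, so that every column shares the same levels. This is only a sketch, assuming that [a, b, c] is the complete set. The name levels is introduced here for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed complete set of levels shared by both categorical columns.
levels = CategoricalDtype(categories=["a", "b", "c"])

X = pd.DataFrame({"Var1": ["a", "a", "b"],
                  "Var2": ["a", "b", "c"],
                  "Var3": [1, 2, 3]})
X = X.astype({"Var1": levels, "Var2": levels})

# Levels that never occur still get a column (Var1_c is all zeros).
dummies = pd.get_dummies(X, dtype=int)
print(dummies)
```

Fixing the levels this way can also help keep the dummy columns consistent between training and test data, where a category may be missing from one of the two.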

Hope this helps