Python: Converting categorical data in a pandas dataframe

Question

Asked by Gilaztdinov Rustam

I have a dataframe with columns of the following types (there are too many columns to list them all):

col1        int64
col2        int64
col3        category
col4        category
col5        category

The columns look like this:

Name: col3, dtype: category
Categories (8, object): [B, C, E, G, H, N, S, W]

I want to convert all values in these columns to integers, like this:

[1, 2, 3, 4, 5, 6, 7, 8]

I solved this for one column with:

dataframe['c'] = pandas.Categorical.from_array(dataframe.col3).codes

Now I have two columns in my dataframe, the old col3 and the new c, and I need to drop the old column.

That's bad practice. It works, but my dataframe has many columns and I don't want to do this manually.

How can I do this in a Pythonic and clever way?

Answer 1

Accepted answer by joris

First, to convert a Categorical column to its numerical codes, there is an easier way: dataframe['col3'].cat.codes.
Further, you can automatically select all columns of a certain dtype in a dataframe using select_dtypes. This way, you can apply the operation above to multiple, automatically selected columns.

First, make an example dataframe:

In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),  'col3':list('ababb')})

In [76]: df['col2'] = df['col2'].astype('category')

In [77]: df['col3'] = df['col3'].astype('category')

In [78]: df.dtypes
Out[78]:
col1       int64
col2    category
col3    category
dtype: object

Then, by using select_dtypes to select the columns and applying .cat.codes to each of them, you get the following result:

In [80]: cat_columns = df.select_dtypes(['category']).columns

In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')

In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

In [84]: df
Out[84]:
   col1  col2  col3
0     1     0     0
1     2     1     1
2     3     2     0
3     4     0     1
4     5     1     1
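
Note that .cat.codes is 0-based, while the question asks for values starting at 1. The following is a minimal sketch of the same approach as a standalone script, assuming you also want to keep a lookup from integer back to label; the `labels` dictionary and the `+ 1` shift are illustrative additions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col2": list("abcab"),
    "col3": list("ababb"),
})
df = df.astype({"col2": "category", "col3": "category"})

# Automatically pick every categorical column.
cat_columns = df.select_dtypes(["category"]).columns

# Hypothetical helper mapping: remember integer -> label (1-based)
# before the categorical columns are overwritten.
labels = {
    col: dict(enumerate(df[col].cat.categories, start=1))
    for col in cat_columns
}

# .cat.codes starts at 0; adding 1 gives the 1-based numbering from the question.
df[cat_columns] = df[cat_columns].apply(lambda s: s.cat.codes + 1)

print(df)
print(labels)
```

Keep in mind that missing values get the code -1 from .cat.codes, so they would become 0 after the shift.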

Answer 2

Answered by Abhishek

If your only concern is that you are creating an extra column and deleting it later, just don't use a new column in the first place.

dataframe = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),  'col3':list('ababb')})
dataframe.col3 = pd.Categorical.from_array(dataframe.col3).codes

You are done. Since Categorical.from_array is deprecated, use Categorical directly:

dataframe.col3 = pd.Categorical(dataframe.col3).codes

If you also need the mapping back from index to label, there is an even better way to do the same thing:

dataframe.col3, mapping_index = pd.Series(dataframe.col3).factorize()

Check the result:

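print(dataframe)
# col3 only contains the labels "a" and "b", so look up one of those
# (this assumes factorize was applied to the original string column)
print(mapping_index.get_loc("b"))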
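
As a minimal sketch of the round trip with factorize, assuming a fresh dataframe whose col3 still holds the original strings (the names `codes`, `code_of_b`, and `restored` are illustrative):

```python
import pandas as pd

dataframe = pd.DataFrame({"col3": list("ababb")})

# factorize returns the integer codes and an Index of the unique labels,
# numbered in order of first appearance.
codes, mapping_index = pd.factorize(dataframe["col3"])
dataframe["col3"] = codes

# Label -> code lookup.
code_of_b = mapping_index.get_loc("b")

# Code -> label lookup: index the mapping with the integer codes.
restored = mapping_index[codes]

print(dataframe)
print(code_of_b)
print(list(restored))
```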

Answer 3

Answered by Prohadoopian

@Quickbeam2k1, see below:

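import sys
import numpy as np
import pandas as pd
dataset = pd.read_csv('Data2.csv')
np.set_printoptions(threshold=sys.maxsize)  # print arrays in full; recent NumPy versions reject threshold=np.nan
X = dataset.iloc[:, :].values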

Using sklearn:

from sklearn.preprocessing import LabelEncoder
labelencoder_X=LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

Answer 4

Answered by scottlittle

This works for me:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

Output:

[0, 1, 2, 0]
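
By default, factorize numbers the labels in order of first appearance and encodes missing values as -1. The sketch below illustrates both points, plus the `sort=True` option for codes that follow the sorted label order; the sample values are made up for illustration.

```python
import pandas as pd

values = pd.Series(["B", "C", None, "B", "A"])

# Default: codes follow the order of first appearance; missing values get -1.
codes, uniques = pd.factorize(values)

# sort=True: codes follow the sorted order of the labels instead.
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

print(codes.tolist(), list(uniques))
print(codes_sorted.tolist(), list(uniques_sorted))
```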

Answer 5

Answered by shantanu pathak

Here, multiple columns need to be converted. One approach I used is:

for col_name in df.columns:
    if(df[col_name].dtype == 'object'):
        df[col_name]= df[col_name].astype('category')
        df[col_name] = df[col_name].cat.codes

This converts every string/object column to categorical, then replaces each column with its category codes.
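
A variation on this loop, sketched under the assumption that every non-numeric column should be encoded: it selects the columns with select_dtypes rather than comparing dtypes by hand, and works on a copy so the original labels stay available. The sample data and the name `encoded` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "num": [1, 2, 3],
    "city": ["Oslo", None, "Bergen"],
})

# Assumption: every non-numeric column is a label column to encode.
text_columns = df.select_dtypes(exclude="number").columns

encoded = df.copy()
for col in text_columns:
    # Categories are sorted, so "Bergen" -> 0 and "Oslo" -> 1;
    # the missing value is encoded as -1.
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```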

Answer 6

Answered by Fatemeh Asgarinejad

To convert the categorical data in column C of the dataset data, do the following:

from sklearn.preprocessing import LabelEncoder 
labelencoder= LabelEncoder() #initializing an object of class LabelEncoder
data['C'] = labelencoder.fit_transform(data['C']) #fitting and transforming the desired categorical column.
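
A fitted LabelEncoder also keeps the labels it has seen, so the encoding can be reversed. This is a small sketch with made-up data, assuming scikit-learn is installed:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({"C": ["red", "green", "blue", "green"]})

labelencoder = LabelEncoder()
data["C"] = labelencoder.fit_transform(data["C"])

# classes_ holds the sorted labels; a label's position is its code.
print(list(labelencoder.classes_))

# inverse_transform maps the integer codes back to the original labels.
original = labelencoder.inverse_transform(data["C"])
print(list(original))
```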

Answer 7

Answered by SaTa

For a given column, if you don't care about the ordering, use this:

df['col1_num'] = df['col1'].apply(lambda x: np.where(df['col1'].unique()==x)[0][0])

If you care about the ordering, specify the categories as a list and use this:

df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third'].index(x))
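
The explicit ordering can also be expressed without a per-row lambda by passing the ordered list to pd.Categorical. This is a sketch of that alternative, not the method used above; the sample data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["second", "first", "third", "first"]})

# The position of each label in this list becomes its integer code.
order = ["first", "second", "third"]
df["col1_num"] = pd.Categorical(df["col1"], categories=order, ordered=True).codes

print(df)
```

Unlike list.index, which raises a ValueError for a label that is not in the list, this encodes unknown labels as -1.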
