Python: Converting categorical values to binary values using Pandas

Question

Asked by Rkz

I am trying to convert categorical values into binary values using pandas. The idea is to treat every unique categorical value as a feature (i.e. a column) and put 1 or 0 in it depending on whether a particular object (i.e. a row) was assigned to that category. This is the code:

import numpy
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV

data = pd.read_csv('somedata.csv')
converted_val = data.T.to_dict().values()
vectorizer = DV(sparse=False)
vec_x = vectorizer.fit_transform(converted_val)
numpy.savetxt('out.csv', vec_x, fmt='%10.0f', delimiter=',')

My question is: how can I save this converted data together with the column names? In the code above I can save the data with the numpy.savetxt function, but that only writes the array, so the column names are lost. Alternatively, is there a more efficient way to perform this operation?

Answer 1

Accepted answer by YS-L

It seems that you are using scikit-learn's DictVectorizer to convert the categorical values to binary. In that case, to store the result along with the new column names, you can construct a new DataFrame with values from vec_x and columns from vectorizer.get_feature_names_out() (get_feature_names() in scikit-learn releases before 1.0). Then store the DataFrame to disk (e.g. with to_csv()) instead of the numpy array.

Alternatively, pandas can do the encoding directly with the get_dummies function:

import pandas as pd
data = pd.DataFrame({'T': ['A', 'B', 'C', 'D', 'E']})
res = pd.get_dummies(data)
res.to_csv('output.csv')
print(res)

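Output (as printed by pandas versions before 2.0; newer versions show True/False unless dtype=int is passed to get_dummies):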

   T_A  T_B  T_C  T_D  T_E
0    1    0    0    0    0
1    0    1    0    0    0
2    0    0    1    0    0
3    0    0    0    1    0
4    0    0    0    0    1
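As a rough check that the column names survive the round trip to CSV, the sketch below writes the encoded frame and reads it back. The in-memory buffer is an assumption standing in for 'output.csv', and dtype=int is passed so the result is 0/1 on current pandas versions as well:

```python
import io

import pandas as pd

data = pd.DataFrame({"T": ["A", "B", "C", "D", "E"]})
# dtype=int forces 0/1 values instead of booleans.
res = pd.get_dummies(data, dtype=int)

# The buffer stands in for a file such as 'output.csv'.
buffer = io.StringIO()
res.to_csv(buffer, index=False)  # index=False leaves out the row-number column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

Without index=False, to_csv also writes the row index as an unnamed first column, which shows up as an extra column when the file is read back.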

Answer 2

Answer by YS-L

You mean "one-hot" encoding?

Say you have the following dataset:

import pandas as pd
df = pd.DataFrame([
            ['green', 1, 10.1, 0], 
            ['red', 2, 13.5, 1], 
            ['blue', 3, 15.3, 0]])

df.columns = ['color', 'size', 'prize', 'class label']
df

Now, you have multiple options ...

A) The Tedious Approach

color_mapping = {
           'green': (0,0,1),
           'red': (0,1,0),
           'blue': (1,0,0)}

df['color'] = df['color'].map(color_mapping)
df

import numpy as np
y = df['class label'].values
X = df.iloc[:, :-1].values
X = np.apply_along_axis(func1d= lambda x: np.array(list(x[0]) + list(x[1:])), axis=1, arr=X)

print('Class labels:', y)
print('\nFeatures:\n', X)

Yielding:

Class labels: [0 1 0]

Features:
 [[  0.    0.    1.    1.   10.1]
 [  0.    1.    0.    2.   13.5]
 [  1.    0.    0.    3.   15.3]]
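The hand-written mapping above could also be generated from the data. The following is only a sketch of that idea, not the answer's method: it assigns columns in alphabetical order of the category names, so the bit positions differ from the tuples chosen by hand in color_mapping.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["green", "red", "blue"])

# Alphabetical order decides which column each category gets.
categories = sorted(colors.unique())
codes = colors.map({c: i for i, c in enumerate(categories)}).to_numpy()

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories), dtype=int)[codes]
print(categories)
print(one_hot)
```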

B) Scikit-learn's `DictVectorizer`

from sklearn.feature_extraction import DictVectorizer
dvec = DictVectorizer(sparse=False)

X = dvec.fit_transform(df.transpose().to_dict().values())
X

Yielding:

array([[  0. ,   0. ,   1. ,   0. ,  10.1,   1. ],
       [  1. ,   0. ,   0. ,   1. ,  13.5,   2. ],
       [  0. ,   1. ,   0. ,   0. ,  15.3,   3. ]])
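The array by itself does not say which column is which. As a sketch, the fitted vectorizer's feature_names_ attribute can be printed to read the column order; the frame below rebuilds the answer's original dataset (before the color column was replaced by tuples in approach A), and 'prize' is the answer's own column name:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame(
    [["green", 1, 10.1, 0], ["red", 2, 13.5, 1], ["blue", 3, 15.3, 0]],
    columns=["color", "size", "prize", "class label"],
)

dvec = DictVectorizer(sparse=False)
X = dvec.fit_transform(df.to_dict(orient="records"))

# feature_names_ gives the meaning of each column of X, in order
# (DictVectorizer sorts feature names alphabetically by default).
for position, name in enumerate(dvec.feature_names_):
    print(position, name)
```

Because the whole frame was vectorized, the first column of X is 'class label'; dropping that column from df before fitting would keep the target out of the feature matrix.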

C) Pandas' `get_dummies`

pd.get_dummies(df)
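The answer does not show the result of this call. As a sketch on the same example frame, only the string column is expanded while the numeric columns are left as they are; dtype=int is an added assumption so the indicator columns hold 0/1 on current pandas versions:

```python
import pandas as pd

# Same example frame as in the answer ('prize' is the answer's own column name).
df = pd.DataFrame(
    [["green", 1, 10.1, 0], ["red", 2, 13.5, 1], ["blue", 3, 15.3, 0]],
    columns=["color", "size", "prize", "class label"],
)

# Numeric columns pass through unchanged; 'color' becomes one column per value.
dummies = pd.get_dummies(df, dtype=int)
print(dummies)
```

The new indicator columns are appended after the untouched ones and are named with the original column as prefix, e.g. color_blue.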
