Note: this page reproduces a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow post: http://stackoverflow.com/questions/38677615/

Convert strings in column into categorical variable

Tags: python, string, pandas, statistics, categorical-data

Asked by Provisional.Modulation

I'd like to transform columns filled with strings into categorical variables so that I could run statistics. However, I am having difficulty with this transformation because I'm fairly new to Python.

Here is a sample of my code:

import numpy as np
import pandas as pd

# Open txt file and provide column names
data = pd.read_csv('sample.txt', sep="\t", header = None,
                   names = ["Label", "I1", "I2", "C1", "C2"])
# Convert I1 and I2 to continuous, numeric variables
data = data.apply(lambda x: pd.to_numeric(x, errors='ignore'))
# Convert Label, C1, and C2 to categorical variables
data["Label"] = pd.factorize(data.Label)[0]
data["C1"] = pd.factorize(data.C1)[0]
data["C2"] = pd.factorize(data.C2)[0]

# Split the predictors into training/testing sets
predictors = data.drop('Label', axis=1)
msk = np.random.rand(len(predictors)) < 0.8
predictors_train = predictors[msk]
predictors_test = predictors[~msk]

# Split the response variable into training/testing sets
response = data['Label']
# Reuse the same mask so that predictor and response rows stay aligned
response_train = response[msk]
response_test = response[~msk]

# Logistic Regression
from sklearn import linear_model
# Create logistic regression object
lr = linear_model.LogisticRegression()

# Train the model using the training sets
lr.fit(predictors_train, response_train)

However, I'd get this error:

ValueError: could not convert string to float: 'ec26ad35'

The ec26ad35 value is a string from the categorical variables C1 and C2. I'm not sure what's going on: didn't I already convert the strings into categorical variables? Why does the error say that they're still strings?

Using data.head(30), this is my data:

>> data[["Label", "I1", "I2", "C1", "C2"]].head(30)
    Label   I1   I2        C1        C2
0       0  1.0    1  68fd1e64  80e26c9b
1       0  2.0    0  68fd1e64  f0cf0024
2       0  2.0    0  287e684f  0a519c5c
3       0  NaN  893  68fd1e64  2c16a946
4       0  3.0   -1  8cf07265  ae46a29d
5       0  NaN   -1  05db9164  6c9c9cf3
6       0  NaN    1  439a44a4  ad4527a2
7       1  1.0    4  68fd1e64  2c16a946
8       0  NaN   44  05db9164  d833535f
9       0  NaN   35  05db9164  510b40a5
10      0  NaN    2  05db9164  0468d672
11      0  0.0    6  05db9164  9b5fd12f
12      1  0.0   -1  241546e0  38a947a1
13      1  NaN    2  be589b51  287130e0
14      0  0.0   51  5a9ed9b0  80e26c9b
15      0  NaN    2  05db9164  bc6e3dc1
16      1  1.0  987  68fd1e64  38d50e09
17      0  0.0    1  8cf07265  7cd19acc
18      0  0.0   24  05db9164  f0cf0024
19      0  7.0  102  3c9d8785  b0660259
20      1  NaN   47  1464facd  38a947a1
21      0  0.0    1  05db9164  09e68b86
22      0  NaN    0  05db9164  38a947a1
23      0  NaN    9  05db9164  08d6d899
24      0  0.0    1  5a9ed9b0  3df44d94
25      0  NaN    4  5a9ed9b0  09e68b86
26      1  0.0    1  8cf07265  942f9a8d
27      1  0.0   20  68fd1e64  38a947a1
28      1  0.0   78  68fd1e64  1287a654
29      1  3.0    0  05db9164  90081f33

Edit: Included the error I get from imputing missing data after splitting the dataframe into training and testing sets. I'm not sure what's going on here either.

# Impute missing data
>> from sklearn.preprocessing import Imputer
>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>> predictors_train = imp.fit_transform(predictors_train)
TypeError: float() argument must be a string or a number, not 'function'

Answered by Ami Tavory

As @ayhan noted in the comments, you probably want to use dummy variables here. This is because it seems highly unlikely from your data that there is any real ordering in your text labels.

This can easily be done via pandas.get_dummies, e.g.:

pd.get_dummies(df.C1)

Note that this returns a regular DataFrame:

>>> pd.get_dummies(df.C1).columns
Index([u'05db9164', u'1464facd', u'241546e0', u'287e684f', u'3c9d8785',
     u'439a44a4', u'5a9ed9b0', u'68fd1e64', u'8cf07265', u'be589b51'],
     dtype='object')

You'd therefore probably want to combine it with the rest of your columns using a horizontal concat.
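
A minimal sketch of that horizontal concat, assuming a small frame built from rows 0, 2, 5 and 7 of the question's sample (the names df, dummies and encoded, and the "C1" prefix, are only illustrative):

```python
import pandas as pd

# Hypothetical miniature of the question's data (rows 0, 2, 5 and 7 of the sample).
df = pd.DataFrame({
    "Label": [0, 0, 0, 1],
    "I2": [1, 0, -1, 4],
    "C1": ["68fd1e64", "287e684f", "05db9164", "68fd1e64"],
})

# One indicator column per distinct C1 value; the prefix keeps the origin visible.
dummies = pd.get_dummies(df["C1"], prefix="C1")

# axis=1 places the indicator columns next to the remaining columns,
# aligned on the row index; the original string column is dropped.
encoded = pd.concat([df.drop(columns="C1"), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

After this step no string column should remain in encoded, so an estimator that expects numeric input no longer sees values such as '68fd1e64'. The same pattern would need to be repeated for C2 (or get_dummies can be called on the whole frame with its columns argument).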

If you are actually looking to transform the labels into something numeric (which does not seem likely), you might look at sklearn.preprocessing.LabelEncoder.
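
For completeness, a short sketch of what LabelEncoder does, assuming a hypothetical series c1 holding a few C1 values from the question. Note that the resulting integers carry an arbitrary order (the sorted order of the strings), which is why dummy variables are usually the safer choice for features; scikit-learn documents LabelEncoder as intended for target labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of C1 values taken from the question's data.
c1 = pd.Series(["68fd1e64", "287e684f", "05db9164", "68fd1e64"])

encoder = LabelEncoder()
# Each distinct string is mapped to an integer, numbered in sorted order.
codes = encoder.fit_transform(c1)

print(codes)                             # one integer code per row
print(encoder.classes_)                  # the string each code stands for
print(encoder.inverse_transform(codes))  # recovers the original strings
```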
