pandas: convert strings in a column into a categorical variable
Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/38677615/
Convert strings in column into categorical variable
Asked by Provisional.Modulation
I'd like to transform columns filled with strings into categorical variables so that I could run statistics. However, I am having difficulty with this transformation because I'm fairly new to Python.
Here is a sample of my code:
import numpy as np
import pandas as pd
from sklearn import linear_model

# Open txt file and provide column names
data = pd.read_csv('sample.txt', sep="\t", header=None,
                   names=["Label", "I1", "I2", "C1", "C2"])

# Convert I1 and I2 to continuous, numeric variables
data = data.apply(lambda x: pd.to_numeric(x, errors='ignore'))

# Convert Label, C1, and C2 to categorical variables
data["Label"] = pd.factorize(data.Label)[0]
data["C1"] = pd.factorize(data.C1)[0]
data["C2"] = pd.factorize(data.C2)[0]

# Split the predictors into training/testing sets
predictors = data.drop('Label', axis=1)
msk = np.random.rand(len(predictors)) < 0.8
predictors_train = predictors[msk]
predictors_test = predictors[~msk]

# Split the response variable into training/testing sets
# (reuse msk so each response row stays paired with its predictor row)
response = data['Label']
response_train = response[msk]
response_test = response[~msk]

# Logistic Regression: create the model object
lr = linear_model.LogisticRegression()

# Train the model using the training sets
lr.fit(predictors_train, response_train)
However, I'd get this error:
ValueError: could not convert string to float: 'ec26ad35'
The ec26ad35 value is a string from the categorical variables C1 and C2. I'm not sure what's going on: Didn't I already convert the strings into categorical variables? Why does the error say that they're still strings?
Using data.head(30), this is my data:
>> data[["Label", "I1", "I2", "C1", "C2"]].head(30)
    Label   I1   I2        C1        C2
0       0  1.0    1  68fd1e64  80e26c9b
1       0  2.0    0  68fd1e64  f0cf0024
2       0  2.0    0  287e684f  0a519c5c
3       0  NaN  893  68fd1e64  2c16a946
4       0  3.0   -1  8cf07265  ae46a29d
5       0  NaN   -1  05db9164  6c9c9cf3
6       0  NaN    1  439a44a4  ad4527a2
7       1  1.0    4  68fd1e64  2c16a946
8       0  NaN   44  05db9164  d833535f
9       0  NaN   35  05db9164  510b40a5
10      0  NaN    2  05db9164  0468d672
11      0  0.0    6  05db9164  9b5fd12f
12      1  0.0   -1  241546e0  38a947a1
13      1  NaN    2  be589b51  287130e0
14      0  0.0   51  5a9ed9b0  80e26c9b
15      0  NaN    2  05db9164  bc6e3dc1
16      1  1.0  987  68fd1e64  38d50e09
17      0  0.0    1  8cf07265  7cd19acc
18      0  0.0   24  05db9164  f0cf0024
19      0  7.0  102  3c9d8785  b0660259
20      1  NaN   47  1464facd  38a947a1
21      0  0.0    1  05db9164  09e68b86
22      0  NaN    0  05db9164  38a947a1
23      0  NaN    9  05db9164  08d6d899
24      0  0.0    1  5a9ed9b0  3df44d94
25      0  NaN    4  5a9ed9b0  09e68b86
26      1  0.0    1  8cf07265  942f9a8d
27      1  0.0   20  68fd1e64  38a947a1
28      1  0.0   78  68fd1e64  1287a654
29      1  3.0    0  05db9164  90081f33
Edit: Included the error I get when imputing missing data after splitting the dataframes into training and testing data sets. I'm not sure what's going on here either.
# Impute missing data
>> from sklearn.preprocessing import Imputer
>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>> predictors_train = imp.fit_transform(predictors_train)
TypeError: float() argument must be a string or a number, not 'function'
Answered by Ami Tavory
As @ayhan noted in the comments, you probably want to use dummy variables here. This is because it seems highly unlikely from your data that there is really any ordering in your text labels.
This can easily be done via pandas.get_dummies, e.g.:
pd.get_dummies(df.C1)
Note that this returns a regular DataFrame:
>>> pd.get_dummies(df.C1).columns
Index([u'05db9164', u'1464facd', u'241546e0', u'287e684f', u'3c9d8785',
u'439a44a4', u'5a9ed9b0', u'68fd1e64', u'8cf07265', u'be589b51'],
dtype='object')
You'd therefore probably want to combine this with a horizontal concat.
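A minimal sketch of the get_dummies plus horizontal concat approach. It assumes a small hand-built frame named df whose values are copied from a few rows of the question's data.head(30) output; the prefix names and the variable names dummies and encoded are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Small sample shaped like the question's data
# (rows 0, 1, 2 and 7 of the data.head(30) output, without I2).
df = pd.DataFrame({
    "Label": [0, 0, 0, 1],
    "I1": [1.0, 2.0, 2.0, 1.0],
    "C1": ["68fd1e64", "68fd1e64", "287e684f", "68fd1e64"],
    "C2": ["80e26c9b", "f0cf0024", "0a519c5c", "2c16a946"],
})

# One indicator column per distinct string; the prefix keeps
# columns derived from C1 and C2 apart.
dummies = pd.get_dummies(df[["C1", "C2"]], prefix=["C1", "C2"])

# Horizontal concat: drop the original string columns and
# append the indicator columns alongside the remaining ones.
encoded = pd.concat([df.drop(columns=["C1", "C2"]), dummies], axis=1)

print(encoded.dtypes)
```

After this step no column holds strings any more, so the frame can be passed to an estimator that expects numeric input. The exact dtype of the indicator columns depends on the pandas version.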
If you are actually looking to transform the labels into something numeric (which does not seem likely), you might look at sklearn.preprocessing.LabelEncoder.
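For completeness, a short sketch of what LabelEncoder does, assuming a hand-picked list of C1 values from the question; the variable names are illustrative. Note that the resulting integers follow the sorted order of the strings, so they carry an ordering that the data itself does not have.

```python
from sklearn.preprocessing import LabelEncoder

# A few C1 values taken from the question's sample data.
labels = ["68fd1e64", "287e684f", "68fd1e64", "05db9164"]

encoder = LabelEncoder()

# fit_transform learns the distinct classes (stored sorted in
# encoder.classes_) and maps each label to its class index.
codes = encoder.fit_transform(labels)

# inverse_transform maps the integer codes back to the strings.
original = encoder.inverse_transform(codes)

print(list(encoder.classes_))
print(codes.tolist())
```

Because the codes are arbitrary integers, a linear model would treat 2 as "larger" than 0, which is why dummy variables are usually the better fit for unordered labels like these.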