pandas: Using multiple features in scikit-learn

Question

Asked by James Daily

I'm working on text classification using scikit-learn. Things work well with a single feature, but introducing multiple features is giving me errors. I think the problem is that I'm not formatting the data in the way that the classifier expects.

For example, this works fine:

data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

But this:

data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

dies with

Traceback (most recent call last):
  File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
    classifier.fit(X_train, Y_train)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

during the preprocessing stage after classifier.fit() is called. I think the problem is the way I'm formatting the data, but I can't figure out how to get it right.

feature1 and feature2 are both English text strings, as is the target. I'm using LabelEncoder() to encode target, which seems to work fine.

Here's an example of what print data returns, to give you a sense of how it's formatted right now.

[['some short english text'
  'a paragraph of english text']
 ['some more short english text'
  'a second paragraph of english text']
 ['some more short english text'
  'a third paragraph of english text']]

Answer 1

Answered by ely

The error message suggests that somewhere your code expects a str (so that .lower can be called) but is instead receiving a whole array (probably an array of strs).
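
As a rough illustration of that mismatch, the sketch below uses a small made-up array shaped like the printed data in the question (the sample strings and variable names are assumptions, not the asker's real data). Iterating over a 2-D array yields one ndarray per row, which has no .lower method, while a single column yields plain strings.

```python
import numpy as np

# Hypothetical stand-in for the question's data: two text columns per row.
data = np.array([
    ["some short english text", "a paragraph of english text"],
    ["some more short english text", "a second paragraph of english text"],
])

# A text vectorizer iterates over its input and treats each item as one document.
# With a 2-D array, each item is a whole row (an ndarray), not a string.
first_doc = next(iter(data))
is_row_array = isinstance(first_doc, np.ndarray)
has_lower = hasattr(first_doc, "lower")

# A single column is 1-D, so each item is a string with a .lower method.
first_cell = data[:, 0][0]
cell_has_lower = hasattr(first_cell, "lower")

print(is_row_array, has_lower, cell_has_lower)
```

This is consistent with the traceback, where the failure happens inside the vectorizer's preprocessing step rather than in the classifier itself.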

Can you edit the question to better describe the data and also post the full traceback, not just the small part with the named error?

In the meantime, you can also try

data = df[['feature1', 'feature2']].values

and

df['target'].values

instead of explicitly casting to np.ndarray yourself.

It looks to me like an array is being made where it is 1x1 and the singleton element in the "array" is itself an ndarray.
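
If the goal is to keep both text columns as separate inputs, one possible approach is ColumnTransformer from sklearn.compose, which is only available in newer scikit-learn releases (0.20 and later), not the version shown in the traceback. The sketch below is an assumption-laden example, not the asker's actual pipeline: the sample rows, the step names, and the choice of CountVectorizer and LogisticRegression are all made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical data with the same column names as the question.
df = pd.DataFrame({
    "feature1": ["motion to dismiss", "motion to compel",
                 "order granting dismissal", "order denying request"],
    "feature2": ["a paragraph about dismissal", "a paragraph about discovery",
                 "a paragraph about the ruling", "a paragraph about the denial"],
    "target": ["motion", "motion", "order", "order"],
})

label_encoder = LabelEncoder()
classes = label_encoder.fit_transform(df["target"])

# Passing a column name as a plain string (not a list) hands each
# vectorizer a 1-D sequence of strings, which is what it expects.
features = ColumnTransformer([
    ("f1", CountVectorizer(), "feature1"),
    ("f2", CountVectorizer(), "feature2"),
])

classifier = Pipeline([
    ("features", features),
    ("model", LogisticRegression()),
])

X = df[["feature1", "feature2"]]
classifier.fit(X, classes)
predictions = classifier.predict(X)
```

Each column gets its own vocabulary here, and the resulting count matrices are stacked side by side before reaching the model.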

Answer 2

Answered by Anthony De Meulemeester

If your text columns have the same encoder / transformer, merge the columns together.

data = np.append(df.feature1, df.feature2)
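
One thing to keep in mind when merging: np.append without an axis argument flattens both inputs into a single 1-D array, so N rows become 2N documents and the labels would have to be repeated to match. The sketch below, using made-up sample rows, contrasts that with joining the two columns per row, which keeps one document per label. The per-row join is an alternative reading of "merge the columns", not something stated in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same column names as the question.
df = pd.DataFrame({
    "feature1": ["some short english text", "some more short english text"],
    "feature2": ["a paragraph of english text", "a second paragraph of english text"],
    "target": ["a", "b"],
})

# np.append flattens and concatenates: 2 rows turn into 4 separate documents.
flat = np.append(df.feature1, df.feature2)

# Joining the strings row by row keeps one document per row and per label.
merged = (df["feature1"] + " " + df["feature2"]).to_numpy()

print(flat.shape, merged.shape)
```

With the per-row version, merged is a 1-D array of strings with the same length as the target column, so it can be passed to train_test_split alongside the encoded classes.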
