Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/21589177/

Time: 2020-09-13 21:39:59  Source: igfitidea

Using multiple features with scikit-learn

Tags: python, pandas, machine-learning, scikit-learn

Asked by James Daily

I'm working on text classification using scikit-learn. Things work well with a single feature, but introducing multiple features is giving me errors. I think the problem is that I'm not formatting the data in the way that the classifier expects.

For example, this works fine:

data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

But this:

data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

dies with

Traceback (most recent call last):
  File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
    classifier.fit(X_train, Y_train)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

during the preprocessing stage after classifier.fit() is called. I think the problem is the way I'm formatting the data, but I can't figure out how to get it right.

feature1 and feature2 are both English text strings, as is the target. I'm using LabelEncoder() to encode target, which seems to work fine.

Here's an example of what print data returns, to give you a sense of how it's formatted right now.

[['some short english text'
  'a paragraph of english text']
 ['some more short english text'
  'a second paragraph of english text']
 ['some more short english text'
  'a third paragraph of english text']]

Answered by ely

The particular error message suggests that some code expects a str (so that .lower may be called on it) but is instead receiving a whole array (probably a whole array of strs).
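
The mismatch described above can be reproduced with NumPy alone. This is a minimal sketch, assuming hypothetical data shaped like the question's `data` array; the helper name `first_item_type` is made up for illustration. It shows that iterating over a one-column selection yields strings, while iterating over the two-column array yields whole rows, which have no `.lower` method.

```python
import numpy as np

# Hypothetical data shaped like the question's `data` array:
# one row per document, two text columns per row.
data = np.array(
    [
        ["some short english text", "a paragraph of english text"],
        ["some more short english text", "a second paragraph of english text"],
    ],
    dtype=object,
)


def first_item_type(documents):
    """Return the type of the first item seen when iterating over `documents`."""
    return type(next(iter(documents)))


# One column: each item is a plain string, which has .lower().
single = data[:, 0]
print(first_item_type(single), hasattr(single[0], "lower"))

# Two columns: each item is a whole row (numpy.ndarray), which does not.
print(first_item_type(data), hasattr(data[0], "lower"))
```

Anything that walks the array item by item and calls `.lower()` on each item, as the traceback's text preprocessing step does, would therefore work for the first case and fail for the second.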

Can you edit the question to better describe the data and also post the full traceback, not just the small part with the named error?

In the meantime, you can also try

data = df[['feature1', 'feature2']].values

and

df['target'].values

instead of explicitly casting to np.ndarray yourself.

It looks to me as though a 1x1 array is being created whose single element is itself an ndarray.
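
One way to check a guess like this is to print the shape of the array and the types of its items. The sketch below assumes a small hypothetical DataFrame with the column names from the question; the values are invented for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical frame using the question's column names.
df = pd.DataFrame(
    {
        "feature1": ["some short english text", "some more short english text", "even more short text"],
        "feature2": ["a paragraph of english text", "a second paragraph", "a third paragraph"],
        "target": ["granted", "denied", "granted"],
    }
)

data = df[["feature1", "feature2"]].values
labels = df["target"].values

# Shape tells you how many rows and columns the classifier will receive.
print(data.shape, labels.shape)

# The item types tell you what a per-document step will be handed.
print(type(data[0]), type(data[0][0]))
```

For this hypothetical frame, `data` has one row per document and one column per feature, so each item produced by iterating over it is a row rather than a single string.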

Answered by Anthony De Meulemeester

If your text columns have the same encoder / transformer, merge the columns together.

data = np.append(df.feature1, df.feature2)
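
To see what merging does to the data, the sketch below uses a hypothetical three-row frame (the values are invented). `np.append` flattens both columns into a single 1-D array that is twice as long as the frame, so the labels would need to be repeated to match. Joining the two strings in each row is shown as an alternative that keeps one document per row; that variant is an assumption added here for comparison, not something the answer proposes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame using the question's column names.
df = pd.DataFrame(
    {
        "feature1": ["some short english text", "some more short english text", "even more short text"],
        "feature2": ["a paragraph of english text", "a second paragraph", "a third paragraph"],
    }
)

# As in the answer: all of feature1 followed by all of feature2, as one 1-D array.
stacked = np.append(df.feature1, df.feature2)
print(stacked.shape)

# Alternative (an assumption, not from the answer): one combined string per row,
# so the number of documents still matches the number of labels.
joined = (df.feature1 + " " + df.feature2).to_numpy()
print(joined.shape)
```

Either result is a 1-D array of strings, which is the input format that worked in the single-feature case from the question.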