pandas TypeError: Expected sequence or array-like, got estimator

Question

Asked by Deepak Puthraya

I am working on a project that has user reviews on products. I am using TfidfVectorizer to extract features from my dataset apart from some other features that I have extracted manually.

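import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_validation import train_test_split

df = pd.read_csv('reviews.csv', header=0)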

FEATURES = ['feature1', 'feature2']
reviews = df['review']
reviews = reviews.values.flatten()

vectorizer = TfidfVectorizer(min_df=1, decode_error='ignore', ngram_range=(1, 3), stop_words='english', max_features=45)

X = vectorizer.fit_transform(reviews)
idf = vectorizer.idf_
features = vectorizer.get_feature_names()
FEATURES += features
inverse = vectorizer.inverse_transform(X)

for i, row in df.iterrows():
    for f in features:
        df.set_value(i, f, False)
    for inv in inverse[i]:
        df.set_value(i, inv, True)

train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)

The above code works fine. But when I change max_features from 45 to anything higher, I get an error on the train_test_split line.

The error is:

Traceback (most recent call last):
  File "analysis.py", line 120, in <module>
    train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1906, in train_test_split
    arrays = indexable(*arrays)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
    uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 112, in _num_samples
    'estimator %s' % x)
TypeError: Expected sequence or array-like, got estimator

I am not sure what exactly changes when I increase max_features.

Let me know if you need more data or if I have missed something.

Answer 1

Answered by elz

I know this is old, but I had the same issue and while the answer from @shahins works, I wanted something that would keep the dataframe object so I can have my indexing in the train/test splits.

Solution:

Rename the dataframe column "fit" to something (anything) else:

df = df.rename(columns = {'fit': 'fit_feature'})

Why it works:

It isn't actually the number of features that is the issue; one feature in particular is causing the problem. I'm guessing you are getting the word "fit" as one of your text features (and it didn't show up with the lower max_features threshold).
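
If more generated terms might clash with attribute names, one option is to prefix every TF-IDF column instead of renaming only "fit". This is a minimal sketch and not part of the original answer: the term list, the toy dataframe, and the "tfidf_" prefix are all made up for illustration.

```python
import pandas as pd

# Hypothetical terms standing in for the vectorizer's feature names.
features = ["fit", "great", "size"]

# Toy dataframe in place of reviews.csv, with one boolean column per term.
df = pd.DataFrame({"review": ["great fit", "wrong size"]})
for f in features:
    df[f] = False

# Prefix every generated column so none of them can be mistaken for an
# estimator's fit method. "tfidf_" is an arbitrary choice.
prefixed = df.rename(columns={f: "tfidf_" + f for f in features})

print(list(prefixed.columns))
```

The result is still a dataframe, so the index is kept for the train/test split.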

Looking at the sklearn source code, it checks to make sure you are not passing an sklearn estimator by testing whether any of your objects has a "fit" attribute. The code is checking for the fit method of an sklearn estimator, but it will also raise an exception when your dataframe has a "fit" column (remember that df.fit and df['fit'] both select the "fit" column).
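
The collision can be reproduced with pandas alone. This is a rough sketch: looks_like_estimator is a hypothetical, simplified stand-in for the check described above, not sklearn's actual code, and the toy dataframe is made up for illustration.

```python
import pandas as pd


def looks_like_estimator(obj):
    # Simplified stand-in for the "does it have a fit attribute" test;
    # this is an assumption, not sklearn's real implementation.
    return hasattr(obj, "fit")


# Toy dataframe; the "fit" column mimics a TF-IDF term that is the word "fit".
df = pd.DataFrame({"review": ["good fit", "too small"], "fit": [True, False]})

# pandas exposes columns as attributes, so df.fit resolves to the column.
flagged_before = looks_like_estimator(df)

# After renaming the column, the dataframe no longer has a fit attribute.
renamed = df.rename(columns={"fit": "fit_feature"})
flagged_after = looks_like_estimator(renamed)

print(flagged_before, flagged_after)  # True False
```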

Answer 2

Answered by shahins

I had this issue and I tried something like this and it worked for me:

train_test_split(df.as_matrix(), test_size = 0.2, random_state=700)
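
Note that as_matrix() was deprecated and then removed in pandas 1.0. On newer versions, to_numpy() should give the same kind of plain array. The sketch below uses a made-up toy dataframe and shows only the conversion; the resulting array is what you would then pass to train_test_split.

```python
import pandas as pd

# Toy dataframe with a "fit" column, standing in for the real feature table.
df = pd.DataFrame({"fit": [True, False, True], "great": [False, True, True]})

# to_numpy() returns a plain NumPy array: column names and the index are
# dropped, so there is no longer a "fit" attribute to trip the check.
arr = df.to_numpy()

print(arr.shape, hasattr(arr, "fit"))  # (3, 2) False
```

The trade-off is the one the first answer mentions: the array does not carry the dataframe's index into the train/test splits.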

Answer 3

Answered by Pratibha

train_test_split(x.as_matrix(), y.as_matrix(), test_size=0.2, random_state=0)

This worked for me.
