Python Scikit-Learn pipeline: a sparse matrix was passed, but dense data is required

Question

Asked by Ada Stra

I'm finding it difficult to understand how to fix a Pipeline I created (read: largely pasted from a tutorial). It's python 3.4.2:

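import numpy
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

df = pd.DataFrame.from_records(train)

test = [blah1, blah2, blah3]

pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])

pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1]))
predicted = pipeline.predict(test)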

When I run it, I get:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

This is for the line pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1])).

I've experimented a lot with solutions through numpy, scipy, and so forth, but I still don't know how to fix it. And yes, similar questions have come up before, but not inside a pipeline. Where is it that I have to apply toarray or todense?

Answer 1

Accepted answer by David Maust

Unfortunately those two are incompatible. A CountVectorizer produces a sparse matrix and the RandomForestClassifier requires a dense matrix. It is possible to convert using X.todense(). Doing this will substantially increase your memory footprint, because every zero entry is then stored explicitly.
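
To make the memory point concrete, here is a minimal sketch that compares how many values the sparse matrix stores with how many cells the dense version needs. The three documents are made up for illustration and are not from the question.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, invented for this example.
docs = ["the cat sat", "the dog barked", "a bird sang loudly"]

X = CountVectorizer().fit_transform(docs)

is_sparse = issparse(X)                   # CountVectorizer returns a SciPy sparse matrix
stored_values = X.nnz                     # only the non-zero counts are stored
dense_cells = X.shape[0] * X.shape[1]     # a dense array stores every cell, zeros included

dense = X.toarray()                       # explicit conversion to a dense numpy array

print("sparse:", is_sparse)
print("stored values:", stored_values, "of", dense_cells, "cells")
print("dense shape:", dense.shape)
```

On a real corpus the vocabulary usually runs to many thousands of columns while each document uses only a few of them, so the gap between stored values and dense cells grows quickly.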

Below is sample code, based on http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, which allows you to call .todense() in a pipeline stage.

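from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    """Pipeline step that converts a sparse matrix to a dense one."""

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; the step is stateless.
        return self

    def transform(self, X, y=None, **fit_params):
        return X.todense()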

Once you have your DenseTransformer, you are able to add it as a pipeline step.

pipeline = Pipeline([
     ('vectorizer', CountVectorizer()), 
     ('to_dense', DenseTransformer()), 
     ('classifier', RandomForestClassifier())
])
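
To check that a dense step like this works end to end, here is a minimal, self-contained sketch on made-up data. The texts and labels are invented for illustration. It also uses .toarray() instead of .todense(), on the assumption that newer scikit-learn releases reject numpy.matrix input; on the versions discussed in this thread either call should behave the same.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Stateless step that turns a SciPy sparse matrix into a dense ndarray."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # toarray() returns a numpy.ndarray; todense() would return a numpy.matrix.
        return X.toarray()


# Hypothetical training data, invented for this example.
texts = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch meeting with the team",
]
labels = ["spam", "spam", "ham", "ham"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("to_dense", DenseTransformer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

pipeline.fit(texts, labels)
predicted = pipeline.predict(["free prize money"])
print(predicted)
```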

Another option would be to use a classifier meant for sparse data like LinearSVC.

from sklearn.svm import LinearSVC
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', LinearSVC())])

Answer 2

Answer by JAB

You can convert a pandas Series to an array using the .values attribute.

pipeline.fit(df[0].values, df[1].values)

However, I think the issue here happens because CountVectorizer() returns a sparse matrix by default, which cannot be piped to the RF classifier. CountVectorizer() does have a dtype parameter, but it sets the element type of the returned matrix rather than making it dense. That said, you usually need some sort of dimensionality reduction to use random forests for text classification, because bag-of-words feature vectors are very long.
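
One way to do the dimensionality reduction mentioned above is TruncatedSVD, which accepts sparse input and returns a dense array, so no separate conversion step is needed. This is only a sketch under assumptions: the answer does not name a technique, the data is made up, and n_components=2 is chosen just to fit the toy corpus.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical training data, invented for this example.
texts = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch meeting with the team",
]
labels = ["spam", "spam", "ham", "ham"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    # Projects the sparse bag-of-words counts onto 2 dense components.
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

pipeline.fit(texts, labels)
predicted = pipeline.predict(["free prize money"])
print(predicted)
```

On real text you would pick a much larger n_components, typically in the low hundreds, and tune it like any other hyperparameter.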

Answer 3

Answer by Gilles Louppe

Random forests in scikit-learn 0.16-dev now accept sparse data.
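
Assuming a scikit-learn release with that change (0.16 or later), the original two-step pipeline should fit without any dense conversion. A minimal sketch on made-up data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical training data, invented for this example.
texts = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch meeting with the team",
]
labels = ["spam", "spam", "ham", "ham"]

# No to_dense step: the forest receives the sparse matrix directly.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

pipeline.fit(texts, labels)
predicted = pipeline.predict(["lunch with the team tomorrow"])
print(predicted)
```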

Answer 4

Answer by maxymoo

The tersest solution is to use a FunctionTransformer to convert to dense: this automatically implements the fit, transform and fit_transform methods as in David's answer. Additionally, if I don't need special names for my pipeline steps, I like to use the sklearn.pipeline.make_pipeline convenience function to enable a more minimalist language for describing the model:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

pipeline = make_pipeline(
     CountVectorizer(),
     FunctionTransformer(lambda x: x.todense(), accept_sparse=True),
     RandomForestClassifier()
)

Answer 5

Answer by Max Kleiner

With this pipeline, add a TfidfTransformer step as well:

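from sklearn.feature_extraction.text import TfidfTransformer
# vectorizer and classifier are instances you have already created;
# DenseTransformer is the class from the accepted answer.
pipelinex = Pipeline([('bow', vectorizer),
                      ('tfidf', TfidfTransformer()),
                      ('to_dense', DenseTransformer()),
                      ('classifier', classifier)])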
