Scikit-Learn's Pipeline: A sparse matrix was passed, but dense data is required
Original source: http://stackoverflow.com/questions/28384680/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow and keep it under the same license.
Asked by Ada Stra
I'm finding it difficult to understand how to fix a Pipeline I created (read: largely pasted from a tutorial). I'm using Python 3.4.2:
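import numpy
from pandas import DataFrame
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
# train holds the training records: column 0 is the text, column 1 the label
df = DataFrame.from_records(train)
test = [blah1, blah2, blah3]
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])
pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1]))
predicted = pipeline.predict(test)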
When I run it, I get:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
This is raised by the line pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1])).
I've experimented a lot with solutions through numpy, scipy, and so forth, but I still don't know how to fix it. And yes, similar questions have come up before, but not inside a pipeline. Where do I have to apply toarray or todense?
Accepted answer by David Maust
Unfortunately those two are incompatible. A CountVectorizer produces a sparse matrix and the RandomForestClassifier requires a dense matrix. It is possible to convert using X.todense(). Doing this will substantially increase your memory footprint.
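To make the memory trade-off concrete, here is a minimal sketch that compares the storage used by the sparse output of CountVectorizer with its dense equivalent. The toy corpus is hypothetical and not taken from the question; on a real vocabulary with tens of thousands of terms the gap is far larger than on these three sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not from the question
docs = ["the cat sat on the mat", "the dog chased the cat", "birds sing at dawn"]

X = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix in CSR format
dense = X.toarray()                        # plain numpy.ndarray of the same shape

# Bytes held by the three arrays that back the CSR matrix
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print(type(X).__name__, X.shape, "stored values:", X.nnz)
print("sparse bytes:", sparse_bytes, "dense bytes:", dense.nbytes)
```

The dense array stores every zero explicitly, so its size grows with documents times vocabulary size, while the sparse form grows only with the number of non-zero counts.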
Below is sample code to do this, based on http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, which allows you to call .todense() in a pipeline stage.
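from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; defined so the step can sit in a Pipeline
        return self

    def transform(self, X, y=None, **fit_params):
        # Convert the sparse matrix to a dense one
        return X.todense()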
Once you have your DenseTransformer, you are able to add it as a pipeline step.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
    ('classifier', RandomForestClassifier())
])
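For reference, a self-contained sketch of this approach that can be run end to end. It makes two assumptions that are not in the answer: the texts and labels are hypothetical toy data standing in for df[0] and df[1], and the transformer calls toarray() instead of todense(). toarray() returns a numpy.ndarray, whereas todense() returns a numpy.matrix, which recent scikit-learn releases no longer accept as input.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Pipeline step that turns a sparse matrix into a dense ndarray."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # toarray() yields numpy.ndarray; todense() would yield numpy.matrix
        return X.toarray()


# Hypothetical toy data standing in for df[0] (texts) and df[1] (labels)
texts = ["cheap pills online", "meeting at noon", "win cash now", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("to_dense", DenseTransformer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(texts, labels)
predicted = pipeline.predict(["cash pills", "noon meeting"])
print(predicted)
```

Inheriting from BaseEstimator as well is a design choice of this sketch: it gives the step get_params and set_params, so the pipeline can be cloned, for example during a grid search.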
Another option would be to use a classifier meant for sparse data, like LinearSVC.
from sklearn.svm import LinearSVC
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', LinearSVC())])
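A runnable sketch of the LinearSVC variant, assuming hypothetical toy texts and labels in place of the question's data. It also checks that the matrix handed to the classifier really is sparse, so no dense conversion step is involved.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, not from the question
texts = ["cheap pills online", "meeting at noon", "win cash now", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LinearSVC()),
])
pipeline.fit(texts, labels)

# The fitted vectorizer still emits a sparse matrix; LinearSVC consumes it as-is
features = pipeline.named_steps["vectorizer"].transform(texts)
print("sparse input to classifier:", issparse(features))

predicted = pipeline.predict(["cash pills", "noon meeting"])
print(predicted)
```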
Answer by JAB
You can convert a pandas Series to an array using the .values attribute.
pipeline.fit(df[0].values, df[1].values)
However, I think the issue here arises because CountVectorizer() returns a sparse matrix by default, which cannot be piped to the RF classifier. CountVectorizer() does have a dtype parameter, but it sets the element type of the returned matrix rather than making it dense. That said, you usually need some sort of dimensionality reduction to use random forests for text classification, because bag-of-words feature vectors are very long.
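As an illustration of both points, the sketch below first shows that changing dtype leaves the output sparse, then adds a dimensionality reduction step. TruncatedSVD is one possible choice and an assumption of this sketch; the answer does not name a specific method. It accepts sparse input and produces a dense, low-dimensional array that the random forest can consume. The data and n_components=2 are hypothetical values sized for the toy corpus.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not from the question
texts = ["cheap pills online", "meeting at noon", "win cash now", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

# dtype changes the element type only; the result is still a sparse matrix
X32 = CountVectorizer(dtype=np.float32).fit_transform(texts)
print(issparse(X32), X32.dtype)

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    # Reduces the sparse bag-of-words matrix to 2 dense components
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(texts, labels)
predicted = pipeline.predict(["cash pills", "noon meeting"])
print(predicted)
```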
Answer by Gilles Louppe
Random forests in scikit-learn 0.16-dev now accept sparse data.
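Assuming scikit-learn 0.16 or later is installed, the pipeline from the question should therefore work without any dense conversion step. A minimal sketch with hypothetical toy data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not from the question
texts = ["cheap pills online", "meeting at noon", "win cash now", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

# The sparse output of CountVectorizer goes straight into the forest
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(texts, labels)
predicted = pipeline.predict(["cash pills", "noon meeting"])
print(predicted)
```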
Answer by maxymoo
The tersest solution would be to use a FunctionTransformer to convert to dense: this automatically implements the fit, transform and fit_transform methods, as in David's answer. Additionally, if I don't need special names for my pipeline steps, I like to use the sklearn.pipeline.make_pipeline convenience function, which allows a more minimal description of the model:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

pipeline = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(lambda x: x.todense(), accept_sparse=True),
    RandomForestClassifier()
)
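A self-contained variant of this idea, offered as a sketch with a few assumptions: the data is hypothetical, the conversion uses toarray() so the classifier receives a numpy.ndarray instead of a numpy.matrix, and a named function (to_dense, a name chosen here for illustration) replaces the lambda so that the fitted pipeline can be pickled.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a SciPy sparse matrix to a dense numpy.ndarray."""
    return X.toarray()


# Hypothetical toy data, not from the question
texts = ["cheap pills online", "meeting at noon", "win cash now", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

pipeline = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(to_dense, accept_sparse=True),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipeline.fit(texts, labels)

# make_pipeline names each step after its lowercased class name
print(list(pipeline.named_steps))
predicted = pipeline.predict(["cash pills", "noon meeting"])
print(predicted)
```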
Answer by Max Kleiner
You can also add a TfidfTransformer step to this pipeline, together with the DenseTransformer:
pipelinex = Pipeline([
    ('bow', vectorizer),
    ('tfidf', TfidfTransformer()),
    ('to_dense', DenseTransformer()),
    ('classifier', classifier)
])