Python - What exactly is sklearn.pipeline.Pipeline?

Question

Asked by farhawa

I can't figure out how sklearn.pipeline.Pipeline works exactly.

There are a few explanations in the docs. For example, what do they mean by:

Pipeline of transforms with a final estimator.

To make my question clearer: what are steps? How do they work?

Edit

Thanks to the answers I can make my question clearer:

When I create a Pipeline and pass two transformers and one estimator as steps, e.g.:

pipln = Pipeline([("trsfm1",transformer_1),
                  ("trsfm2",transformer_2),
                  ("estmtr",estimator)])

What happens when I call this?

pipln.fit()
OR
pipln.fit_transform()

I can't figure out how an estimator can be a transformer and how a transformer can be fitted.

Answer 1

Accepted answer by Ibraim Ganiev

Transformer in scikit-learn: a class that has fit and transform methods, or a fit_transform method.

Predictor: a class that has fit and predict methods, or a fit_predict method.

Pipeline is just an abstract notion; it's not an existing ML algorithm. Often in ML tasks you need to perform a sequence of different transformations (find a set of features, generate new features, select only some good features) on the raw dataset before applying the final estimator.

Here is a good example of Pipeline usage. Pipeline gives you a single interface for all 3 steps: the two transformations and the resulting estimator. It encapsulates the transformers and predictors inside, so you can replace something like:

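    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDClassifier

    # Xtrain/Xtest are raw documents; ytrain holds the training labels
    vect = CountVectorizer()
    tfidf = TfidfTransformer()
    clf = SGDClassifier()

    # Fit every step on the training set
    vX = vect.fit_transform(Xtrain)
    tfidfX = tfidf.fit_transform(vX)
    predicted = clf.fit(tfidfX, ytrain).predict(tfidfX)

    # Now evaluate all steps on test set: only transform/predict, no refitting
    vX = vect.transform(Xtest)
    tfidfX = tfidf.transform(vX)
    predicted = clf.predict(tfidfX)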

With just:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
# ytrain holds the training labels
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)

With pipelines you can easily perform a grid search over a set of parameters for each step of this meta-estimator, as described in the example linked above. All steps except the last one must be transformers; the last step can be a transformer or a predictor.

Answer to edit: When you call pipln.fit(), each transformer inside the pipeline is fitted on the output of the previous transformer (the first transformer is fitted on the raw dataset), and the final estimator is fitted on the output of the last transformer. The last estimator may be a transformer or a predictor. You can call fit_transform() on the pipeline only if the last estimator is a transformer (one that implements fit_transform, or fit and transform separately), and you can call fit_predict() or predict() on the pipeline only if the last estimator is a predictor. So you can't call fit_transform or transform on a pipeline whose last step is a predictor.
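To make the last point concrete, here is a minimal sketch that checks which methods a pipeline exposes depending on its final step. The tiny dataset and the step names are made up for illustration, and the choice of LogisticRegression, StandardScaler and MinMaxScaler is arbitrary.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Tiny made-up dataset, only for illustration.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [0, 0, 1, 1]

# Last step is a predictor: predict() is available, transform() is not.
clf_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
clf_pipe.fit(X, y)
can_predict = hasattr(clf_pipe, "predict")
can_transform = hasattr(clf_pipe, "transform")

# Last step is a transformer: fit_transform() is available, predict() is not.
prep_pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("rescale", MinMaxScaler()),
])
X_new = prep_pipe.fit_transform(X)
prep_can_predict = hasattr(prep_pipe, "predict")

print(can_predict, can_transform, prep_can_predict)
```

The pipeline only delegates a method when its final step provides it, which is why hasattr is a convenient way to inspect this.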

Answer 2

Answer by NBartley

I think that M0rkHaV has the right idea. Scikit-learn's Pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc.). Let's break down the two major components:

Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you'll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC! (Newer scikit-learn releases no longer expose transform() on such estimators directly; there you wrap the estimator in SelectFromModel instead.)
Estimators are classes that implement both fit() and predict(). You'll find that many of the classifiers and regression models implement both these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn't necessarily implement predict(), but definitely implements fit()). All this means is that you wouldn't be able to call predict().
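As a rough illustration of what "implements fit() and transform()" means, here is a small hand-written transformer. MeanCenterer is a hypothetical class invented for this sketch, not part of scikit-learn; it only relies on BaseEstimator and TransformerMixin, the latter of which supplies fit_transform() for free.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # The learned state lives in an attribute ending with an underscore.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


X_train = [[1.0, 10.0], [3.0, 30.0]]
centerer = MeanCenterer()
# fit_transform() comes from TransformerMixin: it calls fit() and then transform().
centered = centerer.fit_transform(X_train)
print(centered)
```

Any object with this shape can be used as a non-final pipeline step; a final estimator would define predict() instead of (or in addition to) transform().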

As for your edit: let's go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.

from sklearn.preprocessing import LabelBinarizer

bin = LabelBinarizer()  # first we initialize

vec = ['cat', 'dog', 'dog', 'dog']  # the label list we want binarized

Now, when the binarizer is fitted on some data, it will have an attribute called classes_ that contains the unique classes that the transformer 'knows' about. Without calling fit(), the binarizer has no idea what the data looks like, so calling transform() wouldn't make any sense. You can see this if you print out the list of classes before fitting the data.

print(bin.classes_)

I get the following error when trying this:

AttributeError: 'LabelBinarizer' object has no attribute 'classes_'

But when you fit the binarizer on the vec list:

bin.fit(vec)

and try again:

print(bin.classes_)

I get the following:

['cat' 'dog']


And now, after calling transform on the vec object:

print(bin.transform(vec))

we get the following:

[[0]
 [1]
 [1]
 [1]]

As for estimators being used as transformers, let us use the DecisionTree classifier as an example of a feature extractor. Decision Trees are great for a lot of reasons, but for our purposes, what's important is that they have the ability to rank the features that the tree found useful for predicting. When you call transform() on a Decision Tree (available in older scikit-learn versions; newer versions provide the same behavior through SelectFromModel), it will take your input data and find what it thinks are the most important features. So you can think of it as transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the Decision Tree found.
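Here is one way this n-by-m to n-by-k reduction might look with SelectFromModel wrapped around a decision tree. The synthetic data, the random seeds and the choice of k = 2 are assumptions made purely for this sketch.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 5)              # n=100 rows, m=5 columns of synthetic data
y = (X[:, 2] > 0.5).astype(int)   # only column 2 determines the label

k = 2
selector = SelectFromModel(
    DecisionTreeClassifier(random_state=0),
    threshold=-np.inf,   # disable the importance threshold...
    max_features=k,      # ...and keep just the k most important features
)
X_small = selector.fit_transform(X, y)       # n rows by k columns
kept = selector.get_support(indices=True)    # indices of the kept columns
print(X_small.shape, kept)
```

Because SelectFromModel is itself a transformer, it can sit in a Pipeline as a step ahead of the final estimator.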

Answer 3

Answer by Guillaume Chevalier

ML algorithms typically process tabular data. You may want to do preprocessing and post-processing of this data before and after your ML algorithm. A pipeline is a way to chain those data processing steps.

What are ML pipelines and how do they work?

A pipeline is a series of steps in which data is transformed. It comes from the old "pipe and filter" design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Machine Learning pipelines as:

Pipe and filters. The pipeline's steps process data, and they manage their inner state, which can be learned from the data.
Composites. Pipelines can be nested: for example, a whole pipeline can be treated as a single pipeline step in another pipeline. A pipeline step is not necessarily a pipeline, but a pipeline is itself at least a pipeline step by definition.
Directed Acyclic Graphs (DAG). A pipeline step's output may be sent to many other steps, and then the resulting outputs can be recombined, and so on. Side note: although pipelines are acyclic, they can process multiple items one by one, and if their state changes (e.g. using the fit_transform method each time), then they can be viewed as recurrently unfolding through time, keeping their states (think of an RNN). That's an interesting way to see pipelines for doing online learning when putting them in production and training them on more data.
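The composite and branch-and-recombine ideas can be sketched with scikit-learn's own building blocks. This is only an illustration: the data and step names are made up, and FeatureUnion covers just the simple case where one input fans out to parallel steps whose outputs are concatenated.

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]

# Composite: a whole pipeline used as a single step of another pipeline.
inner = Pipeline([("standardize", StandardScaler())])
outer = Pipeline([("inner", inner), ("rescale", MinMaxScaler())])
nested_out = outer.fit_transform(X)

# Branch and recombine: the same input goes to two steps,
# and their outputs are concatenated column-wise.
union = FeatureUnion([("std", StandardScaler()), ("minmax", MinMaxScaler())])
union_out = union.fit_transform(X)

print(nested_out.shape, union_out.shape)
```

The nested pipeline keeps the original two columns, while the union produces four (two from each branch).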

Methods of a Scikit-Learn Pipeline

Pipelines (or steps in the pipeline) must have these two methods:

“fit” to learn from the data and acquire state (e.g. a neural network's weights are such state)
“transform” (or “predict”) to actually process the data and generate a prediction.

It's also possible to call this method to chain both:

“fit_transform” to fit and then transform the data in one pass, which allows for potential code optimizations when the two methods must be called directly one after the other.
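To show how these methods chain together, here is a deliberately simplified toy pipeline in plain Python. It is not scikit-learn's actual implementation; Scale, ThresholdClassifier and ToyPipeline are hypothetical classes written only for this sketch, and they work on flat lists of numbers.

```python
class Scale:
    """Toy transformer: divides values by the maximum seen during fit()."""

    def fit(self, X, y=None):
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class ThresholdClassifier:
    """Toy predictor: learns the midpoint between the two class means."""

    def fit(self, X, y):
        pos = [x for x, label in zip(X, y) if label == 1]
        neg = [x for x, label in zip(X, y) if label == 0]
        self.threshold_ = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y=None):
        # Each transformer is fitted on the output of the previous one.
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        # The final estimator is fitted on the fully transformed data.
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # At prediction time the transformers only transform, never refit.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = ToyPipeline([("scale", Scale()), ("clf", ThresholdClassifier())])
pipe.fit([1, 2, 9, 10], [0, 0, 1, 1])
predictions = pipe.predict([0, 3, 8, 20])
print(predictions)
```

The key design point is that fit() acquires state in every step, while predict() reuses that state without touching it.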

Problems of the sklearn.pipeline.Pipeline class

Scikit-Learn's “pipe and filter” design pattern is simply beautiful. But how can it be used for Deep Learning, AutoML, and complex production-level pipelines?

Scikit-Learn had its first release in 2007, which was a pre-deep-learning era. However, it's one of the best-known and most widely adopted machine learning libraries, and is still growing. On top of that, it uses the Pipe and Filter design pattern as a software architectural style, which is what makes Scikit-Learn so fabulous, in addition to the fact that it provides algorithms ready for use. However, it has massive issues when it comes to doing the following, which we should already be able to do in 2020:

Automatic Machine Learning (AutoML),
Deep Learning Pipelines,
More complex Machine Learning pipelines.

Solutions That We've Found to Those Scikit-Learn Problems

For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and usable within modern computing projects!

Additional pipeline methods and features offered through Neuraxle

Note: if a step of a pipeline doesn't need one of the fit or transform methods, it can inherit from NonFittableMixin or NonTransformableMixin to be provided with a default implementation of that method that does nothing.

To start, it is possible for pipelines or their steps to also optionally define these methods:

“setup”, which will call the “setup” method on each of its steps. For instance, if a step contains a TensorFlow, PyTorch, or Keras neural network, the step could create its neural graph and register it to the GPU in the “setup” method before fit. Creating the graphs directly in the constructors of the steps is discouraged for several reasons, for example because the steps may be copied before being run many times with different hyperparameters within an Automatic Machine Learning algorithm that searches for the best hyperparameters for you.
“teardown”, which is the opposite of the “setup” method: it clears resources.

The following methods are provided by default to allow for managing hyperparameters:

“get_hyperparams” will return a dictionary of the hyperparameters. If your pipeline contains more pipelines (nested pipelines), then the hyperparameters' keys are chained with double-underscore “__” separators.
“set_hyperparams” will allow you to set new hyperparameters in the same format as when you get them.
“get_hyperparams_space” allows you to get the hyperparameter space, which will not be empty if you defined one. So, the only difference from “get_hyperparams” here is that you'll get statistical distributions as values instead of precise values. For instance, one hyperparameter for the number of layers could be a RandInt(1, 3), which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it.
“set_hyperparams_space” can be used to set a new space using the same hyperparameter distribution classes as in “get_hyperparams_space”.
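For comparison, scikit-learn's own Pipeline follows a similar double-underscore convention through get_params() and set_params(). The sketch below uses only scikit-learn and does not show Neuraxle's methods; the step names and the choice of LogisticRegression are assumptions made for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

inner = Pipeline([("clf", LogisticRegression(C=1.0))])
outer = Pipeline([("scale", StandardScaler()), ("model", inner)])

# get_params() returns a flat dict; nested keys are joined by "__".
params = outer.get_params()
c_before = params["model__clf__C"]

# The same key format is used when setting a value.
outer.set_params(model__clf__C=10.0)
c_after = outer.get_params()["model__clf__C"]

print(c_before, c_after)
```

These are the same keys that a grid search over a scikit-learn pipeline uses to address the parameters of individual steps.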

For more info on our suggested solutions, read the entries in the big list with links above.
