What is the difference between 'transform' and 'fit_transform' in Python sklearn?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/23838056/

Time: 2020-08-19 03:30:36  Source: igfitidea

Tags: python, python-2.7, scikit-learn

Asked by tqjustc

In the sklearn Python toolbox, sklearn.decomposition.RandomizedPCA has two methods, transform and fit_transform. The descriptions of the two methods are as follows:

[Two images: the documentation descriptions of transform and fit_transform]

But what is the difference between them?

Accepted answer by Donbeo

Here is the difference: you can use pca.transform only if you have already computed the PCA on a matrix, that is, only after the object has been fitted.

    In [12]: pc2 = RandomizedPCA(n_components=3)

    In [13]: pc2.transform(X) # can't transform because it does not know how to do it.
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-13-e3b6b8ea2aff> in <module>()
    ----> 1 pc2.transform(X)

    /usr/local/lib/python3.4/dist-packages/sklearn/decomposition/pca.py in transform(self, X, y)
        714         # XXX remove scipy.sparse support here in 0.16
        715         X = atleast2d_or_csr(X)
    --> 716         if self.mean_ is not None:
        717             X = X - self.mean_
        718 

    AttributeError: 'RandomizedPCA' object has no attribute 'mean_'

    In [14]: pc2.f<TAB>   # tab completion lists the methods that fit the model
    pc2.fit            pc2.fit_transform  

    In [14]: pc2.fit_transform(X)
    Out[14]: 
    array([[-1.38340578, -0.2935787 ],
           [-2.22189802,  0.25133484],
           [-3.6053038 , -0.04224385],
           [ 1.38340578,  0.2935787 ],
           [ 2.22189802, -0.25133484],
           [ 3.6053038 ,  0.04224385]])

If you want to use .transform, you first need to teach the transformation rule to your PCA:

In [20]: pca = RandomizedPCA(n_components=3)

In [21]: pca.fit(X)
Out[21]: 
RandomizedPCA(copy=True, iterated_power=3, n_components=3, random_state=None,
       whiten=False)

In [22]: pca.transform(z)
Out[22]: 
array([[ 2.76681156,  0.58715739],
       [ 1.92831932,  1.13207093],
       [ 0.54491354,  0.83849224],
       [ 5.53362311,  1.17431479],
       [ 6.37211535,  0.62940125],
       [ 7.75552113,  0.92297994]])

In particular, PCA's transform applies the change of basis obtained from the PCA decomposition of the matrix X to the matrix Z.
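As a rough illustration of this change of basis, the sketch below fits a PCA on one matrix and applies it to another. It rests on several assumptions not in the answer: the toy arrays X and Z are made up, and PCA with svd_solver="randomized" stands in for RandomizedPCA, which is no longer available in recent scikit-learn releases. Recent releases also raise NotFittedError rather than the bare AttributeError shown above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError

# Toy data (made up for illustration): X is used for fitting, Z is new data
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0],
              [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])
Z = np.array([[0.0, 0.0], [1.0, 2.0], [4.0, 3.0]])

pca = PCA(n_components=2, svd_solver="randomized", random_state=0)

# transform() before fit() fails: no basis has been learned yet
try:
    pca.transform(Z)
    worked_before_fit = True
except NotFittedError:
    worked_before_fit = False

pca.fit(X)                      # learns mean_ and components_ from X
Z_projected = pca.transform(Z)  # applies X's change of basis to Z

# Without whitening, transform() is centering by the fitted mean
# followed by projection onto the fitted components
Z_manual = (Z - pca.mean_) @ pca.components_.T
print(worked_before_fit, np.allclose(Z_projected, Z_manual))
```

The manual computation uses only quantities stored by fit(), which is why transform() cannot run before fitting.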

Answer by Ronak Poriya

In the scikit-learn estimator API:

fit(): learns the model parameters from the training data.

transform(): applies the parameters learned by fit() to a data set to produce the transformed data set.

fit_transform(): combines fit() and transform() on the same data set.
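The three methods can be sketched with a tiny hand-written transformer. MeanCenterer is a hypothetical class invented for this illustration, not part of scikit-learn; it assumes only that BaseEstimator and TransformerMixin are available, the latter supplying fit_transform() on top of the fit() and transform() defined here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # Learn the parameters from the data and store them on the estimator
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows fit(X).transform(X)

    def transform(self, X):
        # Apply the stored parameters; nothing is re-learned here
        return np.asarray(X, dtype=float) - self.mean_

X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

centerer = MeanCenterer()
train_centered = centerer.fit_transform(X_train)  # fit + transform in one call
new_centered = centerer.transform(X_new)          # reuses mean_ from X_train
print(train_centered)
print(new_centered)
```

Here the column means learned from X_train are [2, 20], so the new row [2, 20] is mapped to zeros even though it was never seen during fitting.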

Check out Chapter 4 of this book and the answer on Stack Exchange for more clarity.

Answer by shaurya uppal

These methods are used to center and scale the features of a given data set. This basically normalizes the data so that all features are on a comparable scale.

For this, we use the Z-score method:

Z-score: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the feature.

We do this on the training set.

1. fit(): calculates the parameters μ and σ and stores them on the object.

2. transform(): applies the transformation to a particular data set using these calculated parameters.

3. fit_transform(): combines fit() and transform() to fit and transform a data set in one call.

Code snippet for feature scaling/standardisation (after train_test_split):

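from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Learn μ and σ from the training set and scale it
X_train = sc.fit_transform(X_train)
# Scale the test set with the μ and σ learned from the training set
X_test = sc.transform(X_test)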

We apply the same transformation to the testing set, using the same two parameter values μ and σ that were computed from the training set.
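A small sketch of where μ and σ live, assuming made-up one-column arrays. StandardScaler stores them as mean_ and scale_ after fitting, so the test-set result can be reproduced by hand with the Z-score formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration)
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # computes mu and sigma, then scales
X_test_scaled = sc.transform(X_test)        # reuses the training mu and sigma

mu, sigma = sc.mean_, sc.scale_             # parameters stored by fit()
X_test_manual = (X_test - mu) / sigma       # z = (x - mu) / sigma
print(mu, sigma)
print(X_test_scaled, X_test_manual)
```

The test value is scaled with the training mean of 2, not with its own mean, which is the point of calling transform() rather than fit_transform() on the test set.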

Answer by Nikita Jain

Generic difference between the methods:

  • fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.
  • fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return the document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.
  • transform(raw_documents): Transform documents to a document-term matrix. Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Both fit_transform and transform return the same kind of result, a document-term matrix; applied to the same documents, they return the same matrix.
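The method descriptions above match scikit-learn's text vectorizers, so the sketch below assumes CountVectorizer with its default settings; the documents are made up. It shows that transform() on new text only counts words already in the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (made up for illustration)
train_docs = ["the cat sat", "the dog sat down"]
new_docs = ["the cat ran"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)  # learn vocabulary and count
vocabulary = sorted(vectorizer.vocabulary_)          # terms in column order

same_counts = vectorizer.transform(train_docs)       # same documents, same matrix
new_counts = vectorizer.transform(new_docs)          # "ran" is not in the vocabulary

print(vocabulary)
print(train_counts.toarray())
print(new_counts.toarray())
```

The word "ran" never appeared during fitting, so it contributes no column and no count to the new matrix.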

Source

Answer by a zEnItH

Here is the basic difference between .fit() and .fit_transform():

.fit():

is typically used in supervised learning, with two objects/parameters (x, y), to fit the model and make it run, where we know what we are going to predict.

.fit_transform():

is typically used in unsupervised learning, with one object/parameter (x), where we don't know what we are going to predict.

Answer by Rafa Nogales

Why and when to use each one:

All the responses are quite good, but I would like to emphasize WHY and WHEN to use each method.

fit(), transform(), fit_transform()

Usually we have a supervised learning problem with (X, y) as our dataset, and we split it into training data and test data:

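from sklearn.model_selection import train_test_split

# X, y: the full dataset; `model` is any transformer (scaler, vectorizer, ...)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit on the training data only, then reuse the fitted model on the test data
X_train_vectorized = model.fit_transform(X_train)
X_test_vectorized = model.transform(X_test)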

Imagine we are fitting a tokenizer: if we fit it on X, we are including the test data in the tokenizer. I have seen this error many times!

The correct approach is to fit ONLY with X_train, because you don't know "your future data", so you cannot use X_test data for fitting anything!

Then you can transform your test data, but separately; that's why there are different methods.
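To make the leakage concrete, here is a hedged sketch that uses MinMaxScaler as the "model"; that choice and the toy arrays are assumptions for illustration, not taken from the answer. Fitting on the training data alone keeps the learned range independent of the test data, while fitting on everything lets the test data change the learned parameters.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy split (made up for illustration); the test set contains a larger value
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[5.0], [20.0]])

# Correct: learn the range from the training data only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # 20 maps outside [0, 1], as it should

# Leaky: the test data influences the learned range
leaky_scaler = MinMaxScaler().fit(np.vstack([X_train, X_test]))

print(scaler.data_max_, leaky_scaler.data_max_)
print(X_test_scaled)
```

With the correct scaler the maximum is 10, learned from X_train; the leaky one reports 20, a value that would not be known in advance for genuinely new data.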

Final tip: X_train_transformed = model.fit_transform(X_train) is equivalent to X_train_transformed = model.fit(X_train).transform(X_train), but the first one can be faster.

Note that what I call "model" will usually be a scaler, a tf-idf transformer, another kind of vectorizer, a tokenizer...

Answer by DhruvStan7

In layman's terms, fit_transform means doing some calculation and then doing a transformation (say, calculating the column means of some data and then using them to replace the missing values). So for the training set, you need to both calculate and transform.

But for the testing set, machine learning applies predictions based on what was learned from the training set, so nothing needs to be calculated again; it just performs the transformation.
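The missing-value example can be sketched with SimpleImputer, assuming the mean strategy and made-up arrays. The column means are calculated once on the training set and then reused, unchanged, to fill the gaps in the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration); np.nan marks a missing value
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # calculate means, then fill
X_test_filled = imputer.transform(X_test)        # only fill, using training means

print(imputer.statistics_)  # column means learned from X_train
print(X_test_filled)
```

The test row has no observed values at all, yet it can still be filled, because the means come from the training set.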