What is the difference between 'transform' and 'fit_transform' in Python sklearn
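Disclaimer: this page is a translated reproduction of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must also follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/23838056/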
What is the difference between 'transform' and 'fit_transform' in sklearn?
Asked by tqjustc
In the sklearn Python toolbox, there are two functions, transform and fit_transform, on sklearn.decomposition.RandomizedPCA. The descriptions of the two functions are as follows:
But what is the difference between them?
Accepted answer by Donbeo
Here is the difference: you can use pca.transform only if you have already computed (fitted) the PCA on a matrix.
In [12]: pc2 = RandomizedPCA(n_components=3)
In [13]: pc2.transform(X) # can't transform because it does not know how to do it.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-13-e3b6b8ea2aff> in <module>()
----> 1 pc2.transform(X)
/usr/local/lib/python3.4/dist-packages/sklearn/decomposition/pca.py in transform(self, X, y)
714 # XXX remove scipy.sparse support here in 0.16
715 X = atleast2d_or_csr(X)
--> 716 if self.mean_ is not None:
717 X = X - self.mean_
718
AttributeError: 'RandomizedPCA' object has no attribute 'mean_'
In [14]: pc2.fit_transform(X)
Out[14]:
array([[-1.38340578, -0.2935787 ],
[-2.22189802, 0.25133484],
[-3.6053038 , -0.04224385],
[ 1.38340578, 0.2935787 ],
[ 2.22189802, -0.25133484],
[ 3.6053038 , 0.04224385]])
If you want to use .transform, you first need to teach the transformation rule to your PCA by calling fit:
In [20]: pca = RandomizedPCA(n_components=3)
In [21]: pca.fit(X)
Out[21]:
RandomizedPCA(copy=True, iterated_power=3, n_components=3, random_state=None,
whiten=False)
In [22]: pca.transform(z)
Out[22]:
array([[ 2.76681156, 0.58715739],
[ 1.92831932, 1.13207093],
[ 0.54491354, 0.83849224],
[ 5.53362311, 1.17431479],
[ 6.37211535, 0.62940125],
[ 7.75552113, 0.92297994]])
In particular, PCA's transform applies the change of basis obtained from the PCA decomposition of the matrix X to the matrix z.
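The session above uses RandomizedPCA, which recent scikit-learn releases no longer provide. A minimal sketch of the same behaviour follows, assuming a recent scikit-learn in which PCA(svd_solver="randomized") takes its place. The matrices X and Z are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: X is used to learn the basis, Z is projected onto it.
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0],
              [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])
Z = X + 3.0

pca = PCA(n_components=2, svd_solver="randomized", random_state=0)

# transform() before fit() fails because no mean or basis has been learned.
# The error is an AttributeError (in many versions its subclass NotFittedError).
try:
    pca.transform(Z)
    failed_before_fit = False
except AttributeError:
    failed_before_fit = True

pca.fit(X)                      # learn mean_ and components_ from X
Z_projected = pca.transform(Z)  # apply X's change of basis to Z

# With the default whiten=False, transform() centers with X's mean and
# projects onto X's components.
manual = (Z - pca.mean_) @ pca.components_.T
print(failed_before_fit, np.allclose(Z_projected, manual))
```

The manual projection at the end is only there to make visible which fitted attributes transform relies on.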
Answer by Ronak Poriya
In the scikit-learn estimator API:
fit(): learns the model parameters from the training data.
transform(): applies the parameters learned by fit() to a data set to produce the transformed data set.
fit_transform(): combines fit() and transform() on the same data set.
Check out Chapter 4 of this book and this answer from stackexchange for more clarity.
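To make the three roles concrete, here is a minimal sketch of a hand-written transformer. The class MeanCenterer and its data are hypothetical, invented only for illustration. It assumes scikit-learn's TransformerMixin, which supplies fit_transform() as fit followed by transform.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts the per-column training mean."""

    def fit(self, X, y=None):
        # fit(): learn the parameters (here, column means) from the data.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows fit(X).transform(X)

    def transform(self, X):
        # transform(): apply the learned parameters to any data set.
        return np.asarray(X, dtype=float) - self.mean_


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

centerer = MeanCenterer()
# fit_transform() is inherited from TransformerMixin: fit, then transform.
train_centered = centerer.fit_transform(X_train)
new_centered = centerer.transform(X_new)  # reuses the training means
print(train_centered)
print(new_centered)
```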
Answer by shaurya uppal
These methods are used to center and feature-scale the given data. This basically helps to bring the features onto a common scale.
For this, we use the Z-score method.
We do this on the training set of data.
1. fit(): computes the parameters μ and σ and saves them as internal state.
2. transform(): applies the transformation to a particular dataset using these computed parameters.
3. fit_transform(): combines fit() and transform() to fit and transform the dataset in one step.
Code snippet for feature scaling/standardisation (after train_test_split):
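from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Learn μ and σ from the training set and scale it
X_train = sc.fit_transform(X_train)
# Reuse the training μ and σ to scale the test set
X_test = sc.transform(X_test)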
We apply the same transformation to the test set, using the same two parameter values (μ and σ) that were computed from the training set.
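A small self-contained sketch of this, with made-up single-feature arrays standing in for X_train and X_test, shows that the test set is scaled as z = (x - μ) / σ with the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mu and sigma, then scales
X_test_scaled = sc.transform(X_test)        # scales with the training mu and sigma

# StandardScaler stores mu in mean_ and sigma in scale_.
mu, sigma = sc.mean_[0], sc.scale_[0]
manual_test = (X_test - mu) / sigma
print(mu, sigma)
print(np.allclose(X_test_scaled, manual_test))
```

The scaled test set does not generally have zero mean or unit variance itself, because μ and σ come from the training set.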
Answer by Nikita Jain
General difference between the methods (the signatures below are those of scikit-learn's text vectorizers, such as CountVectorizer):
- fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.
- fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return the document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.
- transform(raw_documents): Transform documents to a document-term matrix. Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.
Both fit_transform and transform return the same kind of result: a document-term matrix.
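As a rough illustration of the list above, here is a sketch assuming CountVectorizer with its default settings and two tiny made-up corpora. It shows that transform counts only tokens from the fitted vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora.
train_docs = ["the cat sat", "the dog sat"]
new_docs = ["the cat barked"]  # "barked" was never seen during fit

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learn vocabulary, then count
new_matrix = vectorizer.transform(new_docs)          # count with that vocabulary

vocabulary = sorted(vectorizer.vocabulary_)  # learned tokens, in column order
print(vocabulary)
print(train_matrix.toarray())
print(new_matrix.toarray())
```

Both matrices have one column per learned token, so the unseen word "barked" is simply dropped from the new document.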
Answer by a zEnItH
Here is the basic difference between .fit() and .fit_transform():
.fit():
is used in supervised learning, where it takes two arguments (X, y) to fit the model, and we know what we are going to predict.
.fit_transform():
is typically used in unsupervised steps such as transformers, which need only one argument (X), where we don't know what we are going to predict.
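A minimal sketch of that split on made-up data follows. One caveat: in scikit-learn the distinction is really between predictors and transformers, since both expose fit() and fit_transform() accepts an optional y. LogisticRegression and StandardScaler are chosen here only as familiar examples of each kind.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature, binary labels.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# A supervised predictor: fit(X, y) learns a mapping, predict(X) uses it.
clf = LogisticRegression().fit(X, y)
predictions = clf.predict(X)

# A transformer: fit(X) learns statistics, transform(X) applies them,
# and fit_transform(X) does both in one call.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(hasattr(clf, "fit_transform"), hasattr(scaler, "predict"))
```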
Answer by Rafa Nogales
Why and when to use each one:
All the responses are quite good, but I would emphasize WHY and WHEN to use each method.
fit(), transform(), fit_transform()
Usually we have a supervised learning problem with (X, y) as our dataset, and we split it into training data and test data:
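from sklearn.model_selection import train_test_split

# X, y: the full dataset; model: any transformer (scaler, vectorizer, ...)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit on the training data only, then transform it
X_train_vectorized = model.fit_transform(X_train)
# Transform the test data with what was learned from the training data
X_test_vectorized = model.transform(X_test)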
Imagine we are fitting a tokenizer: if we fit on all of X, we leak the test data into the tokenizer. I have seen this error many times!
The correct approach is to fit ONLY with X_train: you don't know "your future data", so you cannot use X_test for fitting anything!
Then you can transform your test data, but separately; that's why there are different methods.
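The leakage described above can be made visible with a small sketch. It uses MinMaxScaler as a stand-in for "model" and a made-up one-feature dataset (both are assumptions for illustration, not part of the original answer):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: one feature with values 0..9 and dummy labels.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Correct: the scaler only ever sees the training rows.
scaler = MinMaxScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# Leaky: fitting on all of X lets test rows influence the learned min/max.
leaky_scaler = MinMaxScaler().fit(X)

print(scaler.data_min_, scaler.data_max_)
print(leaky_scaler.data_min_, leaky_scaler.data_max_)
```

Whenever the two scalers report different statistics, the leaky one has absorbed information from rows that were supposed to stay unseen.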
Final tip: X_train_transformed = model.fit_transform(X_train)
is equivalent to:
X_train_transformed = model.fit(X_train).transform(X_train)
but the first one can be faster, since some estimators implement it more efficiently.
Note that what I call "model" will usually be a scaler, a tfidf transformer, another kind of vectorizer, a tokenizer...
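The equivalence in the final tip can be checked directly. The sketch below assumes TfidfVectorizer as the "model" and uses a made-up list of documents. It checks only that the results match and does not measure speed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents.
X_train = ["fit learns parameters", "transform applies parameters",
           "fit_transform does both"]

one_step = TfidfVectorizer().fit_transform(X_train)
two_steps = TfidfVectorizer().fit(X_train).transform(X_train)

# Both are sparse matrices; compare them as dense arrays.
same_result = np.allclose(one_step.toarray(), two_steps.toarray())
print(same_result)
```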
Answer by DhruvStan7
In layman's terms, fit_transform means doing some calculation and then doing the transformation (say, calculating the column means from some data and then replacing the missing values). So for the training set, you need to both calculate and transform.
But for the test set, machine learning applies what was learned from the training set, so it doesn't need to calculate anything; it just performs the transformation.
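The missing-value example from this answer can be sketched with scikit-learn's SimpleImputer. The arrays are made up, and SimpleImputer is an assumed choice, since the answer does not name a specific class:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values marked as NaN.
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_test = np.array([[np.nan, np.nan], [5.0, 50.0]])

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # compute column means, then fill
X_test_filled = imputer.transform(X_test)        # fill with the training means only

print(imputer.statistics_)
print(X_test_filled)
```

The test set's missing entries are filled with the training column means (2.0 and 20.0 here). Nothing is recomputed from the test data.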