Keeping TFIDF results to predict new content using Scikit for Python

Question

Asked by lol.Wen

I am using sklearn in Python to do some clustering. I've trained on 200,000 documents, and the code below works well.

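from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = open("token_from_xml.txt")  # iterated line by line: one document per line
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(30)
kmresult = km.fit(tfidf).predict(tfidf)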

But when I have new testing content, I'd like to assign it to the existing clusters I trained. So I'm wondering how to save the IDF result, so that I can compute TFIDF for the new testing content and make sure the result has the same array length as the training data.

Thanks in advance.

UPDATE

I may need to save the "transformer" or "tfidf" variable to a file (txt or another format), if one of them contains the trained IDF result.

UPDATE

For example, I have this training data:

["a", "b", "c"]
["a", "b", "d"]

After TFIDF, the result will contain 4 features (a, b, c, d).

When I test:

["a", "c", "d"]

to see which cluster (already made by k-means) it belongs to, TFIDF will only give a result with 3 features (a, c, d), so the k-means prediction will fail. (If I test ["a", "b", "e"], there may be other problems.)
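
The mismatch described here can be reproduced with a minimal sketch. The toy documents below are assumptions that mirror the example (three-letter tokens are used because the default scikit-learn tokenizer drops single-character tokens), and TfidfVectorizer stands in for the CountVectorizer plus TfidfTransformer pair:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for ["a", "b", "c"], ["a", "b", "d"] and ["a", "c", "d"].
train_docs = ["aaa bbb ccc", "aaa bbb ddd"]
test_docs = ["aaa ccc ddd"]

# Fitting on the training data learns a 4-term vocabulary.
train_tfidf = TfidfVectorizer().fit_transform(train_docs)

# Fitting a fresh vectorizer on the test data learns only 3 terms,
# so its columns no longer line up with the training matrix.
refit_tfidf = TfidfVectorizer().fit_transform(test_docs)

print(train_tfidf.shape[1], refit_tfidf.shape[1])
```

A k-means model fitted on the first matrix cannot score the second one, because the column counts differ.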

So how can I store the feature list for use with testing data (and, better still, store it in a file)?

UPDATE

Solved, see answers below.

Answer 1

Accepted answer by lol.Wen

I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

Code below:

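import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_
with open("feature.pkl", "wb") as handle:
    pickle.dump(vectorizer.vocabulary_, handle)

# Load it later
transformer = TfidfTransformer()
with open("feature.pkl", "rb") as handle:
    loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(handle))
# Note: fit_transform on the transformer recomputes the IDF weights from the new documents
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))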

That works. tfidf will have the same feature length as the training data.
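
If the trained IDF weights should carry over as well, and not only the feature length, one option is to pickle the fitted vectorizer and transformer themselves and call only transform() on new content. This is a minimal sketch under that assumption; the file name and the temporary directory are hypothetical:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["aaa bbb ccc", "aaa bbb ddd"]

# Training: fit both steps on the training corpus.
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
train_tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

# Save both fitted objects (hypothetical path inside a temp directory).
model_path = os.path.join(tempfile.mkdtemp(), "tfidf_model.pkl")
with open(model_path, "wb") as handle:
    pickle.dump({"vectorizer": vectorizer, "transformer": transformer}, handle)

# Later: load them and call transform() only, never fit(), on new content.
with open(model_path, "rb") as handle:
    loaded = pickle.load(handle)
new_counts = loaded["vectorizer"].transform(["aaa ccc eee"])
new_tfidf = loaded["transformer"].transform(new_counts)

print(new_tfidf.shape[1] == train_tfidf.shape[1])
print(np.allclose(loaded["transformer"].idf_, transformer.idf_))
```

The unseen token "eee" is simply ignored, since it is not in the trained vocabulary.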

Answer 2

Answered by JAB

You can do the vectorization and tfidf transformation in one stage:

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()

Then fit and transform the training data:

tfidf = vec.fit_transform(training_data)

and use the fitted tfidf model to transform the unseen data:

from sklearn.cluster import KMeans

unseen_tfidf = vec.transform(unseen_data)
km = KMeans(30)
kmresult = km.fit(tfidf).predict(unseen_tfidf)
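
The three steps above can be put together into one runnable sketch. The toy documents, the choice of 2 clusters, and the n_init and random_state settings are assumptions made only so the example runs on its own:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the real corpus.
training_data = ["aaa bbb ccc", "aaa bbb ddd", "xxx yyy zzz", "xxx yyy www"]
unseen_data = ["aaa bbb eee", "xxx zzz"]

# Fit the vectorizer once, on the training data only.
vec = TfidfVectorizer()
tfidf = vec.fit_transform(training_data)

# Fit k-means on the training matrix.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
train_labels = km.fit_predict(tfidf)

# transform() reuses the trained vocabulary and IDF weights, so the
# unseen matrix has the same columns and can be passed to predict().
unseen_tfidf = vec.transform(unseen_data)
unseen_labels = km.predict(unseen_tfidf)

print(train_labels, unseen_labels)
```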

Answer 3

Answered by user123

If you want to store the feature list for the testing data for future use, you can do this:

import pickle

tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

# Store the TF-IDF matrix
with open("x_result.pkl", "wb") as handle:
    pickle.dump(tfidf, handle)

# Load the TF-IDF matrix
with open("x_result.pkl", "rb") as handle:
    tfidf = pickle.load(handle)

Answer 4

Answered by GoatWang

A simpler solution: just use the joblib library, as the documentation says:

# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0
feature_name = vectorizer.get_feature_names_out()
tfidf = TfidfTransformer()
tfidf.fit(X)

# save your model to disk
joblib.dump(tfidf, 'tfidf.pkl')

# load your model
tfidf = joblib.load('tfidf.pkl')
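
Because the saved TfidfTransformer still needs the same fitted CountVectorizer in front of it, one possible variation is to wrap both steps in a scikit-learn Pipeline and dump that single object. This is a sketch and not part of the answer above; the toy texts and the file path are assumptions:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy texts standing in for the real corpus.
texts = ["aaa bbb ccc", "aaa bbb ddd"]

# Fit the counting and TF-IDF steps together.
tfidf_pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())
train_tfidf = tfidf_pipeline.fit_transform(texts)

# Save the whole fitted pipeline in one file (hypothetical temp path).
model_path = os.path.join(tempfile.mkdtemp(), "tfidf_pipeline.pkl")
joblib.dump(tfidf_pipeline, model_path)

# Load it later and transform new content with the trained state.
loaded_pipeline = joblib.load(model_path)
new_tfidf = loaded_pipeline.transform(["aaa ccc eee"])

print(new_tfidf.shape[1] == train_tfidf.shape[1])
```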

Answer 5

Answered by Arjun Mishra

Instead of using the CountVectorizer to store the vocabulary, the vocabulary of the TfidfVectorizer can be used directly.

Training phase:

import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf based vectors
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english", lowercase=True, max_features=500000)

# Fit the model
tf_transformer = tf.fit(corpus)

# Dump the file
with open("tfidf1.pkl", "wb") as handle:
    pickle.dump(tf_transformer, handle)


# Testing phase
with open("tfidf1.pkl", "rb") as handle:
    tf1 = pickle.load(handle)

# Create new TfidfVectorizer with old vocabulary
tf1_new = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english", lowercase=True,
                          max_features=500000, vocabulary=tf1.vocabulary_)
X_tf1 = tf1_new.fit_transform(new_corpus)

fit_transform works here because we are reusing the old vocabulary, so the new documents are mapped onto the same columns as the training data. If the fitted vectorizer were still in memory, you would just call transform on the test data; that also maps the new test documents onto the vocabulary of the training vectorizer, which is exactly what we are doing here. Note that fit_transform also re-estimates the IDF weights from new_corpus, so what this approach carries over from training is the vocabulary.
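
The difference between the two routes can be checked directly. The sketch below uses assumed toy corpora and default vectorizer settings, and an in-memory pickle round trip in place of the tfidf1.pkl file; it compares calling transform on the loaded vectorizer with calling fit_transform on a new vectorizer built from the old vocabulary:

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora.
corpus = ["aaa bbb ccc", "aaa bbb ddd", "aaa eee fff"]
new_corpus = ["ccc ddd", "ccc eee"]

# Training phase: fit, then pickle (in memory here instead of a file).
tf = TfidfVectorizer().fit(corpus)
tf1 = pickle.loads(pickle.dumps(tf))

# Route 1: transform with the loaded vectorizer keeps the trained IDF.
kept = tf1.transform(new_corpus)

# Route 2: new vectorizer with the old vocabulary, fitted on new data.
tf1_new = TfidfVectorizer(vocabulary=tf1.vocabulary_)
refit = tf1_new.fit_transform(new_corpus)

same_columns = kept.shape == refit.shape
loaded_idf_matches = np.allclose(tf1.idf_, tf.idf_)
refit_idf_matches = np.allclose(tf1_new.idf_, tf.idf_)
print(same_columns, loaded_idf_matches, refit_idf_matches)
```

Both routes give matrices with the same columns; only the first keeps the IDF weights learned from the training corpus.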
