Keep TFIDF result for predicting new content using Scikit for Python
Original question: http://stackoverflow.com/questions/29788047/
Source: StackOverFlow, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by lol.Wen
I am using sklearn in Python to do some clustering. I've trained on 200,000 data items, and the code below works well.
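from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = open("token_from_xml.txt")  # iterating the file yields one document per line
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(30)
kmresult = km.fit(tfidf).predict(tfidf)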
But when I have new testing content, I'd like to assign it to the existing clusters I've trained. So I'm wondering how to save the IDF result, so that I can do TFIDF for the new testing content and make sure the result for the new testing content has the same array length.
Thanks in advance.
UPDATE
I may need to save the "transformer" or "tfidf" variable to a file (txt or other), if one of them contains the trained IDF result.
UPDATE
For example, I have the training data:
["a", "b", "c"]
["a", "b", "d"]
When I do TFIDF on it, the result will contain 4 features (a,b,c,d).
When I test:
["a", "c", "d"]
to see which cluster (already made by k-means) it belongs to, TFIDF will only give a result with 3 features (a,c,d), so the clustering in k-means will fail. (If I test ["a", "b", "e"], there may be other problems.)
So how can I store the feature list for use with testing data (and, even better, store it in a file)?
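The mismatch described above can be reproduced with a small sketch. It assumes the toy documents from the example joined into strings, and it widens `token_pattern` (an assumption made only for this illustration) because the default pattern drops single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy data mirroring the example in the question.
train_docs = ["a b c", "a b d"]
test_docs = ["a c d"]

# Assumption for this sketch: keep one-character tokens.
token_pattern = r"(?u)\b\w+\b"

# Vocabulary learned from the training data: a, b, c, d.
train_vectorizer = CountVectorizer(token_pattern=token_pattern)
train_tfidf = TfidfTransformer().fit_transform(
    train_vectorizer.fit_transform(train_docs)
)

# A fresh vectorizer fitted on the test data learns only a, c, d.
fresh_vectorizer = CountVectorizer(token_pattern=token_pattern)
fresh_tfidf = TfidfTransformer().fit_transform(
    fresh_vectorizer.fit_transform(test_docs)
)

print(train_tfidf.shape)  # (2, 4)
print(fresh_tfidf.shape)  # (1, 3)
```

A k-means model fitted on the 4-column matrix cannot score the 3-column one, which is why the fitted vocabulary has to be reused for new content.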
UPDATE
Solved, see answers below.
Accepted answer by lol.Wen
I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it via CountVectorizer(decode_error="replace",vocabulary=vectorizer.vocabulary_)
Code below:
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_
with open("feature.pkl", "wb") as handle:
    pickle.dump(vectorizer.vocabulary_, handle)
# Load it later
with open("feature.pkl", "rb") as handle:
    loaded_vocabulary = pickle.load(handle)
transformer = TfidfTransformer()
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=loaded_vocabulary)
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))
That works. tfidf will have the same feature length as the training data.
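Note that the code above calls fit_transform on the new document, so the IDF weights are estimated from the new data rather than from the training corpus. If the goal is to keep the trained IDF as well, one possible variation (a sketch, not the accepted answer's method) is to pickle both fitted objects and call only transform later. The file name `tfidf_model.pkl` and the temporary directory are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])

# Fit both steps on the training corpus.
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
train_tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

# Persist the fitted objects together (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "tfidf_model.pkl")
with open(model_path, "wb") as handle:
    pickle.dump((vectorizer, transformer), handle)

# Later: load them and only transform, so nothing is re-fitted.
with open(model_path, "rb") as handle:
    loaded_vectorizer, loaded_transformer = pickle.load(handle)

new_tfidf = loaded_transformer.transform(
    loaded_vectorizer.transform(np.array(["aaa ccc eee"]))
)

# Same number of columns as the training matrix; "eee" is not in the
# training vocabulary, so it is ignored.
print(train_tfidf.shape[1], new_tfidf.shape[1])
```

Because only transform is called after loading, both the vocabulary and the IDF weights come from the training corpus.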
Answer by JAB
You can do the vectorization and tfidf transformation in one stage:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
then fit and transform on the training data:
tfidf = vec.fit_transform(training_data)
and use the fitted tfidf model to transform the unseen data:
from sklearn.cluster import KMeans

unseen_tfidf = vec.transform(unseen_data)
km = KMeans(30)
kmresult = km.fit(tfidf).predict(unseen_tfidf)
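A self-contained sketch of the same flow follows. The toy documents, the cluster count of 2, and the `n_init` and `random_state` settings are assumptions chosen so the example runs on a tiny corpus; they are not part of the answer.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for training_data and unseen_data.
training_data = [
    "apple banana fruit",
    "banana fruit salad",
    "python code bug",
    "python code test",
]
unseen_data = ["fruit salad apple", "code bug test"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(training_data)

# Fit the clusters on the training matrix only.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(tfidf)

# transform (not fit_transform) keeps the training vocabulary and IDF weights.
unseen_tfidf = vec.transform(unseen_data)
labels = km.predict(unseen_tfidf)

print(tfidf.shape[1] == unseen_tfidf.shape[1])  # True
print(labels)
```

The unseen matrix has the same columns as the training matrix, so predict can place each new document in one of the existing clusters.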
Answer by user123
If you want to store the feature list for testing data for future use, you can do this:
import pickle

tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
# store the content
with open("x_result.pkl", "wb") as handle:
    pickle.dump(tfidf, handle)
# load the content
with open("x_result.pkl", "rb") as handle:
    tfidf = pickle.load(handle)
Answer by GoatWang
A simpler solution: just use the joblib library, as the documentation says:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
feature_name = vectorizer.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
tfidf = TfidfTransformer()
tfidf.fit(X)
# save your fitted model to disk
joblib.dump(tfidf, 'tfidf.pkl')
# load your model
tfidf = joblib.load('tfidf.pkl')
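Since the original goal was to assign new content to existing clusters, the same joblib approach could plausibly be extended to the whole chain. The sketch below assumes a scikit-learn Pipeline wrapping the vectorizer, the tf-idf step and k-means; the toy corpus, the file name `tfidf_kmeans.pkl` and the KMeans settings are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for `texts`.
texts = [
    "apple banana fruit",
    "banana fruit salad",
    "python code bug",
    "python code test",
]

# Chain the vectorizer, the tf-idf weighting and the clustering step.
model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
model.fit(texts)

# Save the whole fitted pipeline to disk (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "tfidf_kmeans.pkl")
joblib.dump(model, model_path)

# Load it later and assign new documents to the existing clusters.
loaded = joblib.load(model_path)
new_texts = ["fruit salad apple", "code bug test"]
print(loaded.predict(new_texts))
```

Saving one pipeline object keeps the vocabulary, the IDF weights and the cluster centres together, so they cannot drift out of sync between training and prediction.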
Answer by Arjun Mishra
Instead of using the CountVectorizer for storing the vocabulary, the vocabulary of the TfidfVectorizer can be used directly.
Training phase:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf based vectors
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english", lowercase=True, max_features=500000)
# Fit the model
tf_transformer = tf.fit(corpus)
# Dump the fitted vectorizer to a file
pickle.dump(tf_transformer, open("tfidf1.pkl", "wb"))

# Testing phase
tf1 = pickle.load(open("tfidf1.pkl", 'rb'))
# Create a new TfidfVectorizer with the old vocabulary
tf1_new = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english", lowercase=True,
                          max_features=500000, vocabulary=tf1.vocabulary_)
X_tf1 = tf1_new.fit_transform(new_corpus)
fit_transform works here because we are using the old vocabulary. If you were not storing the tfidf, you would have just used transform on the test data. Even when you do a transform there, the new documents from the test data are being "fit" to the vocabulary of the vectorizer from the training data. That is exactly what we are doing here. In this approach, the only thing we store and re-use from the tfidf vectorizer is the vocabulary.
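The two options mentioned above can be compared with a small sketch. It assumes toy stand-ins for `corpus` and `new_corpus` and pickles in memory for brevity. Both routes give the same columns, but calling transform on the loaded vectorizer keeps the IDF weights learned from the training corpus, while fit_transform with the old vocabulary re-estimates them from the new documents.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for corpus and new_corpus.
corpus = ["aaa bbb ccc", "aaa bbb ddd", "aaa ccc ccc"]
new_corpus = ["aaa ccc eee", "ccc ddd"]

tf = TfidfVectorizer(analyzer="word", lowercase=True)
tf.fit(corpus)

# Round-trip the fitted vectorizer through pickle (in memory, for brevity).
tf1 = pickle.loads(pickle.dumps(tf))

# Option A: transform with the loaded vectorizer; IDF comes from `corpus`.
X_transform = tf1.transform(new_corpus)

# Option B (as in the answer): new vectorizer, old vocabulary, fitted on the
# new data; IDF is re-estimated from `new_corpus`.
tf1_new = TfidfVectorizer(
    analyzer="word", lowercase=True, vocabulary=tf1.vocabulary_
)
X_refit = tf1_new.fit_transform(new_corpus)

print(X_transform.shape == X_refit.shape)   # True: same columns either way
print(np.allclose(tf1.idf_, tf.idf_))       # True: loaded IDF matches training
print(np.allclose(tf1.idf_, tf1_new.idf_))  # False: IDF was re-learned
```

Which option is preferable depends on whether the weights should reflect the training corpus or the new documents; for assigning new content to clusters trained earlier, reusing the training IDF is usually the consistent choice.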