CountVectorizer method get_feature_names() produces codes but not words
Disclaimer: this page is an English–Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/47419633/
Asked by Dmitrij Burlaj
I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features the vectorizer produces, but instead of words I get a list of number-like codes. What does this mean, and how do I deal with it? Here is my code:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()  # renamed to get_feature_names_out() in scikit-learn 1.0+
And I got the following output:
[u'00',
u'000',
u'0000',
u'00000',
u'000000000000000000',
u'00001',
u'000017',
u'00001_copy_1',
u'00002',
u'000044392000001',
u'0001',
u'00012',
u'0004',
u'0005',
u'00077d3',
and so on.
I need real feature names (words), not these codes. Can anybody help me, please?
UPDATE: I managed to deal with this problem, but now when I look at my words I see many entries that are not actually words but senseless strings of letters (see the attached screenshot). Does anybody know how to filter out these tokens before I use CountVectorizer?
Answered by dhanush-ai1990
You are using min_df=1, which keeps every word found in at least one document, i.e. all of them. min_df is itself a hyperparameter you can raise to remove rarely occurring terms. I would also recommend using spaCy to tokenize the words and join them back into strings before giving the text to CountVectorizer.
Note: the feature names you see are actually part of your vocabulary; they are just noise. If you want to remove them, set min_df > 1.
Answered by Herc01
Here is how to get exactly what you want:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit_transform(df['message_encoding'])
feat_dict = vectorizer.vocabulary_.keys()  # the vocabulary terms (feature names)