Python: How to get a bag of words from textual data?

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/15507172/


How to get bag of words from textual data?

python, machine-learning, text-processing

Asked by hshed

I am working on a prediction problem using a large textual dataset, and I am implementing a Bag of Words model.


What is the best way to get the bag of words? Right now, I have tf-idf scores for the various words, and the number of words is too large to use for further assignments. If I use the tf-idf criterion, what should the tf-idf threshold be for getting the bag of words? Or should I use some other algorithm? I am using Python.

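For reference, a minimal sketch (assuming scikit-learn is available; the parameter values and example documents are only illustrative) that prunes the vocabulary while building tf-idf features, instead of picking a raw tf-idf threshold by hand:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["John likes to watch movies. Mary likes movies too.",
        "John also likes to watch football games."]

vectorizer = TfidfVectorizer(
    max_features=10000,   # keep at most the N most frequent terms
    min_df=2,             # drop terms that appear in fewer than 2 documents
    max_df=0.95,          # drop terms that appear in more than 95% of documents
)
X = vectorizer.fit_transform(docs)
print(X.shape)            # (number of documents, size of pruned vocabulary)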

Answered by Paddy3118

Using the collections.Counter class


>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
   'John also likes to watch football games.']
>>> bagsofwords = [ collections.Counter(re.findall(r'\w+', txt))
            for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
>>> 
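
If a fixed-length feature matrix is needed downstream, one possible follow-up (a sketch, not part of the original answer, assuming scikit-learn is installed) is to feed these Counter objects to DictVectorizer:

from sklearn.feature_extraction import DictVectorizer

# Turn the list of Counter objects into a document-term matrix
dv = DictVectorizer(sparse=False)
matrix = dv.fit_transform(bagsofwords)
print(dv.get_feature_names_out())   # column order of the tokens
print(matrix)                       # one row per text, one column per token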

Answered by Far

The bag-of-words model is a nice method of text representation that can be applied to different machine learning tasks. But as a first step you need to clean the data of unnecessary parts, for example punctuation, HTML tags, stop-words, and so on. For these tasks you can easily use libraries like Beautiful Soup (to remove HTML markup) or NLTK (to remove stop-words) in Python. After cleaning your data you need to create vector features (a numerical representation of the data for machine learning), and this is where bag-of-words plays its role. scikit-learn has a module (the feature_extraction module) which can help you create the bag-of-words features.

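To make those steps concrete, here is a minimal sketch of that pipeline (assuming beautifulsoup4, nltk with its 'stopwords' corpus downloaded, and scikit-learn are installed; the example documents are made up):

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = ["<p>John likes to watch movies. Mary likes movies too.</p>",
            "<p>John also likes to watch football games.</p>"]

stop_words = set(stopwords.words('english'))

def clean(doc):
    text = BeautifulSoup(doc, "html.parser").get_text()  # strip HTML markup
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()       # keep letters only
    return ' '.join(w for w in text.split() if w not in stop_words)

vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(clean(d) for d in raw_docs)
print(vectorizer.get_feature_names_out())
print(bag.toarray())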

You may find everything you need, in detail, in this tutorial; this one can also be very helpful. I found both of them very useful.


Answered by Jivan

As others already mentioned, using nltk would be your best option if you want something stable and scalable. It's highly configurable.


However, it has the downside of a quite steep learning curve if you want to tweak the defaults.


I once encountered a situation where I wanted to have a bag of words. The problem was that it concerned articles about technologies with exotic names full of -, _, etc., such as vue-router or _.js.


The default configuration of nltk's word_tokenize is to split vue-router into two separate words, vue and router, for instance. I'm not even talking about _.js.


So for what it's worth, I ended up writing this little routine to get all the words tokenized into a list, based on my own punctuation criteria.


import re

# Split on spaces and chosen punctuation; a raw string avoids invalid-escape warnings
punctuation_pattern = r' |\.$|\. |, |\/|\(|\)|\'|\"|\!|\?|\+'
text = "This article is talking about vue-router. And also _.js."
ltext = text.lower()
# Keep only the non-empty tokens produced by the split
wtext = [w for w in re.split(punctuation_pattern, ltext) if w]

print(wtext)
# ['this', 'article', 'is', 'talking', 'about', 'vue-router', 'and', 'also', '_.js']

This routine can easily be combined with Paddy3118's answer about collections.Counter, which could tell you, for instance, how many times _.js was mentioned in the article.

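For what it's worth, a quick sketch of that combination, reusing the wtext list from above:

import collections

counts = collections.Counter(wtext)   # token -> number of occurrences
print(counts['_.js'])                 # 1
print(counts.most_common(3))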

Answered by Jivan

From the book "Python Machine Learning":


import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(['blablablatext'])   # your documents go here
bag = count.fit_transform(docs)      # sparse document-term count matrix
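
A short follow-up sketch, reusing the count and bag objects above, to inspect what was built:

print(count.vocabulary_)   # token -> column index mapping
print(bag.toarray())       # dense document-term count matrix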

Answered by Pramit

Bag of words could be defined as a matrix where each row represents a document and each column represents an individual token. One more thing: the sequential order of the text is not maintained. Building a "Bag of Words" involves three steps:


  1. tokenizing
  2. counting
  3. normalizing

Limitations to keep in mind:

  1. It cannot capture phrases or multi-word expressions.
  2. It is sensitive to misspellings, though it is possible to work around that using a spell corrector or a character representation.


e.g.


from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.",
               "John also likes to watch football games."]
X = vectorizer.fit_transform(data_corpus)   # sparse document-term matrix
print(X.toarray())                          # counts per document
print(vectorizer.get_feature_names_out())   # get_feature_names() was removed in newer scikit-learn