Python: How to get a bag of words from textual data?

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/15507172/


How to get bag of words from textual data?

python, machine-learning, text-processing

Asked by hshed

I am working on a prediction problem using a large textual dataset, and I am implementing a Bag of Words model.


What is the best way to get the bag of words? Right now, I have tf-idf scores for the various words, and the number of words is too large to use for further assignments. If I use the tf-idf criterion, what should the tf-idf threshold be for getting the bag of words? Or should I use some other algorithm? I am using Python.

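For reference, a minimal sketch (assuming scikit-learn is available; the parameter values and example documents are only illustrative) that prunes the vocabulary while building tf-idf features, instead of picking a raw tf-idf threshold by hand:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["John likes to watch movies. Mary likes movies too.",
        "John also likes to watch football games."]

vectorizer = TfidfVectorizer(
    max_features=10000,   # keep at most the N most frequent terms
    min_df=2,             # drop terms that appear in fewer than 2 documents
    max_df=0.95,          # drop terms that appear in more than 95% of documents
)
X = vectorizer.fit_transform(docs)
print(X.shape)            # (number of documents, size of pruned vocabulary)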

Answered by Paddy3118

Using the collections.Counter class


>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
   'John also likes to watch football games.']
>>> bagsofwords = [ collections.Counter(re.findall(r'\w+', txt))
            for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
>>> 
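
If a fixed-length feature matrix is needed downstream, one possible follow-up (a sketch, not part of the original answer, assuming scikit-learn is installed) is to feed these Counter objects to DictVectorizer:

from sklearn.feature_extraction import DictVectorizer

# Turn the list of Counter objects into a document-term matrix
dv = DictVectorizer(sparse=False)
matrix = dv.fit_transform(bagsofwords)
print(dv.get_feature_names_out())   # column order of the tokens
print(matrix)                       # one row per text, one column per token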

Answered by Far

The bag-of-words model is a nice method of text representation that can be applied to different machine learning tasks. But as a first step you need to clean the data of unnecessary parts, for example punctuation, HTML tags, stop-words, and so on. For these tasks you can easily use libraries like Beautiful Soup (to remove HTML markup) or NLTK (to remove stop-words) in Python. After cleaning your data you need to create vector features (a numerical representation of the data for machine learning), and this is where bag-of-words plays its role. scikit-learn has a module (the feature_extraction module) which can help you create the bag-of-words features.

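To make those steps concrete, here is a minimal sketch of that pipeline (assuming beautifulsoup4, nltk with its 'stopwords' corpus downloaded, and scikit-learn are installed; the example documents are made up):

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = ["<p>John likes to watch movies. Mary likes movies too.</p>",
            "<p>John also likes to watch football games.</p>"]

stop_words = set(stopwords.words('english'))

def clean(doc):
    text = BeautifulSoup(doc, "html.parser").get_text()  # strip HTML markup
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()       # keep letters only
    return ' '.join(w for w in text.split() if w not in stop_words)

vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(clean(d) for d in raw_docs)
print(vectorizer.get_feature_names_out())
print(bag.toarray())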

You may find everything you need, in detail, in this tutorial; this one can also be very helpful. I found both of them very useful.


Answered by Jivan

As others already mentioned, using nltk would be your best option if you want something stable and scalable. It's highly configurable.


However, it has the downside of a quite steep learning curve if you want to tweak the defaults.


I once encountered a situation where I wanted to have a bag of words. The problem was that it concerned articles about technologies with exotic names full of -, _, etc., such as vue-router or _.js.


The default configuration of nltk's word_tokenize is to split vue-router into two separate words, vue and router, for instance. I'm not even talking about _.js.


So for what it's worth, I ended up writing this little routine to get all the words tokenized into a list, based on my own punctuation criteria.


import re

# Split on spaces and chosen punctuation; a raw string avoids invalid-escape warnings
punctuation_pattern = r' |\.$|\. |, |\/|\(|\)|\'|\"|\!|\?|\+'
text = "This article is talking about vue-router. And also _.js."
ltext = text.lower()
# Keep only the non-empty tokens produced by the split
wtext = [w for w in re.split(punctuation_pattern, ltext) if w]

print(wtext)
# ['this', 'article', 'is', 'talking', 'about', 'vue-router', 'and', 'also', '_.js']

This routine can easily be combined with Paddy3118's answer about collections.Counter, which could tell you, for instance, how many times _.js was mentioned in the article.

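For what it's worth, a quick sketch of that combination, reusing the wtext list from above:

import collections

counts = collections.Counter(wtext)   # token -> number of occurrences
print(counts['_.js'])                 # 1
print(counts.most_common(3))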

Answered by Jivan

From the book "Python Machine Learning":


import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(['blablablatext'])   # your documents go here
bag = count.fit_transform(docs)      # sparse document-term count matrix
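
A short follow-up sketch, reusing the count and bag objects above, to inspect what was built:

print(count.vocabulary_)   # token -> column index mapping
print(bag.toarray())       # dense document-term count matrix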

Answered by Pramit

Bag of words could be defined as a matrix where each row represents a document and each column represents an individual token. One more thing: the sequential order of the text is not maintained. Building a "Bag of Words" involves three steps:


  1. tokenizing
  2. counting
  3. normalizing

Limitations to keep in mind:

  1. It cannot capture phrases or multi-word expressions.
  2. It is sensitive to misspellings, though it is possible to work around that using a spell corrector or a character representation.


e.g.


from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.",
               "John also likes to watch football games."]
X = vectorizer.fit_transform(data_corpus)   # sparse document-term matrix
print(X.toarray())                          # counts per document
print(vectorizer.get_feature_names_out())   # get_feature_names() was removed in newer scikit-learn