Combining text stemming and removal of punctuation in NLTK and scikit-learn

Disclaimer: this page is a mirror of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, attribute it to the original authors (not me), and link the original question: http://stackoverflow.com/questions/26126442/

python, text, scikit-learn, nltk

Asked by

I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization.

Below is an example of the plain usage of the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)

sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])

print('Vocabulary: %s' %vec.get_feature_names())  # note: newer scikit-learn versions use get_feature_names_out()
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which will print

Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]

Now, let's say I want to remove stop words and stem the words. One option would be to do it like so:

import nltk  # needed because tokenize() below calls nltk.word_tokenize
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vect = CountVectorizer(tokenizer=tokenize, stop_words='english') 

vect.fit(vocab)

sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])

print('Vocabulary: %s' %vect.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which prints:

Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]

But how would I best get rid of the punctuation characters in this second version?

Accepted answer by alvas

There are several options. You can try removing the punctuation before tokenization, but this would mean that don't -> dont:

import string

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
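
To make the don't -> dont effect concrete, here is a quick check (my own illustration, not part of the original answer) of what stripping the punctuation before tokenization does to a contraction:

text = "He doesn't swim."
print("".join([ch for ch in text if ch not in string.punctuation]))
# prints: He doesnt swim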

Or try removing punctuation after tokenization.

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems
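
With either version of tokenize, re-running the vectorization from the question should now drop the '.' entry from the vocabulary. A rough check (the expected output below is my own estimate, not output quoted from the original answer), assuming vocab, stem_tokens, and stemmer from the question are still in scope:

vect = CountVectorizer(tokenizer=tokenize, stop_words='english')
vect.fit(vocab)

print('Vocabulary: %s' %vect.get_feature_names())
# expected: ['like', 'swim', 'swimmer'] -- the '.' entry is gone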

EDITED

The above code will work, but it's rather slow because it loops through the same text multiple times:

  • Once to remove punctuation
  • Second time to tokenize
  • Third time to stem.

If you have more steps, like removing digits, removing stopwords, or lowercasing, it would be better to lump the steps together as much as possible.

Here are several better answers that are more efficient if your data requires more pre-processing steps:
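
As a rough illustration of lumping the steps together (a minimal sketch of my own, not one of the answers referred to above), the punctuation removal, lowercasing, stopword filtering, and stemming can all happen in a single pass over the tokens:

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))  # assumes the NLTK 'stopwords' (and 'punkt') data are downloaded

def tokenize_all_in_one(text):
    # tokenize once, then filter and stem each token inside the same loop
    stems = []
    for token in nltk.word_tokenize(text):
        token = token.lower()
        if token in string.punctuation or token in stop:
            continue
        stems.append(stemmer.stem(token))
    return stems

vect = CountVectorizer(tokenizer=tokenize_all_in_one)

This keeps a single loop over the tokens instead of three separate passes over the text.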