Combining text stemming and removal of punctuation in NLTK and scikit-learn

Disclaimer: this page is a mirror of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, attribute it to the original authors (not me), and link the original question: http://stackoverflow.com/questions/26126442/

python, text, scikit-learn, nltk

Asked by

I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization.

Below is an example of the plain usage of the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)

sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])

print('Vocabulary: %s' %vec.get_feature_names())  # note: newer scikit-learn versions use get_feature_names_out()
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which will print

Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]

Now, let's say I want to remove stop words and stem the words. One option would be to do it like so:

import nltk  # needed because tokenize() below calls nltk.word_tokenize
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vect = CountVectorizer(tokenizer=tokenize, stop_words='english') 

vect.fit(vocab)

sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])

print('Vocabulary: %s' %vect.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which prints:

Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]

But how would I best get rid of the punctuation characters in this second version?

Accepted answer by alvas

There are several options. You can try removing the punctuation before tokenization, but this would mean that don't -> dont:

import string

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
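
To make the don't -> dont effect concrete, here is a quick check (my own illustration, not part of the original answer) of what stripping the punctuation before tokenization does to a contraction:

text = "He doesn't swim."
print("".join([ch for ch in text if ch not in string.punctuation]))
# prints: He doesnt swim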

Or try removing punctuation after tokenization.

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems
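
With either version of tokenize, re-running the vectorization from the question should now drop the '.' entry from the vocabulary. A rough check (the expected output below is my own estimate, not output quoted from the original answer), assuming vocab, stem_tokens, and stemmer from the question are still in scope:

vect = CountVectorizer(tokenizer=tokenize, stop_words='english')
vect.fit(vocab)

print('Vocabulary: %s' %vect.get_feature_names())
# expected: ['like', 'swim', 'swimmer'] -- the '.' entry is gone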

EDITED

The above code will work, but it's rather slow because it loops through the same text multiple times:

  • Once to remove punctuation
  • Second time to tokenize
  • Third time to stem.

If you have more steps, like removing digits, removing stopwords, or lowercasing, it would be better to lump the steps together as much as possible.

Here are several better answers that are more efficient if your data requires more pre-processing steps:
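
As a rough illustration of lumping the steps together (a minimal sketch of my own, not one of the answers referred to above), the punctuation removal, lowercasing, stopword filtering, and stemming can all happen in a single pass over the tokens:

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))  # assumes the NLTK 'stopwords' (and 'punkt') data are downloaded

def tokenize_all_in_one(text):
    # tokenize once, then filter and stem each token inside the same loop
    stems = []
    for token in nltk.word_tokenize(text):
        token = token.lower()
        if token in string.punctuation or token in stop:
            continue
        stems.append(stemmer.stem(token))
    return stems

vect = CountVectorizer(tokenizer=tokenize_all_in_one)

This keeps a single loop over the tokens instead of three separate passes over the text.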