Python stemming (with pandas DataFrame)
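Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must comply with the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/37443138/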
Asked by Chiel
I created a DataFrame of sentences to be stemmed. I would like to use NLTK's SnowballStemmer to improve the accuracy of my classification algorithm. How can I achieve this?
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
# Use English stemmer.
stemmer = SnowballStemmer("english")
# Sentences to be stemmed.
data = ["programers program with programing languages", "my code is working so there must be a bug in the interpreter"]
# Create the pandas DataFrame.
df = pd.DataFrame(data, columns=['unstemmed'])
# Split the sentences to lists of words.
df['unstemmed'] = df['unstemmed'].str.split()
# Make sure we see the full column.
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated since pandas 1.0.
# Print dataframe.
df
+----+--------------------------------------------------------------+
| | unstemmed |
|----+--------------------------------------------------------------|
| 0 | ['programers', 'program', 'with', 'programing', 'languages'] |
| 1 | ['my', 'code', 'is', 'working', 'so', 'there', 'must', |
| | 'be', 'a', 'bug', 'in', 'the', 'interpreter'] |
+----+--------------------------------------------------------------+
Answered by arthur
Apply the stemmer to each word in every row and store the result in a new "stemmed" column.
df['stemmed'] = df['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.
df = df.drop(columns=['unstemmed']) # Get rid of the unstemmed column.
df # Print dataframe.
+----+--------------------------------------------------------------+
| | stemmed |
|----+--------------------------------------------------------------|
| 0 | ['program', 'program', 'with', 'program', 'languag'] |
| 1 | ['my', 'code', 'is', 'work', 'so', 'there', 'must', |
| | 'be', 'a', 'bug', 'in', 'the', 'interpret'] |
+----+--------------------------------------------------------------+
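If the classifier's feature extractor expects plain strings rather than lists of tokens, the same per-word stemming can be done on the raw sentences and the stems joined back together. The following is only a minimal sketch of that idea: the helper `stem_sentence` and the column names `text` and `stemmed_text` are hypothetical names chosen for illustration and are not part of the original question or answer.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Same English Snowball stemmer as in the question.
stemmer = SnowballStemmer("english")


def stem_sentence(sentence):
    """Stem each whitespace-separated word and rejoin the stems with spaces.

    Hypothetical helper for illustration only.
    """
    return " ".join(stemmer.stem(word) for word in sentence.split())


# Hypothetical column names; the sentences are the ones from the question.
df = pd.DataFrame(
    {
        "text": [
            "programers program with programing languages",
            "my code is working so there must be a bug in the interpreter",
        ]
    }
)

# Apply the helper row by row, keeping the original column alongside the result.
df["stemmed_text"] = df["text"].apply(stem_sentence)
print(df["stemmed_text"].tolist())
```

Keeping the original column next to the stemmed one makes it easy to compare the two; it can be dropped afterwards with `df.drop(columns=['text'])` as in the answer above.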