Lemmatization of all pandas cells
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/47557563/
Asked by james
I have a pandas dataframe. There is one column, let's name it 'col'. Each entry of this column is a list of words: ['word1', 'word2', etc.]
How can I efficiently compute the lemma of all of those words using the nltk library?
import nltk
nltk.stem.WordNetLemmatizer().lemmatize('word')
I want to be able to find the lemma of every word in every cell of one column of a pandas dataframe.
My data looks similar to:
import pandas as pd
data = [[['walked','am','stressed','Fruit']],[['going','gone','walking','riding','running']]]
df = pd.DataFrame(data,columns=['col'])
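An aside that is not part of the original question: WordNetLemmatizer.lemmatize defaults to treating every token as a noun (pos='n'), so verb forms such as 'walked' or 'running' come back unchanged unless a part-of-speech tag is passed. A minimal illustration:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('walked'))            # 'walked' (default pos='n')
print(lemmatizer.lemmatize('walked', pos='v'))   # 'walk'
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'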
Accepted answer by titipata
You can use apply from pandas with a function that lemmatizes each word in the given string. Note that there are many ways to tokenize your text. You might have to remove symbols like '.' if you use a whitespace tokenizer.
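For instance (an illustrative aside, not part of the original answer), a whitespace tokenizer leaves punctuation attached to the neighbouring token, which is why such symbols may need to be stripped first:

import nltk

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
print(w_tokenizer.tokenize('this was cheesy.'))  # ['this', 'was', 'cheesy.']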
Below, I give an example of how to lemmatize a column of an example dataframe.
import nltk
import pandas as pd

# nltk.download('wordnet')  # may be needed once to fetch the WordNet data
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Split on whitespace, then lemmatize each token
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great'], columns=['text'])
df['text_lemmatized'] = df.text.apply(lemmatize_text)
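Since the question's column holds lists of words rather than raw strings, the tokenizer step can be skipped and the lemmatizer mapped directly over each list. A minimal sketch of that adaptation (not part of the original answer; df2 and lemmatize_tokens are illustrative names):

import nltk
import pandas as pd

lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_tokens(tokens):
    # Each cell already contains a token list, so lemmatize the elements directly
    return [lemmatizer.lemmatize(w) for w in tokens]

df2 = pd.DataFrame([[['walked', 'am', 'stressed', 'Fruit']],
                    [['going', 'gone', 'walking', 'riding', 'running']]],
                   columns=['col'])
df2['col_lemmatized'] = df2['col'].apply(lemmatize_tokens)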
Answered by Sahil Dahiya
col
['Sushi Bars', 'Restaurants']
['Burgers', 'Fast Food', 'Restaurants']
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
The code below creates a function which takes a list of words and returns a list of lemmatized words. This should work.
def lemmatize(s):
    '''Lemmatize each word in the list s.'''
    s = [wnl.lemmatize(word) for word in s]
    return s

dataset = dataset.assign(col_lemma=dataset.col.apply(lambda x: lemmatize(x)))
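For completeness, a hedged end-to-end sketch: the DataFrame construction below is only a guess at the 'dataset' shown above (the original post does not show how it was built), reusing the lemmatize helper with assign:

import pandas as pd
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatize(s):
    '''Lemmatize each word in the list s.'''
    return [wnl.lemmatize(word) for word in s]

# Illustrative reconstruction of the dataframe shown above
dataset = pd.DataFrame({'col': [['Sushi Bars', 'Restaurants'],
                                ['Burgers', 'Fast Food', 'Restaurants']]})
dataset = dataset.assign(col_lemma=dataset.col.apply(lambda x: lemmatize(x)))
print(dataset)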