pandas - Lemmatization of all pandas cells

Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow post: http://stackoverflow.com/questions/47557563/

Date: 2020-09-14 04:50:16  Source: igfitidea

Lemmatization of all pandas cells

Tags: python, pandas

Asked by james

I have a pandas DataFrame. There is one column, let's name it 'col'. Each entry of this column is a list of words: ['word1', 'word2', etc.]


How can I efficiently compute the lemma of all of those words using the nltk library?


import nltk
nltk.stem.WordNetLemmatizer().lemmatize('word')
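One practical note: the lemmatizer needs the WordNet data to be installed locally and raises a LookupError otherwise; it can be downloaded once (a standard nltk step, shown here for completeness):

import nltk
nltk.download('wordnet')  # one-time download of the corpus used by WordNetLemmatizer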

I want to be able to find the lemma of every word in every cell of one column of a pandas DataFrame.


My data looks similar to:


import pandas as pd
data = [[['walked','am','stressed','Fruit']],[['going','gone','walking','riding','running']]]
df = pd.DataFrame(data,columns=['col'])

Accepted answer by titipata

You can use apply from pandas with a function to lemmatize each word in a given string. Note that there are many ways to tokenize your text; you might have to remove symbols like '.' if you use a whitespace tokenizer (see the sketch below).
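For instance, a regex-based tokenizer keeps only word characters, so trailing punctuation such as '.' is dropped during tokenization (a minimal sketch; the choice of RegexpTokenizer here is an illustration, not part of the original answer):

import nltk

# Tokenize on runs of word characters; punctuation is simply not matched
regex_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
regex_tokenizer.tokenize('this was cheesy.')  # ['this', 'was', 'cheesy']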


Below, I give an example of how to lemmatize a column of an example dataframe.


import nltk
import pandas as pd

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Tokenize on whitespace, then lemmatize each token
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great'], columns=['text'])
df['text_lemmatized'] = df.text.apply(lemmatize_text)
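If, as in the question, the column already holds lists of tokens rather than raw strings, the tokenization step can be skipped and the lemmatizer applied to each list directly (a minimal sketch reusing the lemmatizer defined above; the column name 'col' follows the question's example, not this answer's):

# Assumes a DataFrame whose 'col' column contains lists of tokens, as in the question
df['col_lemmatized'] = df['col'].apply(lambda words: [lemmatizer.lemmatize(w) for w in words])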

Answer by Sahil Dahiya

Suppose the column looks like this:

col
['Sushi Bars', 'Restaurants']
['Burgers', 'Fast Food', 'Restaurants']

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

The code below creates a function which takes a list of words and returns a list of lemmatized words. This should work.


def lemmatize(s):
    '''Lemmatize each word in the list s.'''
    s = [wnl.lemmatize(word) for word in s]
    return s

dataset = dataset.assign(col_lemma=dataset.col.apply(lambda x: lemmatize(x)))
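Since lemmatize already takes the cell value as its only argument, the wrapping lambda is not strictly necessary; passing the function directly is equivalent (a small usage sketch, assuming the same dataset and column names as above):

# Equivalent to the line above, without the lambda
dataset = dataset.assign(col_lemma=dataset.col.apply(lemmatize))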