Lemmatization of all pandas cells
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/47557563/
Asked by james
I have a pandas dataframe. There is one column, let's name it 'col'. Each entry of this column is a list of words: ['word1', 'word2', etc.]
How can I efficiently compute the lemma of all of those words using the nltk library?
import nltk
nltk.stem.WordNetLemmatizer().lemmatize('word')
I want to be able to find the lemma of every word in every cell of one column of a pandas dataframe.
My data looks similar to:
import pandas as pd
data = [[['walked','am','stressed','Fruit']],[['going','gone','walking','riding','running']]]
df = pd.DataFrame(data,columns=['col'])
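An aside that is not part of the original question: WordNetLemmatizer.lemmatize defaults to treating every token as a noun (pos='n'), so verb forms such as 'walked' or 'running' come back unchanged unless a part-of-speech tag is passed. A minimal illustration:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('walked'))            # 'walked' (default pos='n')
print(lemmatizer.lemmatize('walked', pos='v'))   # 'walk'
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'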
Accepted answer by titipata
You can use apply from pandas with a function that lemmatizes each word in the given string. Note that there are many ways to tokenize your text. You might have to remove symbols like '.' if you use a whitespace tokenizer.
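For instance (an illustrative aside, not part of the original answer), a whitespace tokenizer leaves punctuation attached to the neighbouring token, which is why such symbols may need to be stripped first:

import nltk

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
print(w_tokenizer.tokenize('this was cheesy.'))  # ['this', 'was', 'cheesy.']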
Below, I give an example of how to lemmatize a column of an example dataframe.
import nltk
import pandas as pd

# nltk.download('wordnet')  # may be needed once to fetch the WordNet data
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Split on whitespace, then lemmatize each token
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great'], columns=['text'])
df['text_lemmatized'] = df.text.apply(lemmatize_text)
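Since the question's column holds lists of words rather than raw strings, the tokenizer step can be skipped and the lemmatizer mapped directly over each list. A minimal sketch of that adaptation (not part of the original answer; df2 and lemmatize_tokens are illustrative names):

import nltk
import pandas as pd

lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_tokens(tokens):
    # Each cell already contains a token list, so lemmatize the elements directly
    return [lemmatizer.lemmatize(w) for w in tokens]

df2 = pd.DataFrame([[['walked', 'am', 'stressed', 'Fruit']],
                    [['going', 'gone', 'walking', 'riding', 'running']]],
                   columns=['col'])
df2['col_lemmatized'] = df2['col'].apply(lemmatize_tokens)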
Answered by Sahil Dahiya
col
['Sushi Bars', 'Restaurants']
['Burgers', 'Fast Food', 'Restaurants']
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
The code below creates a function which takes a list of words and returns a list of lemmatized words. This should work.
def lemmatize(s):
    '''Lemmatize each word in the list s.'''
    s = [wnl.lemmatize(word) for word in s]
    return s

dataset = dataset.assign(col_lemma=dataset.col.apply(lambda x: lemmatize(x)))
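For completeness, a hedged end-to-end sketch: the DataFrame construction below is only a guess at the 'dataset' shown above (the original post does not show how it was built), reusing the lemmatize helper with assign:

import pandas as pd
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatize(s):
    '''Lemmatize each word in the list s.'''
    return [wnl.lemmatize(word) for word in s]

# Illustrative reconstruction of the dataframe shown above
dataset = pd.DataFrame({'col': [['Sushi Bars', 'Restaurants'],
                                ['Burgers', 'Fast Food', 'Restaurants']]})
dataset = dataset.assign(col_lemma=dataset.col.apply(lambda x: lemmatize(x)))
print(dataset)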