Count distinct words from a Pandas Data Frame

Warning: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18936957/

Date: 2020-08-19 12:18:47  Source: igfitidea


python, text, pandas

Asked by ADJ

I've a Pandas data frame, where one column contains text. I'd like to get a list of unique words appearing across the entire column (space being the only split).


import pandas as pd

r1=['My nickname is ft.jgt','Someone is going to my place']

df=pd.DataFrame(r1,columns=['text'])

The output should look like this:


['my','nickname','is','ft.jgt','someone','going','to','place']

It wouldn't hurt to get a count as well, but it is not required.

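For reference, on a recent pandas (0.25 or newer, an assumption about your environment) the whole task can be sketched in one method chain using explode:

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# Lowercase, split on whitespace, flatten to one word per row (explode,
# pandas >= 0.25), then deduplicate preserving first-seen order.
unique_words = df['text'].str.lower().str.split().explode().unique().tolist()
print(unique_words)
# ['my', 'nickname', 'is', 'ft.jgt', 'someone', 'going', 'to', 'place']
```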

Accepted answer by Boud

Use a set to create the sequence of unique elements.


Do some clean-up on df to get the strings in lower case and split:


df['text'].str.lower().str.split()
Out[43]: 
0             [my, nickname, is, ft.jgt]
1    [someone, is, going, to, my, place]

Each list in this column can be passed to set.update to collect the unique values. Use apply to do so:


results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)

{'someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'}

Answered by Brionius

uniqueWords = list(set(" ".join(r1).lower().split(" ")))
count = len(uniqueWords)

Answered by Ofir Israel

Use collections.Counter:


>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> Counter(" ".join(r1).split(" ")).items()
[('Someone', 1), ('ft.jgt', 1), ('My', 1), ('is', 2), ('to', 1), ('going', 1), ('place', 1), ('my', 1), ('nickname', 1)]

Answered by EdChum

Building on @Ofir Israel's answer, specific to Pandas:


from collections import Counter
result = Counter(" ".join(df['text'].values.tolist()).split(" ")).items()
result

This will give you what you want: it converts the text column's values to a list, joins them, splits on spaces, and counts the instances.


Answered by cwharland

If you want to do it from the DataFrame construct:


import pandas as pd

r1=['My nickname is ft.jgt','Someone is going to my place']

df=pd.DataFrame(r1,columns=['text'])

df.text.apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0)

My          1
Someone     1
ft.jgt      1
going       1
is          2
my          1
nickname    1
place       1
to          1
dtype: float64
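On a recent pandas (0.25 or newer, an assumption about your environment) the same per-word totals can also be had without the lambda, via explode and value_counts:

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# One word per row via explode (pandas >= 0.25), then count occurrences.
counts = df['text'].str.split(' ').explode().value_counts()
print(counts['is'])  # 'is' appears in both sentences -> 2
```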

If you want more flexible tokenization, use nltk and its tokenizers.

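A minimal sketch of that idea, assuming nltk is installed; RegexpTokenizer is used here because it needs no corpus downloads (word_tokenize would also work after nltk.download('punkt')), and the regex keeping dots inside tokens is my own choice for this data:

```python
from collections import Counter

import pandas as pd
from nltk.tokenize import RegexpTokenizer

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# Keep dots inside words so 'ft.jgt' stays a single token.
tokenizer = RegexpTokenizer(r"[\w.]+")
tokens = df['text'].str.lower().apply(tokenizer.tokenize)
word_counts = Counter(word for row in tokens for word in row)
print(word_counts['is'])  # 2
```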

Answered by Rakesh Chaudhari

If the DataFrame has columns 'a', 'b', 'c', etc., and you want to count the distinct words in each column, then you could use:


Counter(dataframe['a']).items()
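The same idea extended to every column at once can be sketched as follows (the frame and its columns 'a' and 'b' here are illustrative, not from the original question):

```python
from collections import Counter

import pandas as pd

# Hypothetical frame with two text columns 'a' and 'b'.
df = pd.DataFrame({'a': ['x', 'y', 'x'], 'b': ['u', 'u', 'v']})

# One Counter per column: each maps a distinct value to its frequency.
per_column = {col: Counter(df[col]) for col in df.columns}
print(per_column['a'])  # Counter({'x': 2, 'y': 1})
```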

Answered by alvas

TL;DR


Use collections.Counter to get the counts of the unique words in a dataframe column (excluding stopwords).


Given:


$ cat test.csv 
Description
crazy mind california medical service data base...
california licensed producer recreational & medic...
silicon valley data clients live beyond status...
mycrazynotes inc. announces 4.6 million expans...
leading provider sustainable energy company prod ...
livefreecompany founded 2005, listed new york stock...

Code:


from collections import Counter
from string import punctuation

import pandas as pd

from nltk.corpus import stopwords
from nltk import word_tokenize

stoplist = set(stopwords.words('english') + list(punctuation))

df = pd.read_csv("test.csv", sep='\t')

texts = df['Description'].str.lower()

word_counts = Counter(word_tokenize('\n'.join(texts)))

word_counts.most_common()

[out]:


[('...', 6), ('california', 2), ('data', 2), ('crazy', 1), ('mind', 1), ('medical', 1), ('service', 1), ('base', 1), ('licensed', 1), ('producer', 1), ('recreational', 1), ('&', 1), ('medic', 1), ('silicon', 1), ('valley', 1), ('clients', 1), ('live', 1), ('beyond', 1), ('status', 1), ('mycrazynotes', 1), ('inc.', 1), ('announces', 1), ('$', 1), ('144.6', 1), ('million', 1), ('expans', 1), ('leading', 1), ('provider', 1), ('sustainable', 1), ('energy', 1), ('company', 1), ('prod', 1), ('livefreecompany', 1), ('founded', 1), ('2005', 1), (',', 1), ('listed', 1), ('new', 1), ('york', 1), ('stock', 1)]

Answered by Ludecan

Adding to the discussion, here are the timings for three of the proposed solutions (skipping conversion to list) on a 92816 row dataframe:


from collections import Counter
results = set()

%timeit -n 10 set(" ".join(df['description'].values.tolist()).lower().split(" "))

323 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


%timeit -n 10 df['description'].str.lower().str.split(" ").apply(results.update)

316 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


%timeit -n 10 Counter(" ".join(df['description'].str.lower().values.tolist()).split(" "))

365 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


len(list(set(" ".join(df['description'].values.tolist()).lower().split(" "))))

13561


len(results)

13561


len(Counter(" ".join(df['description'].str.lower().values.tolist()).split(" ")).items())

13561


I tried the Pandas-only approach too, but it took far longer and used over 25 GB of RAM, making my 32 GB laptop swap.


All the others are pretty fast. I would use solution 1 since it is a one-liner, or solution 3 if word counts are needed.
