pandas: count the number of words in each row

Question

Asked by LMGagne

I'm trying to create a new column in a dataframe that contains the word count for the respective row. I'm looking for the total number of words, not the frequency of each distinct word. I assumed there would be a simple, quick way to do this common task, but after googling around and reading a handful of SO posts (1, 2, 3, 4) I'm stuck. I've tried the solutions put forward in the linked SO posts, but I get lots of attribute errors back.

words = df['col'].split()
df['totalwords'] = len(words)

results in

AttributeError: 'Series' object has no attribute 'split'

and

f = lambda x: len(x["col"].split()) -1
df['totalwords'] = df.apply(f, axis=1)

results in

AttributeError: ("'list' object has no attribute 'split'", 'occurred at index 0')

Answer 1

Answered by cs95

`str.split` + `str.len`

str.len works nicely for any non-numeric column.

df['totalwords'] = df['col'].str.split().str.len()
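
As a minimal sketch (not part of the original answer), the line above can be tried on a small, made-up dataframe. The sample sentences are assumptions for illustration; only the column name `col` comes from the question. Calling `str.split()` with no argument splits on any run of whitespace, so repeated spaces do not inflate the count and an empty string counts as zero words.

```python
import pandas as pd

# Hypothetical sample data; only the column name 'col' comes from the question.
df = pd.DataFrame({"col": ["This is one sentence", "and   another", ""]})

# str.split() with no separator splits on runs of whitespace, giving one list
# per row; str.len() then returns the length of each list.
df["totalwords"] = df["col"].str.split().str.len()

print(df)
```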

`str.count`

If your words are single-space separated, you may simply count the spaces plus 1.

df['totalwords'] = df['col'].str.count(' ') + 1
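
The single-space assumption matters. The sketch below (not from the original answer, with made-up sample data) shows how counting spaces over-counts when a row has a double space or is empty. One possible alternative, offered here as an assumption rather than the author's method, is to pass the regular expression `\S+` to `str.count` so that each run of non-whitespace characters is counted once.

```python
import pandas as pd

# Hypothetical sample data: the second row has a double space, the third is empty.
df = pd.DataFrame({"col": ["This is one sentence", "and  another", ""]})

# Counting spaces plus one is only right for single-space separated text.
df["by_spaces"] = df["col"].str.count(" ") + 1

# str.count takes a regular expression; r"\S+" matches each run of
# non-whitespace characters, i.e. each whitespace-delimited word.
df["by_regex"] = df["col"].str.count(r"\S+")

print(df)
```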

List Comprehension

This is faster than you think!

df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
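
To check the speed claim on your own data, a rough timing sketch like the one below may help. It is not from the original answer: the sample data, the helper function names, and the repeat count are all assumptions, and the timings will vary with the machine, the pandas version, and the length of the strings.

```python
import timeit

import pandas as pd

# Hypothetical data: two short sentences repeated to get 10,000 rows.
df = pd.DataFrame({"col": ["This is one sentence", "and another"] * 5000})


def with_str_accessor():
    # Vectorised string accessor, as in the str.split + str.len approach.
    return df["col"].str.split().str.len()


def with_list_comprehension():
    # Plain Python loop over the column's values.
    return [len(x.split()) for x in df["col"].tolist()]


# Both approaches should agree before their speed is compared.
assert with_str_accessor().tolist() == with_list_comprehension()

for fn in (with_str_accessor, with_list_comprehension):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```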

Answer 2

Answered by sacuL

Here is a way using .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split()))

example

Given this df:

>>> df
                    col
0  This is one sentence
1           and another

After applying .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split()))

>>> df
                    col  number_of_words
0  This is one sentence                4
1           and another                2

Note: As pointed out in the comments, and in this answer, .apply is not necessarily the fastest method. If speed is important, it is better to go with one of @cs95's methods.

Answer 3

Answered by jpp

This is one way using pd.Series.str.split and pd.Series.map:

df['word_count'] = df['col'].str.split().map(len)

The above assumes that df['col'] is a series of strings.

Example:

df = pd.DataFrame({'col': ['This is an example', 'This is another', 'A third']})

df['word_count'] = df['col'].str.split().map(len)

print(df)

#                   col  word_count
# 0  This is an example           4
# 1     This is another           3
# 2             A third           2
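
The answer assumes every value in df['col'] is a string. If the column has missing values, `.map(len)` would typically raise a `TypeError`, because a missing entry has no length. The sketch below is not part of the original answer and uses made-up data. It shows two possible workarounds: treat missing text as zero words, or keep the count missing by using `str.len()` and pandas' nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical data with one missing value in the text column.
df = pd.DataFrame({"col": ["This is an example", None, "A third"]})

# Option 1: replace missing text with an empty string, so it counts as 0 words.
df["word_count"] = df["col"].fillna("").str.split().map(len)

# Option 2: str.len() propagates missing values instead of raising; the
# nullable Int64 dtype keeps the remaining counts as integers.
df["word_count_nullable"] = df["col"].str.split().str.len().astype("Int64")

print(df)
```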

Answer 4

Answered by YOBEN_S

With list and map (data from cold):

list(map(lambda x : len(x.split()),df.col))
Out[343]: [4, 3, 2]
