pandas: count the number of words per row
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/49984905/
Count number of words per row
Asked by LMGagne
I'm trying to create a new column in a dataframe that contains the word count for the respective row. I'm looking for the total number of words, not the frequency of each distinct word. I assumed there would be a simple/quick way to do this common task, but after googling around and reading a handful of SO posts (1, 2, 3, 4) I'm stuck. I've tried the solutions put forward in the linked SO posts, but I get lots of attribute errors back.
words = df['col'].split()
df['totalwords'] = len(words)
results in
AttributeError: 'Series' object has no attribute 'split'
and
f = lambda x: len(x["col"].split()) -1
df['totalwords'] = df.apply(f, axis=1)
results in
AttributeError: ("'list' object has no attribute 'split'", 'occurred at index 0')
Answered by cs95
str.split + str.len
str.len works nicely for any non-numeric column.
df['totalwords'] = df['col'].str.split().str.len()
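A minimal sketch of how this chain behaves when a cell is missing, assuming a hypothetical frame (the data below is not from the original question). Missing values pass through both `.str` calls as NaN, so the raw result is a float column. Treating a missing cell as zero words is an assumption of this sketch, not part of the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two strings and one missing value.
df = pd.DataFrame({'col': ['This is one sentence', 'and another', np.nan]})

# .str.split() yields a list per row; .str.len() then measures each list.
# Missing values propagate as NaN, so the raw result is a float column.
raw = df['col'].str.split().str.len()

# Assumption of this sketch: a missing cell counts as zero words.
df['totalwords'] = raw.fillna(0).astype(int)
```

If missing cells should stay missing, the `raw` column can be kept as it is.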
str.count
If your words are separated by single spaces, you can simply count the spaces and add 1.
df['totalwords'] = df['col'].str.count(' ') + 1
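The shortcut counts separators, not words, so it only agrees with the split-based count when the single-space condition really holds. A small sketch with hypothetical strings that contain a double space and surrounding spaces:

```python
import pandas as pd

# Hypothetical strings: clean, double space, leading and trailing space.
s = pd.Series(['one two', 'one  two', ' one two '])

# Counts every space character and adds 1.
by_spaces = s.str.count(' ') + 1

# Splits on any run of whitespace and ignores leading/trailing whitespace.
by_split = s.str.split().str.len()

print(by_spaces.tolist())
print(by_split.tolist())
```

Only the first string gives the same number under both approaches.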
List Comprehension
This is faster than you think!
df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
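One way to check the speed claim on your own data is to time both approaches. This is only a rough sketch: the frame, its size, and the helper names `with_str_methods` and `with_list_comprehension` are made up for illustration, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical data: 10,000 short strings.
df = pd.DataFrame({'col': ['This is one sentence', 'and another'] * 5000})

def with_str_methods():
    # Vectorised string accessor, as in the first method above.
    return df['col'].str.split().str.len()

def with_list_comprehension():
    # Plain Python loop over the column values.
    return [len(x.split()) for x in df['col'].tolist()]

# Time 10 runs of each approach.
t_str = timeit.timeit(with_str_methods, number=10)
t_list = timeit.timeit(with_list_comprehension, number=10)

print(f'str methods:        {t_str:.4f} s')
print(f'list comprehension: {t_list:.4f} s')
```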
Answered by sacuL
Here is a way using .apply():
df['number_of_words'] = df.col.apply(lambda x: len(x.split()))
Example
Given this df:
>>> df
                    col
0  This is one sentence
1           and another
After applying .apply():
df['number_of_words'] = df.col.apply(lambda x: len(x.split()))
>>> df
                    col  number_of_words
0  This is one sentence                4
1           and another                2
Note: As pointed out in the comments and in this answer, .apply is not necessarily the fastest method. If speed is important, it is better to go with one of @cs95's methods.
Answered by jpp
This is one way using pd.Series.str.split and pd.Series.map:
df['word_count'] = df['col'].str.split().map(len)
The above assumes that df['col'] is a series of strings.
Example:
import pandas as pd

df = pd.DataFrame({'col': ['This is an example', 'This is another', 'A third']})
df['word_count'] = df['col'].str.split().map(len)
print(df)
#                   col  word_count
# 0  This is an example           4
# 1     This is another           3
# 2             A third           2
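The approach above relies on every cell being a string. A small sketch of checking that before counting, using a hypothetical mixed column. Counting non-string cells as zero words is an assumption made here for illustration, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical column mixing a string with a number.
s = pd.Series(['This is an example', 42], dtype=object)

# True only if every cell is a Python str.
all_strings = s.map(lambda v: isinstance(v, str)).all()

# Assumption of this sketch: non-string cells count as zero words.
counts = s.map(lambda v: len(v.split()) if isinstance(v, str) else 0)

print(all_strings)
print(counts.tolist())
```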
Answered by YOBEN_S
With list and map (data from cold):
list(map(lambda x : len(x.split()),df.col))
Out[343]: [4, 3, 2]
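The expression above returns a plain Python list, not a column. A minimal sketch of storing it in the frame, reusing the example data from jpp's answer. The column name `word_count` is borrowed from that answer and is an assumption here.

```python
import pandas as pd

# Example data taken from jpp's answer.
df = pd.DataFrame({'col': ['This is an example', 'This is another', 'A third']})

# map applies the lambda to each cell; list() materialises the results.
counts = list(map(lambda x: len(x.split()), df.col))

# The list has one entry per row, so it can be assigned as a new column.
df['word_count'] = counts
```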