Python: how to count the number of words in a string in a DataFrame?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/37483470/


How to calculate the number of words in a string in a DataFrame?

python pandas dataframe

Asked by Sergei

Suppose we have a simple DataFrame:

import pandas as pd
df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']

How can I count the number of words in each keyword and tally how often each word count occurs, similar to:

1 word: 2
2 words: 2
3 words: 1
4 words: 1

Answered by EdChum

If I understand correctly, you can do the following:

In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[89]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

Here we use the vectorised str.split to split on whitespace, then apply len to get the number of elements in each resulting list. We can then call value_counts to aggregate the frequency counts.

We then rename the index and sort it to get the desired output.
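To make the intermediate results visible, here is a minimal sketch of the same pipeline broken into stages. The variable names `tokens`, `lengths` and `freq` are illustrative and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Stage 1: split each string on whitespace, giving a list of words per row
tokens = df['fruits'].str.split()

# Stage 2: the length of each list is the word count for that row
lengths = tokens.apply(len)

# Stage 3: count how many rows share each word count, ordered by word count
freq = lengths.value_counts().sort_index()

print(lengths.tolist())      # [2, 1, 3, 4, 2, 1]
print(freq.index.tolist())   # [1, 2, 3, 4]
print(freq.tolist())         # [2, 2, 1, 1]
```

Reading the last two lines together: two entries have one word, two have two words, one has three and one has four, which matches the output requested in the question.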

UPDATE

This can also be done using str.len rather than apply, which should scale better:

In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[41]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

Timings (the second statement times only the length computation, without value_counts):

In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

1000 loops, best of 3: 799 μs per loop
1000 loops, best of 3: 347 μs per loop

For a 6K-row df:

In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop

Answered by Zero

You could use str.count with a space ' ' as the delimiter, which works when words are separated by single spaces:

In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

In [1717]: count.index = count.index.astype('str') + ' words:'

In [1718]: count
Out[1718]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64
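One caveat worth checking before relying on this approach: counting spaces matches counting words only when words are separated by exactly one space and there is no leading or trailing whitespace. The sketch below compares it with `str.split()`, which splits on any run of whitespace. The `messy` strings are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical inputs with irregular whitespace (not from the question)
messy = pd.Series(['one apple', 'one  apple', ' banana ', 'box of\toranges'])

by_spaces = messy.str.count(' ').add(1)   # literal space characters, plus one
by_split = messy.str.split().str.len()    # splits on any run of whitespace

print(by_spaces.tolist())  # [2, 3, 3, 2]
print(by_split.tolist())   # [2, 2, 1, 3]
```

On the question's data the two approaches agree, so the choice only matters if the real input may contain double spaces, padding or tabs.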


Timings

str.count is marginally faster.

Small

In [1724]: df.shape
Out[1724]: (6, 1)

In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 μs per loop

In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 μs per loop

Medium

In [1728]: df.shape
Out[1728]: (6000, 1)

In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop

In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop

Large

In [1732]: df.shape
Out[1732]: (60000, 1)

In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop

In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop
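The posts do not show how the larger frames were built. Assuming they were made by repeating the six example rows, a sketch for reproducing the comparison outside IPython with the standard `timeit` module might look like the following. The names `base`, `big` and `candidates` are illustrative, and absolute numbers will vary with machine and pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                                'pile of fruits outside', 'one banana', 'fruits']})

# Assumption: the larger frame is the six example rows repeated 1000 times
big = pd.concat([base] * 1000, ignore_index=True)   # 6000 rows

# Each candidate ends in value_counts so the comparison is like for like
candidates = {
    'split + apply(len)': lambda s: s.str.split().apply(len).value_counts(),
    'split + str.len':    lambda s: s.str.split().str.len().value_counts(),
    "count(' ') + 1":     lambda s: s.str.count(' ').add(1).value_counts(),
}

# Sanity check: compute each result once, sorted by word count
results = {name: fn(big['fruits']).sort_index() for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(lambda: fn(big['fruits']), number=20) / 20
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```

Comparing the entries of `results` before reading the timings guards against benchmarking variants that quietly compute different things.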