Python: How to count the number of words in strings in a DataFrame?

Question

Asked by Sergei

Suppose we have a simple DataFrame:

import pandas as pd

df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']

How can I count the number of words in each string and tally how many strings have each word count, similar to:

1 word: 2
2 words: 2
3 words: 1
4 words: 1

Answer 1

Answered by EdChum

If I understand correctly, you can do the following:

In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[89]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

Here we use the vectorised str.split to split on whitespace, then apply len to count the number of elements in each resulting list. We can then call value_counts to aggregate the frequency of each word count.

We then rename the index and sort it to get the desired output.
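The steps described above can be traced one at a time. This is a minimal sketch that assumes pandas is imported as pd; the intermediate names tokens and lengths are introduced here for illustration and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Step 1: split each string on whitespace, giving a list of words per row.
tokens = df['fruits'].str.split()

# Step 2: take the length of each list, i.e. the number of words in each row.
lengths = tokens.apply(len)

# Step 3: count how many rows share each word count, ordered by word count.
count = lengths.value_counts().sort_index()

print(lengths.tolist())  # [2, 1, 3, 4, 2, 1]
print(count)
```

Keeping the index numeric until the very end, as done here, means sort_index orders the rows by word count rather than alphabetically by label.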

UPDATE

This can also be done using str.len rather than apply, which should scale better:

In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[41]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64
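Apart from speed, the two variants may behave differently when the column contains missing values. The sketch below uses a hypothetical Series with one missing entry (not part of the original question) to illustrate that .str.len() passes the missing value through as NaN, whereas .apply(len) raises a TypeError.

```python
import pandas as pd

# Hypothetical data: a few of the original strings plus one missing value.
s = pd.Series(['one apple', 'banana', None, 'box of oranges'])

tokens = s.str.split()

# .str.len() propagates the missing value as NaN instead of raising.
lengths = tokens.str.len()

# .apply(len) calls len() on the missing value, which has no length.
try:
    tokens.apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True

# value_counts() drops NaN by default, so only the three real strings are tallied.
count = lengths.value_counts().sort_index()

print(apply_failed)
print(count)
```

If missing values should be counted as zero words instead, one option is to call fillna('') on the column before splitting.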

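Timings (note that the second statement omits the value_counts step, so the comparison is not strictly like-for-like):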

In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

1000 loops, best of 3: 799 μs per loop
1000 loops, best of 3: 347 μs per loop

For a 6K-row df:

In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop

Answer 2

Answered by Zero

You could use str.count with a space ' ' as the delimiter, adding 1 to turn the number of spaces into the number of words.

In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

In [1717]: count.index = count.index.astype('str') + ' words:'

In [1718]: count
Out[1718]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64
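Counting spaces matches the word count only when words are separated by exactly one space and there is no leading or trailing whitespace. The following sketch uses hypothetical strings (not from the original question) to compare it with str.split and with a regex-based count of non-whitespace runs, which is one possible alternative.

```python
import pandas as pd

# Hypothetical strings: one with a double space, one with leading/trailing spaces.
s = pd.Series(['box  of oranges', ' banana ', 'one apple'])

by_spaces = s.str.count(' ').add(1)   # counts every space character, then adds 1
by_split = s.str.split().str.len()    # splits on runs of whitespace
by_regex = s.str.count(r'\S+')        # counts runs of non-whitespace characters

print(by_spaces.tolist())  # [4, 3, 2]
print(by_split.tolist())   # [3, 1, 2]
print(by_regex.tolist())   # [3, 1, 2]
```

For clean input such as the example DataFrame, all three give the same result.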

Timings

str.count is marginally faster.

Small

In [1724]: df.shape
Out[1724]: (6, 1)

In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 μs per loop

In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 μs per loop

Medium

In [1728]: df.shape
Out[1728]: (6000, 1)

In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop

In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop

Large

In [1732]: df.shape
Out[1732]: (60000, 1)

In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop

In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop
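The measurements above come from IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The way the 6,000-row frame is built here (repeating the six strings 1,000 times) and the function names are assumptions, since the answers do not say how the larger frames were constructed; absolute numbers will vary by machine and pandas version.

```python
import timeit
import pandas as pd

base = ['one apple', 'banana', 'box of oranges',
        'pile of fruits outside', 'one banana', 'fruits']
# Assumption: build the 6,000-row frame by repeating the six example strings.
big = pd.DataFrame({'fruits': base * 1000})

def via_apply():
    return big['fruits'].str.split().apply(len).value_counts()

def via_str_len():
    return big['fruits'].str.split().str.len().value_counts()

def via_count():
    return big['fruits'].str.count(' ').add(1).value_counts()

for fn in (via_apply, via_str_len, via_count):
    # Best of 3 repeats with 10 calls each, reported as time per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {best * 1e3:.2f} ms per call')
```

All three functions include the value_counts step, so the timings compare the same end result.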
