Python: How to calculate number of words in a string in DataFrame?
Original question: http://stackoverflow.com/questions/37483470/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to calculate number of words in a string in DataFrame?
Asked by Sergei
Suppose we have a simple DataFrame:
import pandas as pd
df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']
How can we count the number of words in each keyword and tabulate how often each word count occurs, similar to:
1 word: 2
2 words: 2
3 words: 1
4 words: 1
Answered by EdChum
If I understand correctly, you can do the following:
In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count
Out[89]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
Here we use the vectorised str.split to split on whitespace, and then apply len to get the number of elements in each resulting list. We can then call value_counts to aggregate the frequency count.
We then rename the index and sort it to get the desired output.
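One detail worth noting about the rename-then-sort order: once the index holds strings, sort_index orders it lexicographically, so a label such as '10 words:' would sort before '2 words:'. The following is a minimal sketch, not part of the original answer, that sorts while the index is still numeric and also uses the singular "word" for a count of one, as in the question's desired output. The variable name n_words is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Number of whitespace-separated words in each row.
n_words = df['fruits'].str.split().apply(len)

# Frequency of each word count, sorted while the index is still numeric.
count = n_words.value_counts().sort_index()

# Build the labels afterwards, using the singular for a count of one.
count.index = [f"{n} word:" if n == 1 else f"{n} words:" for n in count.index]

print(count)
```

With the sample data the ordering is the same either way; the difference only shows up once some row has ten or more words.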
UPDATE
This can also be done using str.len rather than apply, which should scale better:
In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count
Out[41]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
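Timings (note that the second statement omits the value_counts step, so the two are not compared like for like):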
In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()
1000 loops, best of 3: 799 μs per loop
1000 loops, best of 3: 347 μs per loop
For a 6K-row df:
In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()
100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop
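To reproduce this kind of comparison outside IPython, something like the following sketch could be used. It is an assumption-laden illustration: the answer does not say how the 6K-row frame was built, so here the six sample rows are simply repeated 1000 times, and the helper names with_apply and with_str_len are made up for this example. Both variants include value_counts so that they do the same work, and absolute timings will differ by machine and pandas version.

```python
import timeit
import pandas as pd

base = ['one apple', 'banana', 'box of oranges',
        'pile of fruits outside', 'one banana', 'fruits']

# Assumed construction of a 6000-row frame: the sample rows repeated 1000 times.
big = pd.DataFrame({'fruits': base * 1000})

def with_apply():
    # Python-level len applied to each list of words.
    return big['fruits'].str.split().apply(len).value_counts()

def with_str_len():
    # Same computation using the .str.len accessor.
    return big['fruits'].str.split().str.len().value_counts()

# Both variants should agree before their timings are worth comparing.
assert with_apply().sort_index().tolist() == with_str_len().sort_index().tolist()

for fn in (with_apply, with_str_len):
    # Best of 3 repeats, 10 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```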
Answered by Zero
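You could use str.count with a space ' ' as the delimiter, adding 1 to turn the number of separators into the number of words. This assumes words are separated by exactly one space, with no leading or trailing spaces.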
In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)
In [1717]: count.index = count.index.astype('str') + ' words:'
In [1718]: count
Out[1718]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
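Counting separators gives the same result as splitting only when the text is cleanly spaced. The sketch below, which is an illustration rather than part of the original answer, compares three ways of counting on a small made-up Series containing a double space and surrounding spaces. Since str.count takes a regular expression, counting runs of non-whitespace with r'\S+' is one possible way to stay on str.count while matching what str.split() does.

```python
import pandas as pd

# Made-up inputs: clean, double space, leading/trailing space, clean.
s = pd.Series(['one apple', 'one  apple', ' banana ', 'box of oranges'])

by_space = s.str.count(' ').add(1)   # counts single spaces, then adds one
by_regex = s.str.count(r'\S+')       # counts runs of non-whitespace characters
by_split = s.str.split().str.len()   # splits on any run of whitespace

print(pd.DataFrame({'text': s, 'by_space': by_space,
                    'by_regex': by_regex, 'by_split': by_split}))
```

On the sample fruits column all three agree, because every keyword there uses single spaces and has no padding.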
Timings
str.count is marginally faster.
Small
In [1724]: df.shape
Out[1724]: (6, 1)
In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 μs per loop
In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 μs per loop
Medium
In [1728]: df.shape
Out[1728]: (6000, 1)
In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop
In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop
Large
In [1732]: df.shape
Out[1732]: (60000, 1)
In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop
In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop