How to create a bag of words from a pandas dataframe
Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/46360435/
Asked by Nabih Ibrahim Bawazir
Here's my dataframe
    CATEGORY           BRAND
0     Noodle        Anak Mas
1     Noodle        Anak Mas
2     Noodle         Indomie
3     Noodle         Indomie
4     Noodle         Indomie
23    Noodle         Indomie
24    Noodle  Mi Telor Cap 3
25    Noodle  Mi Telor Cap 3
26    Noodle         Pop Mie
27    Noodle         Pop Mie
...
I have already made sure that the df columns are strings. My code is:
df = data[['CATEGORY', 'BRAND']].astype(str)
import collections, re
texts = df
bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
               for txt in texts]
sumbags = sum(bagsofwords, collections.Counter())
When I evaluate
sumbags
The output is
Counter({'BRAND': 1, 'CATEGORY': 1})
I want sumbags to count every word in the data, excluding the column headers. To be clear, something like:
Counter({'Noodle': 10, 'Indomie': 4, 'Anak': 2, ....}) # because it is bag of words
I need a count for every single word.
Accepted answer by Zero
IIUC, use one of the following options.
Option 1] NumPy flatten and split
In [2535]: collections.Counter([y for x in df.values.flatten() for y in x.split()])
Out[2535]:
Counter({'3': 2,
'Anak': 2,
'Cap': 2,
'Indomie': 4,
'Mas': 2,
'Mi': 2,
'Mie': 2,
'Noodle': 10,
'Pop': 2,
'Telor': 2})
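For readers who want to run Option 1 outside an IPython session, here is a minimal self-contained sketch. The dataframe is rebuilt by hand from the ten rows printed in the question (whatever lies behind the `...` is not reproduced), so its construction is an assumption and not the asker's actual `data`.

```python
import collections

import pandas as pd

# Assumption: only the ten rows printed in the question are rebuilt here.
df = pd.DataFrame(
    {
        "CATEGORY": ["Noodle"] * 10,
        "BRAND": (
            ["Anak Mas"] * 2
            + ["Indomie"] * 4
            + ["Mi Telor Cap 3"] * 2
            + ["Pop Mie"] * 2
        ),
    },
    index=[0, 1, 2, 3, 4, 23, 24, 25, 26, 27],
).astype(str)

# df.values is a 2-D array of cells; flatten() makes it 1-D, and
# str.split() breaks each cell into whitespace-separated words.
words = [word for cell in df.values.flatten() for word in cell.split()]
sumbags = collections.Counter(words)

print(sumbags["Noodle"], sumbags["Indomie"], sumbags["Anak"])  # prints: 10 4 2
```

Because the comprehension walks over the cell values, the column headers never enter the counter.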
Option 2] Use value_counts()
In [2536]: pd.Series([y for x in df.values.flatten() for y in x.split()]).value_counts()
Out[2536]:
Noodle 10
Indomie 4
Mie 2
Pop 2
Anak 2
Mi 2
Cap 2
Telor 2
Mas 2
3 2
dtype: int64
Option 3] Use stack and value_counts
In [2582]: df.apply(lambda x: x.str.split(expand=True).stack()).stack().value_counts()
Out[2582]:
Noodle 10
Indomie 4
Mie 2
Pop 2
Anak 2
Mi 2
Cap 2
Telor 2
Mas 2
3 2
dtype: int64
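Option 3 stays entirely within pandas. A shorter variant of the same idea, not part of the original answer, swaps the nested `stack` calls for `DataFrame.melt` and `Series.explode` (the latter assumes pandas 0.25 or newer). The sketch below rebuilds only the ten rows printed in the question, which is an assumption about the data.

```python
import pandas as pd

# Assumption: only the ten rows printed in the question are rebuilt here.
df = pd.DataFrame(
    {
        "CATEGORY": ["Noodle"] * 10,
        "BRAND": (
            ["Anak Mas"] * 2
            + ["Indomie"] * 4
            + ["Mi Telor Cap 3"] * 2
            + ["Pop Mie"] * 2
        ),
    }
).astype(str)

# melt() puts every cell into a single "value" column, str.split() turns
# each cell into a list of words, and explode() gives each word its own row.
tokens = df.melt()["value"].str.split().explode()
counts = tokens.value_counts()

print(counts["Noodle"], counts["Cap"])  # prints: 10 2
```

As with Option 2, the result is a Series sorted by count in descending order; the order among words with equal counts is not something to rely on.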
Details
In [2516]: df
Out[2516]:
    CATEGORY           BRAND
0     Noodle        Anak Mas
1     Noodle        Anak Mas
2     Noodle         Indomie
3     Noodle         Indomie
4     Noodle         Indomie
23    Noodle         Indomie
24    Noodle  Mi Telor Cap 3
25    Noodle  Mi Telor Cap 3
26    Noodle         Pop Mie
27    Noodle         Pop Mie
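For context on why the attempt in the question returned only `Counter({'BRAND': 1, 'CATEGORY': 1})`: iterating directly over a DataFrame yields its column labels, not its cells. The following sketch illustrates this with a hypothetical two-row dataframe (not the asker's data) and shows that the same regex-based counting works once it iterates over the cell values.

```python
import collections
import re

import pandas as pd

# Hypothetical two-row sample, used only to illustrate the behaviour.
df = pd.DataFrame(
    {"CATEGORY": ["Noodle", "Noodle"], "BRAND": ["Anak Mas", "Indomie"]}
).astype(str)

# Iterating over a DataFrame yields the column labels.
print(list(df))  # prints: ['CATEGORY', 'BRAND']

# So the comprehension from the question only ever sees the two headers.
header_bags = [collections.Counter(re.findall(r"\w+", txt)) for txt in df]
header_counts = sum(header_bags, collections.Counter())

# Iterating over the flattened cell values gives the intended bag of words.
cell_bags = [
    collections.Counter(re.findall(r"\w+", txt)) for txt in df.values.flatten()
]
cell_counts = sum(cell_bags, collections.Counter())

print(cell_counts["Noodle"], cell_counts["Anak"])  # prints: 2 1
```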