Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/46360435/


How to create a bag of words from a pandas dataframe

Tags: python, pandas

Asked by Nabih Ibrahim Bawazir

Here's my DataFrame:

   CATEGORY           BRAND
0    Noodle        Anak Mas
1    Noodle        Anak Mas
2    Noodle         Indomie
3    Noodle         Indomie
4    Noodle         Indomie
23   Noodle         Indomie
24   Noodle  Mi Telor Cap 3
25   Noodle  Mi Telor Cap 3
26   Noodle         Pop Mie
27   Noodle         Pop Mie
...

I have already made sure that the df columns are strings. My code is:

df = data[['CATEGORY', 'BRAND']].astype(str)
import collections, re
texts = df
bagsofwords = [ collections.Counter(re.findall(r'\w+', txt))
            for txt in texts]
sumbags = sum(bagsofwords, collections.Counter())

When I call

sumbags

The output is

 Counter({'BRAND': 1, 'CATEGORY': 1})

I want sumbags to count all of the words in the data, excluding the column headers. To be clear, something like:

Counter({'Noodle': 10, 'Indomie': 4, 'Anak': 2, ....}) # because it is bag of words

I need a count for every single word.

Accepted answer by Zero

If I understand it correctly, use one of the following options.

Option 1] NumPy flatten and split

In [2535]: collections.Counter([y for x in df.values.flatten() for y in x.split()])
Out[2535]:
Counter({'3': 2,
         'Anak': 2,
         'Cap': 2,
         'Indomie': 4,
         'Mas': 2,
         'Mi': 2,
         'Mie': 2,
         'Noodle': 10,
         'Pop': 2,
         'Telor': 2})
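
For anyone who wants to run Option 1 outside of an IPython session, here is a minimal self-contained sketch. It assumes the DataFrame can be rebuilt by hand from the ten rows shown in the question; the `brands` list and the explicit index are my own reconstruction, not part of the original post.

```python
import collections

import pandas as pd

# Assumption: rebuild the ten rows shown in the question by hand.
brands = (
    ["Anak Mas"] * 2
    + ["Indomie"] * 4
    + ["Mi Telor Cap 3"] * 2
    + ["Pop Mie"] * 2
)
df = pd.DataFrame(
    {"CATEGORY": ["Noodle"] * len(brands), "BRAND": brands},
    index=[0, 1, 2, 3, 4, 23, 24, 25, 26, 27],
).astype(str)

# Flatten every cell into one 1-D array, then split each cell on whitespace.
words = [word for cell in df.values.flatten() for word in cell.split()]
sumbags = collections.Counter(words)

print(sumbags)
```

The key point is that the loop runs over the cell values (`df.values.flatten()`), not over `df` itself.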

Option 2] Use value_counts()

In [2536]: pd.Series([y for x in df.values.flatten() for y in x.split()]).value_counts()
Out[2536]:
Noodle     10
Indomie     4
Mie         2
Pop         2
Anak        2
Mi          2
Cap         2
Telor       2
Mas         2
3           2
dtype: int64
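
A possible variant of Option 2, not from the original answer: the list comprehension can be swapped for the vectorized `Series.str.split()` followed by `Series.explode()`. This sketch assumes pandas 0.25 or newer (where `explode` is available) and a hand-built copy of the question's data.

```python
import pandas as pd

# Assumption: rebuild the ten rows shown in the question by hand.
brands = (
    ["Anak Mas"] * 2
    + ["Indomie"] * 4
    + ["Mi Telor Cap 3"] * 2
    + ["Pop Mie"] * 2
)
df = pd.DataFrame(
    {"CATEGORY": ["Noodle"] * len(brands), "BRAND": brands}
).astype(str)

# One Series holding every cell, then one list of words per cell,
# then one row per word, then the frequency of each word.
cells = pd.Series(df.values.flatten())
counts = cells.str.split().explode().value_counts()

print(counts)
```

The order of words that share the same count may differ from the output shown above.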

Option 3] Use stack and value_counts

In [2582]: df.apply(lambda x: x.str.split(expand=True).stack()).stack().value_counts()
Out[2582]:
Noodle     10
Indomie     4
Mie         2
Pop         2
Anak        2
Mi          2
Cap         2
Telor       2
Mas         2
3           2
dtype: int64


Details

In [2516]: df
Out[2516]:
   CATEGORY           BRAND
0    Noodle        Anak Mas
1    Noodle        Anak Mas
2    Noodle         Indomie
3    Noodle         Indomie
4    Noodle         Indomie
23   Noodle         Indomie
24   Noodle  Mi Telor Cap 3
25   Noodle  Mi Telor Cap 3
26   Noodle         Pop Mie
27   Noodle         Pop Mie
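
As a side note on why the code in the question returned only Counter({'BRAND': 1, 'CATEGORY': 1}): iterating directly over a DataFrame yields its column labels, not its cells, so `for txt in texts` only ever saw the two header strings. The sketch below illustrates this on a hand-built copy of the data (my own reconstruction) and keeps the question's regex approach, changing only what is iterated over.

```python
import collections
import re

import pandas as pd

# Assumption: rebuild the ten rows shown in the question by hand.
brands = (
    ["Anak Mas"] * 2
    + ["Indomie"] * 4
    + ["Mi Telor Cap 3"] * 2
    + ["Pop Mie"] * 2
)
df = pd.DataFrame(
    {"CATEGORY": ["Noodle"] * len(brands), "BRAND": brands}
).astype(str)

# Iterating over a DataFrame gives the column labels only.
labels = list(df)
print(labels)

# Iterating over the flattened cell values gives the actual text.
bagsofwords = [
    collections.Counter(re.findall(r"\w+", txt)) for txt in df.values.flatten()
]
sumbags = sum(bagsofwords, collections.Counter())
print(sumbags)
```

For this data `re.findall(r'\w+', txt)` and `txt.split()` produce the same words; they would differ on text that contains punctuation.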