Python: Counting occurrences of certain words in a pandas DataFrame

Question

Asked by Nilani Algiriyage

I want to count the number of occurrences of certain words in a data frame. I know how to do this using "str.contains":

a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
n = a.apply(lambda x: 1).sum()

Currently I'm using the above code. Is there a method that matches a regular expression and returns the count of occurrences? In my case I have a large dataframe and I want to match around 100 strings.

Answer 1

Accepted answer by Andy Hayden

Update: The original answer (further down) counts the rows that contain a substring.

To count all the occurrences of a substring you can use .str.count:

In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

In [22]: df.words.str.count("he|wo")
Out[22]:
0    1
1    1
2    2
Name: words, dtype: int64

In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4
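
For the case in the question (around 100 strings), one option is to build the pattern from a list. The following is a minimal sketch, not part of the original answer: the `terms` list is a hypothetical stand-in for the real search strings, and `re.escape` is used on the assumption that the terms should be matched literally rather than as regular expressions.

```python
import re
import pandas as pd

df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

# Hypothetical search terms; in practice this list could hold ~100 strings.
terms = ['he', 'wo']

# Escape each term so regex metacharacters are matched literally,
# then join the terms into a single alternation pattern.
pattern = '|'.join(re.escape(t) for t in terms)

# Occurrences per row, then summed over the whole column.
per_row = df['words'].str.count(pattern)
total = int(per_row.sum())

# Per-term totals, counting each term separately.
per_term = {t: int(df['words'].str.count(re.escape(t)).sum()) for t in terms}

print(total)     # 4
print(per_term)  # {'he': 3, 'wo': 1}
```

If some terms are prefixes of others, the order of the alternation affects which one matches, so sorting the terms longest-first before joining may be worth considering.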

The str.contains method accepts a regular expression:

Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.

For example:

In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

In [12]: df
Out[12]:
   words
0  hello
1  world

In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0    True
1    True
Name: words, dtype: bool

In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0    True
1    True
Name: words, dtype: bool

To count the rows that contain a match, you can just sum this boolean Series:

In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2

In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
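
The difference between counting matching rows and counting every occurrence can be seen side by side. This is an illustrative sketch with made-up data (the Series `s` is not from the question); it also uses the `case`, `flags`, and `na` parameters from the docstring above to handle mixed case and missing values.

```python
import re
import pandas as pd

# Illustrative data with mixed case and a missing value.
s = pd.Series(['hello', 'World', None, 'hehe'])

# na=False treats missing values as non-matches, so the mask is purely boolean.
mask = s.str.contains(r'he|wo', case=False, na=False)

# Number of rows with at least one match.
rows_matching = int(mask.sum())

# Total number of matches; the missing row yields NaN, which sum() skips.
occurrences = int(s.str.count(r'he|wo', flags=re.IGNORECASE).sum())

print(rows_matching)  # 3
print(occurrences)    # 4
```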

Answer 2

Answer by Dan Allan

To count the total number of matches, use s.str.match(...).str.get(0).count().

If your regex will be matching several unique words, to be tallied individually, use s.str.match(...).str.get(0).groupby(lambda x: x).count()

It works like this:

In [12]: s
Out[12]: 
0    ax
1    ay
2    bx
3    by
4    bz
dtype: object

The match string method handles regular expressions...

In [13]: s.str.match('(b[x-y]+)')
Out[13]: 
0       []
1       []
2    (bx,)
3    (by,)
4       []
dtype: object

...but the results, as returned, are not very convenient. The string method get extracts the matches as strings and converts empty results to NaNs...

In [14]: s.str.match('(b[x-y]+)').str.get(0)
Out[14]: 
0    NaN
1    NaN
2     bx
3     by
4    NaN
dtype: object

...which are not counted.

In [15]: s.str.match('(b[x-y]+)').str.get(0).count()
Out[15]: 2
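
Note that the output above reflects an older pandas release. In recent versions, str.match returns a boolean Series rather than tuples of groups, so the .str.get(0) chain may not work as written. A rough equivalent, sketched here with str.extract as a swapped-in technique (not the method from this answer), would be:

```python
import pandas as pd

s = pd.Series(['ax', 'ay', 'bx', 'by', 'bz'])

# With a single capture group and expand=False, extract returns a Series;
# rows without a match become NaN.
matched = s.str.extract(r'(b[x-y]+)', expand=False)

# count() ignores NaN, giving the total number of matching rows.
total_matches = int(matched.count())

# value_counts() tallies each distinct matched word individually.
per_word = matched.value_counts()

print(total_matches)  # 2
```

Unlike str.match, str.extract searches anywhere in the string rather than only at the start; for the data shown here the results coincide.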
