pandas Python: counting instances of a specific character across all rows of a dataframe column

Question

Asked by bluechips

I have a dataframe (df) containing columns ['toaddress', 'ccaddress', 'body']

I want to go through the dataframe and get the min, max, and average number of email addresses in the toaddress and ccaddress fields, determined by counting the instances of '@' within each field in those two columns.

If all else fails, I guess I could just use df.toaddress.str.contains(r'@').sum() and divide that by the number of rows in the data frame to get the average, but I think that only counts the rows that have at least one '@' sign.

Answer 1

Accepted answer by ely

You can use

df[['toaddress', 'ccaddress']].applymap(lambda x: str.count(x, '@'))

to get back the count of '@' within each cell.

Then you can just compute the pandas max, min, and mean along the row axis of the result.
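
A rough sketch of that idea, under some assumptions: the frame below is made-up sample data, and the names counts and summary are only illustrative. It also swaps applymap for a per-column str.count through apply, because applymap is deprecated in recent pandas releases; the counting logic is the same.

```python
import pandas as pd

# Hypothetical sample data; these addresses are invented for illustration.
df = pd.DataFrame({
    "toaddress": ["a@x.com, b@x.com", "c@x.com", "d@x.com, e@x.com, f@x.com"],
    "ccaddress": ["", "g@x.com", "h@x.com, i@x.com"],
    "body": ["hi", "hello", "hey"],
})

# Count the '@' characters in every cell of the two address columns.
counts = df[["toaddress", "ccaddress"]].apply(lambda col: col.str.count("@"))

# One row per statistic, one column per address field.
summary = counts.agg(["min", "max", "mean"])
print(summary)
```

This assumes every cell is a string; a missing value would come back as NaN from str.count, so filling those with an empty string first may be needed.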

As I commented on the original question, you already suggested using df.toaddress.str.contains(r'@').sum(), so why not use df.toaddress.str.count(r'@') if you're happy going column by column instead of the method I showed above?
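
To make the difference between the two calls concrete, here is a small sketch on an invented Series (the values and variable names are assumptions for illustration): str.contains gives one True/False per row, while str.count gives the number of matches in each row.

```python
import pandas as pd

# Hypothetical stand-in for df.toaddress.
to = pd.Series(["a@x.com, b@x.com", "no address here", "c@x.com"])

# Number of rows that contain at least one '@'.
rows_with_at = to.str.contains("@").sum()

# Number of '@' characters in each individual row.
per_row = to.str.count("@")

print(rows_with_at)
print(per_row.tolist())
print(per_row.min(), per_row.max(), per_row.mean())
```

Dividing rows_with_at by the number of rows would give the share of rows with any address, not the average number of addresses, which is why str.count is the better fit here.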

Answer 2

Answered by memebrain

This answer uses https://pypi.python.org/pypi/fake-factory to generate the test data.

import pandas as pd
from random import randint
from faker import Factory
fake = Factory.create()

def emails():
    emailAdd = [fake.email()]
    for x in range(randint(0,3)):
        emailAdd.append(fake.email())

    return emailAdd

# Collect the rows first and build the frame once;
# DataFrame.append was removed in pandas 2.0.
rows = []
for extra in range(10):
    rows.append({'toaddress': emails(), 'ccaddress': emails(), 'body': fake.text()})

df1 = pd.DataFrame(rows, columns=['toaddress', 'ccaddress', 'body'])

print('toaddress length is {}'.format([len(x) for x in df1.toaddress.values]))
print('ccaddress length is {}'.format([len(x) for x in df1.ccaddress.values]))

The last 2 lines are the part that counts your emails. I wasn't sure if you wanted to check for '@' specifically; maybe you can use fake-factory to generate some test data as an example?
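
If the addresses really are stored as Python lists, as in the test data above, one possible way to get the min, max, and mean without a list comprehension is Series.str.len(), which also reports the length of list elements. A minimal sketch with hand-written lists standing in for the generated data (the values and the lengths/stats names are assumptions):

```python
import pandas as pd

# Hypothetical frame where each cell holds a list of addresses.
df1 = pd.DataFrame({
    "toaddress": [["a@x.com"], ["b@x.com", "c@x.com"], ["d@x.com", "e@x.com", "f@x.com"]],
})

# str.len() returns the number of items in each list.
lengths = df1["toaddress"].str.len()

# Aggregate the per-row counts into the three requested statistics.
stats = lengths.agg(["min", "max", "mean"])
print(stats)
```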

Answer 3

Answered by Dmitry Rubanovich

len(df[df.toaddress.str.contains(r'@', na=False)])

or even

sum('@' in str(value) for value in df.toaddress)

Answer 4

Answered by Joseph Stover

Perhaps something like this:

from pandas import *
import re

df = DataFrame({"emails": ["[email protected], [email protected]", 
                           "[email protected], none, [email protected], [email protected]"]})

at = re.compile(r"@", re.I)
def count_emails(string):
    count = 0
    for i in at.finditer(string):
        count += 1
    return count

df["count"] = df["emails"].map(count_emails)

df

Returns:

    emails                                                  count
0   "[email protected], [email protected]"                     2
1   "[email protected], none, [email protected], Th..."     3
