Original source: http://stackoverflow.com/questions/32147429/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python: Count instances of a specific character in all rows within a dataframe column
Asked by bluechips
I have a dataframe (df) containing columns ['toaddress', 'ccaddress', 'body']
I want to iterate through the index of the dataframe to get the min, max, and average number of email addresses in the toaddress and ccaddress fields, as determined by counting the instances of '@' within each field in those two columns.
If all else fails, I guess I could just use df.toaddress.str.contains(r'@').sum() and divide that by the number of rows in the data frame to get the average, but I think that only counts the rows that have at least one @ sign.
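To illustrate the suspicion raised in the question, here is a minimal sketch comparing str.contains with str.count. The sample addresses are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical sample data; the addresses are invented for illustration.
df = pd.DataFrame({
    "toaddress": [
        "a@example.com, b@example.com, c@example.com",
        "d@example.com",
        "",
    ]
})

# str.contains yields one boolean per row, so the sum is the number
# of rows that have at least one '@'.
rows_with_at = df["toaddress"].str.contains("@").sum()

# str.count yields the number of matches per row, so the sum is the
# total number of '@' signs in the column.
total_at = df["toaddress"].str.count("@").sum()

print(rows_with_at, total_at)
# Dividing each by the row count gives two different "averages".
print(rows_with_at / len(df), total_at / len(df))
```

With this data the two sums differ, so dividing the str.contains sum by the row count would understate the average number of addresses per row.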
Accepted answer by ely
You can use
df[['toaddress', 'ccaddress']].applymap(lambda x: str.count(x, '@'))
to get back the count of '@' within each cell.
Then you can just compute the pandas max, min, and mean along the row axis in the result.
As I commented on the original question, you already suggested using df.toaddress.str.contains(r'@').sum(). Why not use df.toaddress.str.count(r'@') if you're happy going column by column instead of using the method I showed above?
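A minimal sketch of the column-by-column idea, followed by the min, max, and mean per column. The sample data is hypothetical, and the sketch uses DataFrame.apply with Series.str.count rather than applymap, since applymap is deprecated in recent pandas releases.

```python
import pandas as pd

# Hypothetical sample data; the addresses are invented for illustration.
df = pd.DataFrame({
    "toaddress": [
        "a@example.com, b@example.com",
        "c@example.com",
        "d@example.com, e@example.com, f@example.com",
    ],
    "ccaddress": ["", "g@example.com", "h@example.com, i@example.com"],
    "body": ["hi", "hello", "see attached"],
})

# Count '@' in every cell of the two address columns, one column at a time.
counts = df[["toaddress", "ccaddress"]].apply(lambda col: col.str.count("@"))

# One row per statistic, one column per address field.
stats = counts.agg(["min", "max", "mean"])
print(stats)
```

Note that str.count returns a missing value for a missing cell, so real data with empty ccaddress entries may need a fillna(0) before aggregating, depending on how blanks should be treated.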
Answer by memebrain
This answer uses https://pypi.python.org/pypi/fake-factory to generate the test data.
import pandas as pd
from random import randint
from faker import Factory

fake = Factory.create()

def emails():
    # Return a list of one to four fake email addresses.
    emailAdd = [fake.email()]
    for x in range(randint(0, 3)):
        emailAdd.append(fake.email())
    return emailAdd

# Collect the rows first; DataFrame.append was removed in pandas 2.0.
rows = []
for extra in range(10):
    rows.append({'toaddress': emails(), 'ccaddress': emails(), 'body': fake.text()})
df1 = pd.DataFrame(rows, columns=['toaddress', 'ccaddress', 'body'])

# Each cell holds a list of addresses, so len() gives the count per row.
print('toaddress length is {}'.format([len(x) for x in df1.toaddress.values]))
print('ccaddress length is {}'.format([len(x) for x in df1.ccaddress.values]))
The last two lines are the part that counts your emails. I wasn't sure whether you wanted to check for '@' specifically; maybe you could use fake-factory to generate some test data as an example?
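When each cell already holds a list of addresses, as in this answer, the per-row lengths can also be obtained without a list comprehension. The following is a sketch with hand-written, hypothetical data in place of the fake-factory output, relying on Series.str.len, which also works on list-valued cells.

```python
import pandas as pd

# Hypothetical data where each cell already holds a list of addresses.
df1 = pd.DataFrame({
    "toaddress": [["a@example.com"], ["b@example.com", "c@example.com"]],
    "ccaddress": [
        ["d@example.com", "e@example.com", "f@example.com"],
        ["g@example.com"],
    ],
})

# Series.str.len returns the length of each element, including lists.
to_counts = df1["toaddress"].str.len()
cc_counts = df1["ccaddress"].str.len()

print(to_counts.tolist())
print(cc_counts.tolist())
print(to_counts.mean(), cc_counts.mean())
```

This only applies when the addresses are stored as lists; for comma-separated strings, counting '@' as in the question is the closer fit.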
Answer by Dmitry Rubanovich
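# In Python 3, filter() returns an iterator, so wrap it in list() before calling len().
# `rows` is not defined in the original answer.
len(list(filter(lambda df: df.toaddress.str.contains(r'@'), rows)))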
or even
len(list(filter(lambda df: r'@' in str(df.toaddress), rows)))
Answer by Joseph Stover
Perhaps something like this
from pandas import DataFrame
import re

df = DataFrame({"emails": ["[email protected], [email protected]",
                           "[email protected], none, [email protected], [email protected]"]})
at = re.compile(r"@", re.I)

def count_emails(string):
    # Count every '@' match in the string.
    count = 0
    for i in at.finditer(string):
        count += 1
    return count

df["count"] = df["emails"].map(count_emails)
df
Returns:
                                               emails  count
0                "[email protected], [email protected]"      2
1  "[email protected], none, [email protected], Th..."      3
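The counting loop above can be condensed. The sketch below assumes made-up addresses in place of the ones in the answer, uses re.findall instead of an explicit loop, and checks that the result agrees with the built-in Series.str.count.

```python
import re
import pandas as pd

# Hypothetical sample data; the addresses are invented for illustration.
df = pd.DataFrame({
    "emails": [
        "a@example.com, b@example.com",
        "c@example.com, none, d@example.com, e@example.com",
    ]
})

def count_emails(text):
    # findall returns every non-overlapping match, so its length is the count.
    return len(re.findall("@", text))

df["count"] = df["emails"].map(count_emails)

# The vectorized string method gives the same per-row counts.
vectorized = df["emails"].str.count("@")
print(df["count"].tolist(), vectorized.tolist())
```

Both approaches count '@' characters rather than validated addresses, so an entry like "none" is ignored while a malformed string containing '@' would still be counted.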

