Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/53297261/

Time: 2020-09-14 06:09:31  Source: igfitidea

pandas iteration over rows as dict

Tags: python, pandas, performance

Asked by Matina G

Hello,

I need to iterate over a pandas DataFrame and pass each row to a function (actually, a class constructor) as **kwargs. This means that each row should behave like a dictionary whose keys are the column names and whose values are that row's entries.

This works, but it performs very badly:

import pandas as pd


def myfunc(**kwargs):
    try:
        area = kwargs.get('length', 0)* kwargs.get('width', 0)
        return area
    except TypeError:
        return 'Error : length and width should be int or float'


df = pd.DataFrame({'length':[1,2,3], 'width':[10, 20, 30]})

for i in range(len(df)):
    print myfunc(**df.iloc[i])

Any suggestions on how to make this faster? I have tried iterating with df.iterrows(), but I get the following error:

TypeError: myfunc() argument after ** must be a mapping, not tuple

I have also tried df.itertuples() and df.values, but either I am missing something, or I would have to convert each tuple / np.array to a pd.Series or dict, which would also be slow. My constraint is that the script has to work with Python 2.7 and pandas 0.14.1.
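
For reference, the per-row conversion described above could be sketched roughly as follows. This is only an illustration, written for Python 3 and a recent pandas rather than the versions named in the constraint; the names columns and row_dicts are made up for the example, and no claim is made here about how fast it is on real data.

```python
import pandas as pd

df = pd.DataFrame({'length': [1, 2, 3], 'width': [10, 20, 30]})

# Column names are looked up once, outside the loop.
columns = list(df.columns)

# itertuples(index=False) yields one plain tuple of values per row;
# zipping it with the column names rebuilds a dict for that row.
row_dicts = [dict(zip(columns, values)) for values in df.itertuples(index=False)]

for row_dict in row_dicts:
    print(row_dict)
```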

Thanks in advance for your help!

Accepted answer by stellasia

You can try:

for k, row in df.iterrows():
    myfunc(**row)

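Here k is the dataframe index and row is a pandas Series that behaves like a dict, so you can access any column with: row["my_column_name"]. Note that iterrows() yields (index, row) pairs, which is why passing the whole pair to ** raises the TypeError from the question.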
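
A minimal sketch of this approach, reusing the data from the question with a simplified myfunc (the try/except is dropped for brevity). It is written for Python 3; the list name areas is just an illustrative choice.

```python
import pandas as pd


def myfunc(**kwargs):
    # Same area computation as in the question, without the error handling.
    return kwargs.get('length', 0) * kwargs.get('width', 0)


df = pd.DataFrame({'length': [1, 2, 3], 'width': [10, 20, 30]})

# iterrows() yields (index, Series) pairs. Unpacking the pair leaves a
# Series, which ** accepts because it provides keys() and item lookup.
areas = [myfunc(**row) for _, row in df.iterrows()]
print(areas)
```

The index is discarded here with _; keep it (as k above) if the function needs to know which row it was given.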

Answer by avloss

One clean option is this:

for row_dict in df.to_dict(orient="records"):
    print(row_dict['column_name'])
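
Applied to the question's use case, this might look like the sketch below. It assumes Python 3 and a pandas version that accepts the spelled-out "records" orient; myfunc is a simplified stand-in for the function in the question, and records and areas are illustrative names.

```python
import pandas as pd


def myfunc(**kwargs):
    # Simplified version of the function from the question.
    return kwargs.get('length', 0) * kwargs.get('width', 0)


df = pd.DataFrame({'length': [1, 2, 3], 'width': [10, 20, 30]})

# to_dict(orient="records") builds one plain dict per row, keyed by
# column name, so each element can be passed straight to **kwargs.
records = df.to_dict(orient="records")
areas = [myfunc(**rec) for rec in records]
print(areas)
```

Because the whole list of dicts is built up front, this trades memory for simplicity on large frames.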

Answer by jpp

Defining a separate function for this will be inefficient, because it forces the calculation to run row by row. It is more efficient to compute a new series with column-wise operations, then iterate over that series:

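df = pd.DataFrame({'length': [1, 2, 3, 'test'], 'width': [10, 20, 30, 'hello']})

# Convert every column to numeric; values that cannot be parsed become NaN
df2 = df.apply(pd.to_numeric, errors='coerce')

error_str = 'Error : length and width should be int or float'
# A NaN product marks an invalid row, so replace it with the error message
print(*(df2['length'] * df2['width']).fillna(error_str), sep='\n')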

10.0
40.0
90.0
Error : length and width should be int or float