Update a dataframe in pandas while iterating row by row

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/23330654/

Time: 2020-08-19 02:46:08  Source: igfitidea

Tags: python, pandas, updates, dataframe

Asked by AMM

I have a pandas data frame that looks like this (it's a pretty big one):

            date      exer exp     ifor         mat
1092  2014-03-17  American   M  528.205  2014-04-19
1093  2014-03-17  American   M  528.205  2014-04-19
1094  2014-03-17  American   M  528.205  2014-04-19
1095  2014-03-17  American   M  528.205  2014-04-19
1096  2014-03-17  American   M  528.205  2014-05-17

Now I would like to iterate row by row. As I go through each row, the value of ifor in that row can change depending on some conditions, and I need to look up another dataframe.

How do I update the dataframe as I iterate? I tried a few things, but none of them worked.

for i, row in df.iterrows():
    if <something>:
        row['ifor'] = x
    else:
        row['ifor'] = y

    df.ix[i]['ifor'] = x

None of these approaches seems to work. I don't see the values updated in the dataframe.

Answered by CT Zhu

You should assign the value with df.ix[i, 'exp'] = X or df.loc[i, 'exp'] = X instead of df.ix[i]['ifor'] = x.

Otherwise you are using chained indexing, which may write to a temporary copy rather than to the original DataFrame, and you should get a warning:

-c:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_index,col_indexer] = value instead

That said, the loop should probably be replaced by a vectorized algorithm to make full use of DataFrame, as @Phillip Cloud suggested.
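
As a rough illustration of the vectorized route, here is a minimal sketch using numpy.where. The sample data, the condition on exer, and the values x and y are placeholders made up for the example; they stand in for whatever <something>, x, and y are in the question.

```python
import numpy as np
import pandas as pd

# Small made-up frame; only the column names come from the question.
df = pd.DataFrame(
    {"exer": ["American", "American", "European"],
     "ifor": [528.205, 528.205, 528.205]},
    index=[1092, 1093, 1094],
)

x, y = 1.0, 2.0                        # placeholder replacement values
condition = df["exer"] == "American"   # hypothetical stand-in for <something>

# One vectorized assignment instead of a Python-level loop:
# rows where the condition holds get x, all others get y.
df["ifor"] = np.where(condition, x, y)
```

If the condition depends on a second dataframe, it may be possible to bring the needed column in first (for example with a merge) and then apply the same pattern, but that depends on the actual lookup.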

Answered by rakke

You can assign values in the loop using df.set_value:

for i, row in df.iterrows():
    ifor_val = something
    if <condition>:
        ifor_val = something_else
    df.set_value(i,'ifor',ifor_val)

If you don't need the row values, you could simply iterate over the indices of df, but I kept the original for-loop in case you need the row value for something not shown here.

Update

df.set_value() has been deprecated since version 0.21.0; you can use df.at instead:

for i, row in df.iterrows():
    ifor_val = something
    if <condition>:
        ifor_val = something_else
    df.at[i,'ifor'] = ifor_val

Answered by GoingMyWay

One method you can use is itertuples(). It iterates over DataFrame rows as namedtuples, with the index value as the first element of the tuple, and it is much faster than iterrows(). With itertuples(), each row carries its Index in the DataFrame, and you can use loc (or at) to set the value.

for row in df.itertuples():
    if <something>:
        df.at[row.Index, 'ifor'] = x
    else:
        df.at[row.Index, 'ifor'] = y

    # .loc can be used in place of .at for the same kind of assignment (slower)
    df.loc[row.Index, 'ifor'] = x

Thanks @SantiStSupery, using .at is much faster.
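
For reference, a small runnable sketch of the itertuples() plus .at pattern. The data and the doubling rule are made up for the example; only the column names are taken from the question.

```python
import pandas as pd

# Small made-up frame; only the column names come from the question.
df = pd.DataFrame(
    {"exer": ["American", "European"], "ifor": [528.205, 528.205]},
    index=[1092, 1093],
)

for row in df.itertuples():
    # row.Index is the index label; the columns are available as
    # attributes of the namedtuple (row.exer, row.ifor).
    # Hypothetical rule: double ifor for "American" rows, keep it otherwise.
    new_val = row.ifor * 2 if row.exer == "American" else row.ifor
    df.at[row.Index, "ifor"] = new_val
```

Note that the namedtuple holds the values as they were when the row was produced, so the write has to go through df.at (or df.loc), not through row.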

Answered by piRSquared

A Pandas DataFrame object should be thought of as a Series of Series. In other words, you should think of it in terms of columns. This matters because when you use pd.DataFrame.iterrows you are iterating through rows as Series. But these are not the Series that the data frame is storing, so they are new Series created for you while you iterate. That implies that when you attempt to assign to them, those edits won't end up reflected in the original data frame.
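
A quick way to see this for yourself is the sketch below. It assumes a frame with mixed column types (as in the question), where each row handed out by iterrows is a freshly built Series; the data is made up.

```python
import pandas as pd

# Mixed dtypes (strings and floats), as in the question's frame.
df = pd.DataFrame(
    {"exp": ["M", "M"], "ifor": [528.205, 528.205]},
    index=[1092, 1093],
)

# Assigning to the row Series changes only the temporary object.
for i, row in df.iterrows():
    row["ifor"] = 0.0
unchanged = df["ifor"].tolist()

# Assigning through df.at writes to the DataFrame itself.
for i in df.index:
    df.at[i, "ifor"] = 0.0
changed = df["ifor"].tolist()
```

Here unchanged still holds the original values, while changed reflects the writes made through df.at.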

OK, now that that is out of the way: what do we do?

Suggestions prior to this post include:

  1. pd.DataFrame.set_value is deprecated as of Pandas version 0.21
  2. pd.DataFrame.ix is deprecated
  3. pd.DataFrame.loc is fine, but it can work on array indexers, and you can do better

My recommendation
Use pd.DataFrame.at

for i in df.index:
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y

You can even change this to:

for i in df.index:
    df.at[i, 'ifor'] = x if <something> else y


Response to comment

"And what if I need to use the value of the previous row for the if condition?"

j = df.columns.get_loc('ifor')  # integer position of the 'ifor' column
for i in range(1, len(df) + 1):
    # i - 1 is the positional index of the row being set
    if <something>:
        df.iat[i - 1, j] = x
    else:
        df.iat[i - 1, j] = y
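
To make the previous-row case concrete, here is a minimal sketch with positional access through iat. The data and the rule (bump the value by 1 when it equals the previous row's value) are invented for illustration and are not from the question.

```python
import pandas as pd

# Made-up integer data so the result is easy to follow.
df = pd.DataFrame({"ifor": [10, 10, 20, 20]}, index=[1092, 1093, 1094, 1095])

j = df.columns.get_loc("ifor")  # integer position of the 'ifor' column

# Start at 1 so that row i - 1 always exists.
for i in range(1, len(df)):
    prev = df.iat[i - 1, j]  # value in the previous row (already updated)
    # Hypothetical rule: if this row repeats the previous value, add 1.
    if df.iat[i, j] == prev:
        df.iat[i, j] = prev + 1
```

Because each step reads the already updated previous row, this kind of dependency is one of the cases where a plain loop can be easier to reason about than a vectorized expression.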

Answered by Duane

for i, row in df.iterrows():
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y

Answered by Pranzell

Well, if you are going to iterate anyhow, why not use the simplest method of all, df['Column'].values[i]?

df['Column'] = ''

for i in range(len(df)):
    df['Column'].values[i] = something/update/new_value

Or, if you want to compare the new values with the old ones or anything like that, why not store them in a list and then assign the list to the column at the end?

mylist, df['Column'] = [], ''

for <condition>:
    mylist.append(something/update/new_value)

df['Column'] = mylist

Answered by Shazir Jabbar

Increment the MAX number from a column. For example:

df1 = [sort_ID, Column1,Column2]
print(df1)

My output:

Sort_ID Column1 Column2
12         a    e
45         b    f
65         c    g
78         d    h


MAX = df1['Sort_ID'].max() #This returns my Max Number 

Now I need to create a column in df2 and fill it with values that increment from MAX.

Sort_ID Column1 Column2
79      a1       e1
80      b1       f1
81      c1       g1
82      d1       h1


Note: df2 will initially contain only Column1 and Column2. We need the Sort_ID column to be created, incrementing from the MAX of df1.
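
One possible way to do this without a loop is sketched below. The two frames are rebuilt from the tables shown above, and the approach (a consecutive range starting at MAX + 1, inserted as the first column) is an assumption about what is wanted, not something stated in the post.

```python
import pandas as pd

# Frames rebuilt from the tables shown above.
df1 = pd.DataFrame({"Sort_ID": [12, 45, 65, 78],
                    "Column1": ["a", "b", "c", "d"],
                    "Column2": ["e", "f", "g", "h"]})
df2 = pd.DataFrame({"Column1": ["a1", "b1", "c1", "d1"],
                    "Column2": ["e1", "f1", "g1", "h1"]})

MAX = int(df1["Sort_ID"].max())  # 78 for the data above

# Consecutive IDs starting right after MAX, one per row of df2,
# inserted as the first column.
new_ids = list(range(MAX + 1, MAX + 1 + len(df2)))
df2.insert(0, "Sort_ID", new_ids)
```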