Note: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/45923738/

Time: 2020-09-14 04:20:29  Source: igfitidea

Replace column value based on value in other column, for all rows in a pandas dataframe

Tags: python, pandas, replace, apply

Asked by Korzak

I am having trouble thinking pythonically about something, and would love some guidance.

I have a dataframe that contains columns with dates of events at which certain files should be uploaded, and a column with the names of those events. So events can be X, Y, Z, and files can be 1, 2, 3.

Not all files need to be uploaded at all events, i.e. if it's Event X, then files 1, 2, and 3 need to be uploaded, but if it's Event Y, then only file 3 needs to be uploaded. The date columns either have a date in them, or are blank.

What I want to do is, for all the files for events that are not needed, replace blank with "Not Needed".

Initial:

    File1   File2  File3
X   Aug 1          Sept 1
X   Aug 3   Aug 4  Sept 9
Y                  Sept 10
Z   Aug 12
X   Aug 13  Aug 15
Z   Aug 1

Goal:

    File1   File2  File3
X   Aug 1          Sept 1
X   Aug 3   Aug 4  Sept 9
Y   NN      NN     Sept 10
Z   Aug 12  NN     NN
X   Aug 13  Aug 15
Z   Aug 1   NN     NN

So in other words, for the blanks that SHOULD be blank because a file is not expected, replace that value with "Not Needed", while leaving the other blanks alone.

I have tried doing this with .replace(), .apply() with functions, and I am not having any success.

The code below sort of works, but it overwrites the column not only in rows where there is a match but also in rows where there is no match.

    import numpy as np

    # Fill in descriptive text for scales not collected at certain visits
    # (where upload dates would be blank)
    df_combined['FAQ-Audio-upDate'] = np.where(df_combined['VisitName'] == "Screening", "FAQ Not Expected", "")
    df_combined['FAQ-Form-upDate'] = np.where(df_combined['VisitName'] == "Screening", "FAQ Not Expected", "")

How can I change the value in one column based on the value in another column, across the entire dataframe? What I want is basically this:

    For every row in the dataframe
        If the value in the VisitName column == X
            Change the value in ColumnA to "Not Expected"

Thank you!!

Answered by Edward Brennan

So, I can answer your base question pretty easily, but there are a couple of style things I think you might want to change that I'd like to get into. I'm fairly certain this question has been addressed in other threads, but you've got a couple of problems wrapped up in one, so I'm just going to address them here.

    For every row in the dataframe
        If the value in the VisitName column == X
            Change the value in ColumnA to "Not Expected"

You want to be using index slices to set values. Get a boolean mask of the dataframe based on the logic you want, use that to create a new dataframe containing only the rows that you want to update, get the index of this new dataframe, and use this index with the original dataframe to change the values over.

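    import pandas as pd

    # Empty frame (all NaN) with the three columns used in the example.
    df = pd.DataFrame(data=None, index=["X", "Y", "Z"],
                      columns=["VisitName", "ColumnA", "ColumnB"])

    # Index labels of the rows where VisitName equals "X".
    not_expected_index = df[df.loc[:, "VisitName"] == "X"].index
    # Use those labels to overwrite ColumnA in the original frame.
    df.loc[not_expected_index, "ColumnA"] = "Not Expected"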

This is the preferred way in pandas to change values in a DataFrame based on other values in another column.
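
As a rough sketch of how the same masking idea might apply to the table in the question: the `Event` column name and the `not_needed` lookup below are assumptions made for illustration (the question keeps the event in the index and never spells out the rules for Z), and blanks are assumed to be stored as NaN.

```python
import numpy as np
import pandas as pd

# Toy version of the table from the question; NaN marks a blank cell.
df = pd.DataFrame(
    {
        "Event": ["X", "X", "Y", "Z", "X", "Z"],
        "File1": ["Aug 1", "Aug 3", np.nan, "Aug 12", "Aug 13", "Aug 1"],
        "File2": [np.nan, "Aug 4", np.nan, np.nan, "Aug 15", np.nan],
        "File3": ["Sept 1", "Sept 9", "Sept 10", np.nan, np.nan, np.nan],
    }
)

# Hypothetical lookup: the files each event does NOT need.
not_needed = {"Y": ["File1", "File2"], "Z": ["File2", "File3"]}

# The loop runs over a handful of (event, column) pairs, not over rows.
for event, files in not_needed.items():
    for col in files:
        # Rows for this event where the cell is still blank.
        mask = (df["Event"] == event) & df[col].isna()
        df.loc[mask, col] = "NN"

print(df)
```

Blanks in rows for event X are left alone, since X does not appear in the lookup.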

Now, there are a couple of things about the original DataFrame you posted that I'd like to mention. First, if you already have null values in the dataframe cells, then you can use the pandas DataFrame fillna method to populate these values.

    df.fillna("Not Expected")
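
Note that `fillna` returns a new object rather than changing `df` in place, so the result has to be assigned back. A minimal sketch of combining it with a row mask, so that only blanks in the matching rows are filled, might look like this; the column names come from the question, while the "Baseline" visit and the sample dates are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "VisitName": ["Screening", "Baseline", "Screening"],
        "FAQ-Audio-upDate": [np.nan, "Aug 3", "Aug 5"],
    }
)

is_screening = df["VisitName"] == "Screening"

# Fill nulls only in the Screening rows and assign the result back.
# Existing dates in those rows are kept; other rows are not touched.
df.loc[is_screening, "FAQ-Audio-upDate"] = df.loc[
    is_screening, "FAQ-Audio-upDate"
].fillna("FAQ Not Expected")

print(df)
```

Unlike the `np.where` call in the question, this leaves the rows that do not match, and the dates already present, unchanged.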

Second, why do you want to use the string "NN" or "Not Needed" over the default Null value? For any operations within pandas, I prefer to stick with the actual null values, so that you can use aggregation functions like sum or count freely on dataframes with null values.
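
To illustrate the point about aggregation, here is a small sketch with made-up upload dates: `count` skips real nulls, but it treats a placeholder string like any other value.

```python
import numpy as np
import pandas as pd

# Two uploads recorded, two cells left as real nulls.
with_nulls = pd.Series(["Aug 1", np.nan, "Aug 12", np.nan])
# Same data with the nulls replaced by a placeholder string.
with_text = with_nulls.fillna("Not Needed")

uploads_with_nulls = with_nulls.count()  # counts non-null cells only
uploads_with_text = with_text.count()    # placeholder strings are counted too
missing = with_nulls.isna().sum()        # number of null cells

print(uploads_with_nulls, uploads_with_text, missing)
```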

Third, the index contains repeated values:

    df = pd.DataFrame(index=["X", "X", "Y", "Z", "X", "Z"],
                      columns=["VisitName", "ColumnA", "ColumnB"])

Dataframes will allow duplicate index values, but they can behave in funny ways that you need to be aware of.

For example:

    print(df)

prints

      VisitName ColumnA ColumnB
    X       NaN     NaN     NaN
    X       NaN     NaN     NaN
    Y       NaN     NaN     NaN
    Z       NaN     NaN     NaN
    X       NaN     NaN     NaN
    Z       NaN     NaN     NaN

Setting a value in VisitName for the label X updates every row that carries that label:

    df.loc["X", "VisitName"] = "test"

gives

      VisitName ColumnA ColumnB
    X      test     NaN     NaN
    X      test     NaN     NaN
    Y       NaN     NaN     NaN
    Z       NaN     NaN     NaN
    X      test     NaN     NaN
    Z       NaN     NaN     NaN
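
One possible way to spot and sidestep the duplicate-label behaviour is sketched below. The `Event` column name is an assumption made for illustration; the idea is just to move the repeated labels into an ordinary column so that each row gets its own integer label.

```python
import pandas as pd

df = pd.DataFrame(
    index=["X", "X", "Y", "Z", "X", "Z"],
    columns=["VisitName", "ColumnA", "ColumnB"],
)

# Quick check for repeated labels before relying on label-based assignment.
unique_before = df.index.is_unique

# With duplicates, selecting one label returns every row that carries it.
rows_for_x = df.loc["X"]

# Move the labels into a column ("Event" is a hypothetical name),
# leaving a unique integer index behind.
flat = df.rename_axis("Event").reset_index()

# Now an assignment by label touches exactly one row.
flat.loc[0, "VisitName"] = "test"

print(flat)
```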

If I were tackling this problem, I'd either use the date as the index, with a True or False value in a file's column depending on whether it needs to be sent or not on that date,

    index        File1 File2 File3
    8/01/17      True  False True
    8/08/17      False True  True
    8/15/17      True  True  False
    8/24/17      False True  False
    9/01/17      False False False
    9/12/17      True  False True

or I'd just use an integer index, with a column for the date and a column for what file is being sent.

    index  date      file
    0      8/01/17   1
    1      8/01/17   2
    2      8/08/17   2
    3      8/15/17   1
    4      8/15/17   2
    5      8/15/17   3

I mean, if you're locked into using the other setup, that's fine, but I think these would be easier dataframe setups to work with, since they'd support groupby methods more easily.
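
For instance, with the second (long) layout, a couple of groupby queries could look roughly like this. The data is the small date/file table above; the variable names are just for illustration.

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "date": ["8/01/17", "8/01/17", "8/08/17", "8/15/17", "8/15/17", "8/15/17"],
        "file": [1, 2, 2, 1, 2, 3],
    }
)

# How many files were sent on each date.
files_per_date = long_df.groupby("date")["file"].count()

# The dates on which each file was sent.
dates_per_file = long_df.groupby("file")["date"].agg(list)

print(files_per_date)
print(dates_per_file)
```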

Also, keep in mind that if you're looping over rows with a for loop, then you might as well not be using pandas. The whole point of pandas is that it uses C to speed up index operations. Never use:

    for row in df.index:
        df.loc[row, 'columna'] += 2.

Always use:

    df.loc[:, 'columna'] += 2.
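
A small self-contained check, using a made-up five-row frame, suggests the two forms produce the same values; the vectorized one just does it in a single call instead of one `.loc` lookup per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"columna": np.arange(5, dtype=float)})

# Row-by-row version: one .loc lookup and assignment per row.
looped = df.copy()
for row in looped.index:
    looped.loc[row, "columna"] += 2.0

# Vectorized version: the whole column is updated at once.
vectorized = df.copy()
vectorized.loc[:, "columna"] += 2.0

same = looped.equals(vectorized)
print(same)
```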