Python Pandas: populate a new dataframe column based on matching columns in another dataframe

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/39816671/

Time: 2020-08-19 22:45:02  Source: igfitidea

Pandas populate new dataframe column based on matching columns in another dataframe

python pandas merge populate

Asked by user3471881

I have a df which contains my main data: one million rows and 30 columns. Now I want to add another column to my df called category. The category is a column in df2, which contains around 700 rows and two other columns that will match two columns in df.

I begin by setting an index in df2 and df that will match between the frames; however, some of the index values in df2 don't exist in df.

The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.

The relevant column in df is called AUTHOR_NAME.

Some of the AUTHOR_NAME values in df don't exist in df2 and vice versa.

The instruction I want is: when the index in df matches the index in df2 and the title in df matches the title in df2, add category to df, else add NaN in category.

Example data:

df2
           AUTHOR_NAME              CATEGORY
Index       
Pub1        author1                 main
Pub2        author1                 main
Pub3        author1                 main
Pub1        author2                 sub
Pub3        author2                 sub
Pub2        author4                 sub


df
            AUTHOR_NAME     ...n amount of other columns        
Index       
Pub1        author1                 
Pub2        author1     
Pub1        author2 
Pub1        author3
Pub2        author4 

expected_result
            AUTHOR_NAME             CATEGORY   ...n amount of other columns
Index
Pub1        author1                 main
Pub2        author1                 main
Pub1        author2                 sub
Pub1        author3                 NaN
Pub2        author4                 sub

If I use df2.merge(df, left_index=True, right_index=True, how='left', on=['AUTHOR_NAME']) my df becomes three times bigger than it is supposed to be.
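
One possible explanation for the blow-up, sketched on the example data above: if the frames end up being matched on the index alone, every Pub1 row in one frame pairs with every Pub1 row in the other. This is an assumption about what happened, not a diagnosis of the original run. Matching on both the index and AUTHOR_NAME keeps exactly one row per row of df. The names index_only and both_keys are mine.

```python
import pandas as pd

# Example frames from the question; "Index" is the index of both.
df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {"AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"]},
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

# Matching on the index alone: repeated labels multiply (3 x 2 for Pub1, 2 x 2 for Pub2).
index_only = df.merge(
    df2, left_index=True, right_index=True, how="left", suffixes=("", "_df2")
)

# Matching on both keys: move the index into a column and merge on the pair.
both_keys = df.reset_index().merge(
    df2.reset_index(), on=["Index", "AUTHOR_NAME"], how="left"
)

print(len(df), len(index_only), len(both_keys))  # 5 10 5
print(both_keys)
```

The left merge on both keys leaves NaN in CATEGORY for Pub1/author3, which has no counterpart in df2, and ignores the Pub3 rows that exist only in df2.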

So I thought maybe merging was the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and then return type values to df depending on whether certain conditions are met.

def calculate_category(df2, d):
    category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    return str(category_row['CATEGORY'].iat[0])

df.apply(lambda d: calculate_category(df2, d), axis=1)

However, this throws me an error:

IndexError: ('index out of bounds', u'occurred at index 7614')
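
The IndexError is consistent with .iat[0] being called on an empty selection, which happens for any row of df that has no counterpart in df2 (Pub1/author3 in the example data). Below is a minimal sketch of a guarded lookup, assuming Index is an ordinary column in both frames, as the function above implies. The empty check and the NaN return value are my additions.

```python
import numpy as np
import pandas as pd

# Example data from the question, with "Index" as an ordinary column
# because the lookup function reads df2["Index"].
df2 = pd.DataFrame({
    "Index": ["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"],
    "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
    "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
})
df = pd.DataFrame({
    "Index": ["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"],
    "AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"],
})


def calculate_category(df2, d):
    """Return the CATEGORY matching row d, or NaN when df2 has no such row."""
    category_row = df2[
        (df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])
    ]
    if category_row.empty:
        # Without this guard, .iat[0] raises IndexError on the empty selection.
        return np.nan
    return str(category_row["CATEGORY"].iat[0])


df["CATEGORY"] = df.apply(lambda d: calculate_category(df2, d), axis=1)
print(df)
```

A row-wise apply filters df2 once per row of df, so on a million rows a merge or join is likely to be much faster than this lookup.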

Answered by piRSquared

Consider the following dataframes df and df2

import pandas as pd

df = pd.DataFrame(dict(
        AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
        title=      list('zyxwvutsrqponml')
    ))

df2 = pd.DataFrame(dict(
        AUTHOR_NAME=list('AABCCEGG'),
        title      =list('zwvtrpml'),
        CATEGORY   =list('11223344')
    ))

option 1
merge

df.merge(df2, how='left')

option 2
join

cols = ['AUTHOR_NAME', 'title']
df.join(df2.set_index(cols), on=cols)


Both options yield the same result (shown as an image in the original answer).
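
A minimal sketch that runs both options on the frames defined in this answer and prints the result, so the output can be inspected directly. The variable names merged and joined are mine. With no on argument, merge uses the columns the two frames have in common, here AUTHOR_NAME and title.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

cols = ['AUTHOR_NAME', 'title']

# Option 1: left merge on the shared columns.
merged = df.merge(df2, how='left')

# Option 2: index df2 by the key columns and join on them.
joined = df.join(df2.set_index(cols), on=cols)

print(merged)
print(merged['CATEGORY'].notna().sum(), 'of', len(merged), 'rows got a category')
```

Every row of df is kept. Rows whose (AUTHOR_NAME, title) pair does not occur in df2 get NaN in CATEGORY, and the df2 row (A, w), which has no partner in df, is dropped.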

Answered by Nickil Maveli

APPROACH 1:

You could use concat instead and drop the duplicated values present in the Index and AUTHOR_NAME columns combined. After that, use isin to check membership:

df_concat = pd.concat([df2, df]).reset_index().drop_duplicates(['Index', 'AUTHOR_NAME'])
df_concat.set_index('Index', inplace=True)
df_concat[df_concat.index.isin(df.index)]

Note: The column Index is assumed to be set as the index column for both DataFrames.
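
A sketch of Approach 1 run on the question's example data, with Index set as the index of both frames as the note assumes. It uses set_index without inplace and a variable named result, which are my choices. Otherwise the steps are the ones given above.

```python
import pandas as pd

df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {"AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"]},
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

# Stack df2 on top of df; rows coming from df have NaN in CATEGORY.
# drop_duplicates keeps the first occurrence, i.e. the df2 row that carries a category.
df_concat = pd.concat([df2, df]).reset_index().drop_duplicates(["Index", "AUTHOR_NAME"])
df_concat = df_concat.set_index("Index")

# Keep only index labels that occur in df (this drops the Pub3 rows).
result = df_concat[df_concat.index.isin(df.index)]
print(result)
```

Two things to keep in mind with this approach: the rows come out in concatenation order (df2's rows first, then the unmatched rows of df), and the isin filter checks only the index label, so a df2 row whose Index occurs in df under a different author would also be kept.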



APPROACH 2:

Use join after setting the index columns correctly, as shown:

df2.set_index(['Index', 'AUTHOR_NAME'], inplace=True)
df.set_index(['Index', 'AUTHOR_NAME'], inplace=True)

df.join(df2).reset_index()
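
A sketch of Approach 2 on the question's example data. Since set_index(['Index', 'AUTHOR_NAME']) needs Index to be a column, the sketch starts from frames indexed by Index and calls reset_index first. The OTHER column is a hypothetical stand-in for the question's "n amount of other columns", and the non-inplace set_index calls are my choice.

```python
import pandas as pd

df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"],
        "OTHER": [0, 1, 2, 3, 4],  # hypothetical placeholder column
    },
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

keys = ["Index", "AUTHOR_NAME"]

# Build a two-level index on both frames so that join matches on the pair.
df_keyed = df.reset_index().set_index(keys)
df2_keyed = df2.reset_index().set_index(keys)

# join defaults to a left join, so every row of df is kept.
result = df_keyed.join(df2_keyed).reset_index()
print(result)
```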

Answered by kiltannen

While the other answers here give very good and elegant solutions to the question asked, I have found a resource that both answers this question in an extremely elegant fashion and gives a beautifully clear and straightforward set of examples on how to accomplish join/merge of dataframes, effectively teaching LEFT, RIGHT, INNER and OUTER joins.

Join And Merge Pandas Dataframe

I honestly feel that anyone else researching this topic will also want to examine his examples.

Answered by Bhagabat Behera

You may try the following. It will merge the two datasets on the specified key.

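# CATEGORY exists only in df2, so merge on the keys that both frames share
expected_result = pd.merge(df.reset_index(), df2.reset_index(), on=['Index', 'AUTHOR_NAME'], how='left')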

Answered by NikoTumi

Try

df = df.combine_first(df2)
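
combine_first aligns the two frames on their index labels and returns the union of their rows and columns. With the question's data this means the result also contains rows that exist only in df2. The sketch below is one way to adapt the suggestion, not the answerer's exact method: it builds a two-level key on both frames and then trims the result back to df's rows. The OTHER column is a hypothetical placeholder for df's remaining columns, and the reindex step is my addition.

```python
import pandas as pd

df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"],
        "OTHER": [0, 1, 2, 3, 4],  # hypothetical placeholder column
    },
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

keys = ["Index", "AUTHOR_NAME"]
df_keyed = df.reset_index().set_index(keys)
df2_keyed = df2.reset_index().set_index(keys)

# Union of both frames' rows and columns; includes the Pub3 rows found only in df2.
combined = df_keyed.combine_first(df2_keyed)

# Keep only the rows that were in df, in df's original order.
result = combined.reindex(df_keyed.index).reset_index()
print(result)
```

Because of the extra rows and the trimming step, a left merge or join on the two keys is probably the more direct tool for this particular task.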