Python Pandas: populate a new dataframe column based on matching columns in another dataframe

Question

Asked by user3471881

I have a df which contains my main data: one million rows and 30 columns. Now I want to add another column to my df called category. category is a column in df2, which contains around 700 rows and two other columns that will match two columns in df.

I begin by setting an index in df2 and df that will match between the frames; however, some of the index values in df2 don't exist in df.

The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.

The relevant column in df is called AUTHOR_NAME.

Some of the AUTHOR_NAME values in df don't exist in df2, and vice versa.

The instruction I want is: when index in df matches index in df2 and title in df matches title in df2, add category to df; otherwise put NaN in category.

Example data:

df2
           AUTHOR_NAME              CATEGORY
Index       
Pub1        author1                 main
Pub2        author1                 main
Pub3        author1                 main
Pub1        author2                 sub
Pub3        author2                 sub
Pub2        author4                 sub


df
            AUTHOR_NAME     ...n amount of other columns        
Index       
Pub1        author1                 
Pub2        author1     
Pub1        author2 
Pub1        author3
Pub2        author4 

expected_result
            AUTHOR_NAME             CATEGORY   ...n amount of other columns
Index
Pub1        author1                 main
Pub2        author1                 main
Pub1        author2                 sub
Pub1        author3                 NaN
Pub2        author4                 sub

If I use df2.merge(df, left_index=True, right_index=True, how='left', on=['AUTHOR_NAME']), my df becomes three times bigger than it is supposed to be.
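
One plausible cause of the growth is that the index labels repeat in both frames, so a join on the index alone pairs every Pub1 row in df with every Pub1 row in df2. The following is a minimal sketch on the example data above, not the asker's exact call: it assumes the index is named Index, and it leaves out the `on` argument because recent pandas versions do not accept `on` together with `left_index`/`right_index`.

```python
import pandas as pd

# Example frames from the question; "Index" is assumed to be the index name.
df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {"AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"]},
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

# Index-only join: each label in df is paired with every row in df2
# that carries the same label, so the row count grows.
index_only = df.merge(df2, left_index=True, right_index=True, how="left")
print(len(df), len(index_only))

# Matching on both keys as ordinary columns keeps one output row per df row,
# provided each (Index, AUTHOR_NAME) pair is unique in df2.
both_keys = df.reset_index().merge(
    df2.reset_index(), on=["Index", "AUTHOR_NAME"], how="left"
)
print(both_keys)
```

On this data the index-only join turns 5 rows into 10, while the two-key merge keeps 5 rows and leaves CATEGORY empty for the unmatched author3 row.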

So I thought maybe merging was the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and then return type values to df depending on whether certain conditions are met.

def calculate_category(df2, d):
    category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    return str(category_row['CATEGORY'].iat[0])

df.apply(lambda d: calculate_category(df2, d), axis=1)

However, this throws me an error:

IndexError: ('index out of bounds', u'occurred at index 7614')
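
The error is consistent with a row in df that has no counterpart in df2: the boolean filter then selects nothing, and `.iat[0]` has no element to read. Below is a hedged sketch of a guarded variant of the function. It assumes Index has been turned into a regular column (for example with `reset_index()`), and the two small frames are made-up data for illustration only.

```python
import numpy as np
import pandas as pd


def calculate_category(lookup, row):
    """Return the CATEGORY matching this row's keys, or NaN when none exists."""
    match = lookup[
        (lookup["Index"] == row["Index"])
        & (lookup["AUTHOR_NAME"] == row["AUTHOR_NAME"])
    ]
    # An empty selection is the case in which .iat[0] raises IndexError.
    if match.empty:
        return np.nan
    return str(match["CATEGORY"].iat[0])


# Hypothetical frames with "Index" as an ordinary column.
lookup = pd.DataFrame({
    "Index": ["Pub1", "Pub2"],
    "AUTHOR_NAME": ["author1", "author4"],
    "CATEGORY": ["main", "sub"],
})
data = pd.DataFrame({
    "Index": ["Pub1", "Pub1", "Pub2"],
    "AUTHOR_NAME": ["author1", "author3", "author4"],
})

data["CATEGORY"] = data.apply(lambda row: calculate_category(lookup, row), axis=1)
print(data)
```

A row-wise `apply` like this filters the lookup frame once per row, so on a million rows a merge or join on the two key columns is likely to be much faster.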

Answer 1

Answered by piRSquared

Consider the following dataframes df and df2:

import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))

df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

option 1
merge

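# with no `on` argument, merge uses the columns common to both frames
# (here AUTHOR_NAME and title)
df.merge(df2, how='left')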

option 2
join

cols = ['AUTHOR_NAME', 'title']
df.join(df2.set_index(cols), on=cols)

both options yield
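
The two options can be run side by side to check that they agree. This is a minimal sketch that assumes only pandas and the df and df2 defined in this answer; the names `merged`, `joined` and `same` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

cols = ['AUTHOR_NAME', 'title']

# Option 1: left merge on the columns shared by both frames.
merged = df.merge(df2, how='left')

# Option 2: left join against df2 indexed by the same two columns.
joined = df.join(df2.set_index(cols), on=cols)

# Compare the CATEGORY columns; NaN is replaced first because NaN != NaN.
same = (
    merged['CATEGORY'].fillna('').tolist()
    == joined['CATEGORY'].fillna('').tolist()
)
print(same)
print(merged)
```

Every row of df is kept, and CATEGORY is NaN wherever the (AUTHOR_NAME, title) pair has no match in df2.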

Answer 2

Answered by Nickil Maveli

APPROACH 1:

You could use concat instead and drop the duplicated values present in the Index and AUTHOR_NAME columns combined. After that, use isin to check membership:

df_concat = pd.concat([df2, df]).reset_index().drop_duplicates(['Index', 'AUTHOR_NAME'])
df_concat.set_index('Index', inplace=True)
df_concat[df_concat.index.isin(df.index)]

Note: The column Index is assumed to be set as the index column for both DataFrames.

APPROACH 2:

Use join after setting the index columns correctly, as shown:

df2.set_index(['Index', 'AUTHOR_NAME'], inplace=True)
df.set_index(['Index', 'AUTHOR_NAME'], inplace=True)

df.join(df2).reset_index()
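
For reference, Approach 2 can be sketched on the question's example data as follows. It assumes Index starts out as an ordinary column in both frames and uses `set_index` without `inplace=True`, so the original frames are left unchanged; the `keys` and `result` names are illustrative.

```python
import pandas as pd

keys = ['Index', 'AUTHOR_NAME']

df2 = pd.DataFrame({
    'Index': ['Pub1', 'Pub2', 'Pub3', 'Pub1', 'Pub3', 'Pub2'],
    'AUTHOR_NAME': ['author1', 'author1', 'author1', 'author2', 'author2', 'author4'],
    'CATEGORY': ['main', 'main', 'main', 'sub', 'sub', 'sub'],
})
df = pd.DataFrame({
    'Index': ['Pub1', 'Pub2', 'Pub1', 'Pub1', 'Pub2'],
    'AUTHOR_NAME': ['author1', 'author1', 'author2', 'author3', 'author4'],
})

# join defaults to a left join on the index, here the two-level
# (Index, AUTHOR_NAME) MultiIndex shared by both frames.
result = df.set_index(keys).join(df2.set_index(keys)).reset_index()
print(result)
```

Rows of df2 that have no partner in df (the Pub3 rows) are dropped, and the unmatched author3 row keeps NaN in CATEGORY.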

Answer 3

Answered by kiltannen

While the other answers here give very good and elegant solutions to the question, I have found a resource that both answers it in an extremely elegant fashion and gives a clear, straightforward set of examples of how to join/merge dataframes, effectively teaching LEFT, RIGHT, INNER and OUTER joins.

Join And Merge Pandas Dataframe

I honestly feel that anyone else looking into this topic will also want to examine his examples.

Answer 4

Answered by Bhagabat Behera

You may try the following. It merges the two datasets on the specified key.

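# CATEGORY exists only in df2, so it cannot serve as the merge key; the key has to be
# the columns shared by both frames (here the index, named Index, and AUTHOR_NAME).
expected_result = pd.merge(df.reset_index(), df2.reset_index(), on=['Index', 'AUTHOR_NAME'], how='left')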

Answer 5

Answered by NikoTumi

Try

df = df.combine_first(df2)
