Python: combine two pandas DataFrames (join on a common column)

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/18792918/

Date: 2020-08-19 11:51:44  Source: igfitidea

Combine two pandas Data Frames (join on a common column)

python pandas dataframe merge left-join

Asked by anonuser0428

I have 2 dataframes:

restaurant_ids_dataframe

Data columns (total 13 columns):
business_id      4503  non-null values
categories       4503  non-null values
city             4503  non-null values
full_address     4503  non-null values
latitude         4503  non-null values
longitude        4503  non-null values
name             4503  non-null values
neighborhoods    4503  non-null values
open             4503  non-null values
review_count     4503  non-null values
stars            4503  non-null values
state            4503  non-null values
type             4503  non-null values
dtypes: bool(1), float64(3), int64(1), object(8)

and

restaurant_review_frame

Int64Index: 158430 entries, 0 to 229905
Data columns (total 8 columns):
business_id    158430  non-null values
date           158430  non-null values
review_id      158430  non-null values
stars          158430  non-null values
text           158430  non-null values
type           158430  non-null values
user_id        158430  non-null values
votes          158430  non-null values
dtypes: int64(1), object(7)

I would like to join these two DataFrames to make them into a single dataframe using the DataFrame.join() command in pandas.

I have tried the following line of code:

# The following line attempts a left join of restaurant_review_frame and restaurant_ids_dataframe on the column 'business_id'
restaurant_review_frame.join(other=restaurant_ids_dataframe,on='business_id',how='left')

But when I try this I get the following error:

Exception: columns overlap: Index([business_id, stars, type], dtype=object)

I am very new to pandas and have no clue what I am doing wrong as far as executing the join statement is concerned.

Any help would be much appreciated.

Accepted answer by mlimb

You can use merge to combine two dataframes into one:

import pandas as pd
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer')

where on specifies a field name that exists in both dataframes to join on, and how defines whether it is an inner/outer/left/right join, with outer using the 'union of keys from both frames (SQL: full outer join).' Since you have a 'stars' column in both dataframes (and a 'type' column as well), this by default will create two columns, stars_x and stars_y, in the combined dataframe. As @DanAllan mentioned for the join method, you can modify the suffixes for merge by passing them as a kwarg. The default is suffixes=('_x', '_y'). If you wanted something like stars_restaurant_id and stars_restaurant_review, you can do:

pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer', suffixes=('_restaurant_id', '_restaurant_review'))

The parameters are explained in detail in this link.
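
As a rough illustration of the suffix and how behaviour described above, here is a minimal sketch on made-up toy data. Only the column names mirror the question; the values and the variable names default, renamed and inner are assumptions for the example.

```python
import pandas as pd

# Hypothetical toy stand-ins for the two frames in the question.
restaurant_ids_dataframe = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "name": ["Cafe A", "Diner B", "Grill C"],
    "stars": [4.0, 3.5, 5.0],
})
restaurant_review_frame = pd.DataFrame({
    "business_id": ["a1", "a1", "b2"],
    "stars": [5, 3, 4],
    "text": ["great", "ok", "good"],
})

# Default suffixes: the overlapping 'stars' column becomes stars_x / stars_y.
default = pd.merge(restaurant_ids_dataframe, restaurant_review_frame,
                   on="business_id", how="outer")
print(default.columns.tolist())
# ['business_id', 'name', 'stars_x', 'stars_y', 'text']

# Custom suffixes give the overlapping columns more descriptive names.
renamed = pd.merge(restaurant_ids_dataframe, restaurant_review_frame,
                   on="business_id", how="outer",
                   suffixes=("_restaurant_id", "_restaurant_review"))
print(renamed.columns.tolist())

# 'outer' keeps restaurant c3 even though it has no reviews; 'inner' drops it.
inner = pd.merge(restaurant_ids_dataframe, restaurant_review_frame,
                 on="business_id", how="inner")
print(len(default), len(inner))  # 4 3
```

In the outer result, the review columns for the unmatched restaurant are filled with NaN.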

Answer by Dan Allan

Joining fails if the DataFrames have some column names in common. The simplest way around it is to include an lsuffix or rsuffix keyword like so:

restaurant_review_frame.join(restaurant_ids_dataframe, on='business_id', how='left', lsuffix="_review")

This way, the columns have distinct names. The documentation addresses this very problem.

Or, you could get around this by simply deleting the offending columns before you join. If, for example, the stars in restaurant_ids_dataframe are redundant to the stars in restaurant_review_frame, you could del restaurant_ids_dataframe['stars'].
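
One detail worth checking when using join this way: DataFrame.join matches the caller's on column against the index of the other frame, not against a column of the same name. The sketch below, on hypothetical toy data, assumes restaurant_ids_dataframe still has a default integer index and therefore indexes it by business_id first. The names ids_by_business, joined and joined_no_dup are made up for the example.

```python
import pandas as pd

# Hypothetical toy stand-ins for the two frames in the question.
restaurant_ids_dataframe = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "name": ["Cafe A", "Diner B"],
    "stars": [4.0, 3.5],
})
restaurant_review_frame = pd.DataFrame({
    "business_id": ["a1", "a1", "b2"],
    "stars": [5, 3, 4],
    "text": ["great", "ok", "good"],
})

# join() looks up the left 'business_id' column in the right frame's index,
# so make business_id the index of the restaurant frame.
ids_by_business = restaurant_ids_dataframe.set_index("business_id")

# Option 1: keep both 'stars' columns, disambiguated by suffixes.
joined = restaurant_review_frame.join(
    ids_by_business, on="business_id", how="left",
    lsuffix="_review", rsuffix="_restaurant",
)
print(joined.columns.tolist())

# Option 2: drop the overlapping column from one side before joining.
joined_no_dup = restaurant_review_frame.join(
    ids_by_business.drop(columns=["stars"]), on="business_id", how="left",
)
print(joined_no_dup.columns.tolist())
```

Using drop(columns=...) returns a new frame and leaves restaurant_ids_dataframe untouched, whereas del modifies it in place.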

Answer by Firas

In case anyone needs to try and merge two dataframes together on the index (instead of another column), this also works!

T1 and T2 are dataframes that have the same indices

import pandas as pd
T1 = pd.merge(T1, T2, on=T1.index, how='outer')

P.S. I had to use merge because append would fill NaNs in unnecessarily.
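
For merging on the index, pandas also offers the left_index and right_index parameters of merge, which may be a more explicit way to express the same idea. A minimal sketch, assuming two toy frames T1 and T2 that share the same index labels (the data and the names merged and same are made up for the example):

```python
import pandas as pd

# Hypothetical frames sharing the same index labels.
T1 = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
T2 = pd.DataFrame({"b": [10, 20, 30]}, index=["x", "y", "z"])

# Align rows on the index of both frames rather than on a column.
merged = pd.merge(T1, T2, left_index=True, right_index=True, how="outer")
print(merged)

# DataFrame.join aligns on the index by default, so this is equivalent here.
same = T1.join(T2, how="outer")
print(merged.equals(same))
```

Both forms keep the shared index as the index of the result.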
