Merging multiple DataFrames in Pandas

Question

Asked by PEBKAC

This might be considered a duplicate of a thorough explanation of the various approaches; however, I can't find a solution to my problem there because I have a larger number of DataFrames.

I have multiple DataFrames (more than 10), each differing in one column, VARX. This is just a quick and oversimplified example:

import pandas as pd

df1 = pd.DataFrame({'depth': [0.500000, 0.600000, 1.300000],
       'VAR1': [38.196202, 38.198002, 38.200001],
       'profile': ['profile_1', 'profile_1','profile_1']})

df2 = pd.DataFrame({'depth': [0.600000, 1.100000, 1.200000],
       'VAR2': [0.20440, 0.20442, 0.20446],
       'profile': ['profile_1', 'profile_1','profile_1']})

df3 = pd.DataFrame({'depth': [1.200000, 1.300000, 1.400000],
       'VAR3': [15.1880, 15.1820, 15.1820],
       'profile': ['profile_1', 'profile_1','profile_1']})

Each df has the same or different depths for the same profiles, so I need to create a new DataFrame that merges all of the separate ones. The key columns for the operation are depth and profile, and the result should contain all depth values that appear for each profile.

The VARX value should therefore be NaN wherever there is no measurement of that variable at that depth for that profile.

The result should thus be a new, compact DataFrame with all VARX columns added alongside the depth and profile columns, something like this:

name_profile    depth   VAR1        VAR2        VAR3
profile_1   0.500000    38.196202   NaN         NaN
profile_1   0.600000    38.198002   0.20440     NaN
profile_1   1.100000    NaN         0.20442     NaN
profile_1   1.200000    NaN         0.20446     15.1880
profile_1   1.300000    38.200001   NaN         15.1820
profile_1   1.400000    NaN         NaN         15.1820

Note that the actual number of profiles is much, much bigger.

Any ideas?

Answer 1

Accepted answer by Parfait

Consider setting the index on each DataFrame and then running a horizontal merge with pd.concat:

dfs = [df.set_index(['profile', 'depth']) for df in [df1, df2, df3]]

print(pd.concat(dfs, axis=1).reset_index())
#      profile  depth       VAR1     VAR2    VAR3
# 0  profile_1    0.5  38.196202      NaN     NaN
# 1  profile_1    0.6  38.198002  0.20440     NaN
# 2  profile_1    1.1        NaN  0.20442     NaN
# 3  profile_1    1.2        NaN  0.20446  15.188
# 4  profile_1    1.3  38.200001      NaN  15.182
# 5  profile_1    1.4        NaN      NaN  15.182
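
The index-based concat above aligns rows on the (profile, depth) pairs, and that alignment is only well defined when each pair occurs at most once per frame. A minimal sketch of a guarded version follows, assuming the question's sample data; the helper name concat_on_keys is hypothetical and not part of pandas, and the explicit sort_index is added only to make the row order predictable.

```python
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

KEYS = ['profile', 'depth']


def concat_on_keys(frames, keys=KEYS):
    """Hypothetical helper: align frames on `keys` and join them side by side."""
    # Refuse frames in which a key pair is repeated, since the
    # column-wise alignment would then be ambiguous.
    for i, frame in enumerate(frames):
        if frame.duplicated(subset=keys).any():
            raise ValueError(f"frame {i} has duplicate {keys} rows")
    indexed = [frame.set_index(keys) for frame in frames]
    # Outer join on the index (the default), then sort by profile and depth.
    return pd.concat(indexed, axis=1).sort_index().reset_index()


result = concat_on_keys([df1, df2, df3])
print(result)
```

With more than ten frames, the same call works on any list of DataFrames that share the two key columns.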

Answer 2

Answered by yatu

A simple way is to combine functools.partial and functools.reduce.

First, partial lets you "freeze" some of a function's positional and/or keyword arguments, producing a new callable with a simplified signature. Then reduce applies that partial object cumulatively to the items of an iterable (here, the list of DataFrames):

from functools import partial, reduce

dfs = [df1, df2, df3]
merge = partial(pd.merge, on=['depth', 'profile'], how='outer')
reduce(merge, dfs)

   depth       VAR1    profile     VAR2    VAR3
0    0.5  38.196202  profile_1      NaN     NaN
1    0.6  38.198002  profile_1  0.20440     NaN
2    1.3  38.200001  profile_1      NaN  15.182
3    1.1        NaN  profile_1  0.20442     NaN
4    1.2        NaN  profile_1  0.20446  15.188
5    1.4        NaN  profile_1      NaN  15.182
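
To make the cumulative application concrete, the sketch below compares reduce against the nested calls it stands for, using the question's sample data. The final sort is an addition of mine, since the row order of an outer merge can vary between pandas versions.

```python
from functools import partial, reduce

import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# pd.merge with the join keys and join type fixed in advance.
merge = partial(pd.merge, on=['depth', 'profile'], how='outer')

# reduce folds the list from the left ...
merged = reduce(merge, [df1, df2, df3])
# ... which is the same as writing the nested calls by hand.
nested = merge(merge(df1, df2), df3)

# Sort by the keys so the comparison does not depend on merge row order.
merged = merged.sort_values(['profile', 'depth']).reset_index(drop=True)
nested = nested.sort_values(['profile', 'depth']).reset_index(drop=True)
print(merged)
```

Because reduce only needs a list, adding an eleventh or twelfth frame means appending it to that list and nothing else.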

Answer 3

Answered by BlivetWidget

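I would use append. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions, pd.concat([df1, df2, df3]) is the equivalent call.)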

>>> df1.append(df2).append(df3).sort_values('depth')

        VAR1     VAR2    VAR3  depth    profile
0  38.196202      NaN     NaN    0.5  profile_1
1  38.198002      NaN     NaN    0.6  profile_1
0        NaN  0.20440     NaN    0.6  profile_1
1        NaN  0.20442     NaN    1.1  profile_1
2        NaN  0.20446     NaN    1.2  profile_1
0        NaN      NaN  15.188    1.2  profile_1
2  38.200001      NaN     NaN    1.3  profile_1
1        NaN      NaN  15.182    1.3  profile_1
2        NaN      NaN  15.182    1.4  profile_1

Obviously if you have a lot of dataframes, just make a list and loop through them.
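
Stacking the frames leaves one row per original measurement, so the same depth can appear several times. One possible way to collapse those rows, sketched here with pd.concat in place of append and a groupby step that is my addition rather than part of this answer, is to take the first non-missing value per column within each (profile, depth) group.

```python
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# Stack the frames vertically; columns missing from a frame become NaN.
stacked = pd.concat([df1, df2, df3], ignore_index=True)

# GroupBy.first() keeps the first non-null value of each column per group,
# which folds the partially filled rows into one row per (profile, depth).
collapsed = stacked.groupby(['profile', 'depth'], as_index=False).first()
print(collapsed)
```

This assumes each variable is measured at most once per (profile, depth); if a variable had two different values for the same pair, first() would silently keep only one of them.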

Answer 4

Answered by SEpapoulis

Why not concatenate all the Data Frames, melt, then reform them using your ids? There might be a more efficient way to do this, but this works.

df = pd.melt(pd.concat([df1, df2, df3]), id_vars=['profile', 'depth'])
df_pivot = df.pivot_table(index=['profile', 'depth'], columns='variable', values='value')

where df_pivot will be:

variable              VAR1     VAR2    VAR3
profile   depth                            
profile_1 0.5    38.196202      NaN     NaN
          0.6    38.198002  0.20440     NaN
          1.1          NaN  0.20442     NaN
          1.2          NaN  0.20446  15.188
          1.3    38.200001      NaN  15.182
          1.4          NaN      NaN  15.182
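
The pivoted result keeps profile and depth in a MultiIndex and names the column axis "variable". If a flat table like the one in the question is wanted, a sketch along these lines may help, assuming the same sample data. Note that pivot_table aggregates with the mean by default, so repeated measurements for one (profile, depth, variable) would be averaged rather than reported.

```python
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# Long format: one row per (profile, depth, variable, value).
long_df = pd.concat([df1, df2, df3]).melt(id_vars=['profile', 'depth'])

# Wide format again, one column per variable (mean is the default aggregation).
wide = long_df.pivot_table(index=['profile', 'depth'],
                           columns='variable', values='value')

# Move the keys back into ordinary columns and drop the "variable" axis label.
flat = wide.reset_index().rename_axis(columns=None)
print(flat)
```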

Answer 5

Answered by heena bawa

You can also use:

dfs = [df1, df2, df3]
# Outer-merge the first two frames, then fold in the rest one at a time.
df = pd.merge(dfs[0], dfs[1], on=['depth', 'profile'], how='outer')
for d in dfs[2:]:
    df = pd.merge(df, d, on=['depth', 'profile'], how='outer')

   depth       VAR1    profile     VAR2    VAR3
0    0.5  38.196202  profile_1      NaN     NaN
1    0.6  38.198002  profile_1  0.20440     NaN
2    1.3  38.200001  profile_1      NaN  15.182
3    1.1        NaN  profile_1  0.20442     NaN
4    1.2        NaN  profile_1  0.20446  15.188
5    1.4        NaN  profile_1      NaN  15.182
