Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/34338831/

Date: 2020-09-14 00:23:39  Source: igfitidea

Pandas: merge multiple dataframes and control column names?

python pandas

Asked by Richard

I would like to merge nine Pandas dataframes together into a single dataframe, doing a join on two columns, controlling the column names. Is this possible?

I have nine datasets. All of them have the following columns:

org, name, items, spend

I want to join them into a single dataframe with the following columns:

org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...

I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:

ad = pd.DataFrame.merge(df_presents, df_trees,
                        on=['practice', 'name'],
                        suffixes=['_presents', '_trees'])

This works great; running print list(ad.columns.values) shows me the following columns:

[u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees'...]

But how can I do this for nine dataframes? merge only seems to accept two at a time, and if I do it sequentially, my column names are going to end up very messy.
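For illustration, here is a small sketch (with hypothetical toy frames, not the real data) of what "messy" means: chaining pd.merge without renaming first falls back to the default _x/_y suffixes, and a third frame's columns arrive with no suffix at all.

```python
import pandas as pd

# Hypothetical toy frames sharing the same column names.
df1 = pd.DataFrame({'org': [1], 'name': [1], 'items': [1], 'spend': [1]})
df2 = pd.DataFrame({'org': [1], 'name': [1], 'items': [2], 'spend': [2]})
df3 = pd.DataFrame({'org': [1], 'name': [1], 'items': [3], 'spend': [3]})

# The first merge applies the default suffixes ('_x', '_y'); the third
# frame's items/spend no longer collide, so they stay unsuffixed.
messy = df1.merge(df2, on=['org', 'name']).merge(df3, on=['org', 'name'])
print(list(messy.columns))
```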

Answered by unutbu

You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:

result = functools.reduce(merge, dfs)

This is equivalent to

result = dfs[0]
for df in dfs[1:]:
    result = merge(result, df)

To pass the on=['org', 'name'] argument, you could use functools.partial to define the merge function:

merge = functools.partial(pd.merge, on=['org', 'name'])

Since specifying the suffixes parameter in functools.partial would only allow one fixed choice of suffix, and since here we need a different suffix for each pd.merge call, I think it would be easiest to prepare the DataFrames' column names before calling pd.merge:

for i, df in enumerate(dfs, start=1):
    df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')}, 
              inplace=True)


For example,

import pandas as pd
import numpy as np
import functools
np.random.seed(2015)

N = 50
dfs = [pd.DataFrame(np.random.randint(5, size=(N,4)), 
                    columns=['org', 'name', 'items', 'spend']) for i in range(9)]
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')}, 
              inplace=True)
merge = functools.partial(pd.merge, on=['org', 'name'])
result = functools.reduce(merge, dfs)
print(result.head())

yields

   org  name  items_df1  spend_df1  items_df2  spend_df2  items_df3  \
0    2     4          4          2          3          0          1   
1    2     4          4          2          3          0          1   
2    2     4          4          2          3          0          1   
3    2     4          4          2          3          0          1   
4    2     4          4          2          3          0          1   

   spend_df3  items_df4  spend_df4  items_df5  spend_df5  items_df6  \
0          3          1          0          1          0          4   
1          3          1          0          1          0          4   
2          3          1          0          1          0          4   
3          3          1          0          1          0          4   
4          3          1          0          1          0          4   

   spend_df6  items_df7  spend_df7  items_df8  spend_df8  items_df9  spend_df9  
0          3          4          1          3          0          1          2  
1          3          4          1          3          0          0          3  
2          3          4          1          3          0          0          0  
3          3          3          1          3          0          1          2  
4          3          3          1          3          0          0          3  

Answered by Zachary Cross

Would doing a big pd.concat() and then renaming all the columns work for you? Something like:

desired_columns = ['items', 'spend']
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)


new_columns = ['org', 'name']
for i in range(num_dataframes):
    new_columns.extend(['spend_df%i' % i, 'items_df%i' % i])

big_df.columns = new_columns

This should give you columns like:

org, name, spend_df0, items_df0, spend_df1, items_df1, ..., spend_df8, items_df8

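As a runnable sketch of this approach with toy data: note that pd.concat with axis=1 aligns rows by index, not by the org/name keys, so this assumes all frames share the same row order; also, the renamed columns must follow the positional order after concatenation (items before spend, since that is the column order in each frame).

```python
import pandas as pd
import numpy as np

np.random.seed(0)
# Toy frames standing in for the real data; all share the same
# default index, which is what axis=1 concatenation aligns on.
dfs = [pd.DataFrame(np.random.randint(5, size=(4, 4)),
                    columns=['org', 'name', 'items', 'spend'])
       for _ in range(3)]

desired_columns = ['items', 'spend']
big_df = pd.concat([dfs[0]] + [df[desired_columns] for df in dfs[1:]],
                   axis=1)

# Rename positionally: org, name from the first frame, then
# items/spend pairs from each frame in turn.
new_columns = ['org', 'name']
for i in range(len(dfs)):
    new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])
big_df.columns = new_columns
print(list(big_df.columns))
```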
Answered by Cmdt.Ed

I've wanted this as well at times but been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it):

  1. Create an empty dictionary, merge_dict.
  2. Loop through the index you want for each of your data frames and add the desired values to the dictionary with the index as the key.
  3. Generate a new index as sorted(merge_dict).
  4. Generate a new list of data for each column by looping through merge_dict.items().
  5. Create a new data frame with index=sorted(merge_dict) and the columns created in the previous step.

Basically, this is somewhat like a hash join in SQL. Seems like the most efficient way I can think of and shouldn't take too long to code up.
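The steps above can be sketched roughly as follows, with made-up toy frames and assuming each (org, name) pair appears at most once per frame:

```python
import pandas as pd

# Made-up toy frames; each (org, name) pair is unique within a frame.
dfs = [pd.DataFrame({'org': [1, 2], 'name': ['a', 'b'],
                     'items': [i, i + 1], 'spend': [10 * i, 10 * i + 1]})
       for i in range(3)]

# Steps 1-2: build a dictionary keyed by the join keys, accumulating
# suffixed column values from every frame.
merge_dict = {}
for i, df in enumerate(dfs, start=1):
    for row in df.itertuples(index=False):
        key = (row.org, row.name)
        merge_dict.setdefault(key, {})
        merge_dict[key]['items_df%d' % i] = row.items
        merge_dict[key]['spend_df%d' % i] = row.spend

# Steps 3-5: sort the keys and build the final frame from the dictionary.
index = sorted(merge_dict)
result = pd.DataFrame([merge_dict[k] for k in index],
                      index=pd.MultiIndex.from_tuples(index,
                                                      names=['org', 'name']))
print(result)
```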

Good luck.
