Original source: http://stackoverflow.com/questions/34338831/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Pandas: merge multiple dataframes and control column names?
Asked by Richard
I would like to merge nine Pandas dataframes together into a single dataframe, doing a join on two columns, controlling the column names. Is this possible?
I have nine datasets. All of them have the following columns:
org, name, items, spend
I want to join them into a single dataframe with the following columns:
org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...
I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:
aggregate_data = pd.DataFrame.merge(df_presents, df_trees,
                                    on=['org', 'name'],
                                    suffixes=['_presents', '_trees'])
This works great; doing print list(aggregate_data.columns.values) shows me the following columns:
[u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees'...]
But how can I do this for nine dataframes? merge only seems to accept two at a time, and if I do it sequentially, my column names are going to end up very messy.
Answered by unutbu
You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:
result = functools.reduce(merge, dfs)
This is equivalent to
result = dfs[0]
for df in dfs[1:]:
    result = merge(result, df)
To pass the on=['org', 'name'] argument, you could use functools.partial to define the merge function:
merge = functools.partial(pd.merge, on=['org', 'name'])
Since specifying the suffixes parameter in functools.partial would only allow one fixed choice of suffix, and since here we need a different suffix for each pd.merge call, I think it would be easiest to prepare the DataFrames' column names before calling pd.merge:
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
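The in-place loop above renames the caller's DataFrames. If you would rather leave them untouched, a minimal sketch along the same lines follows. The helper with_suffix, the KEYS constant and the two tiny frames are made up for illustration and are not part of the original answer.

```python
import functools

import pandas as pd

KEYS = ['org', 'name']  # join columns, as in the question


def with_suffix(df, suffix, keys=KEYS):
    """Return a renamed copy of df; every non-key column gets '_<suffix>'."""
    return df.rename(columns={col: '{}_{}'.format(col, suffix)
                              for col in df.columns if col not in keys})


# Two tiny hypothetical frames standing in for the nine real ones.
df_a = pd.DataFrame({'org': ['A', 'B'], 'name': ['x', 'y'],
                     'items': [1, 2], 'spend': [10, 20]})
df_b = pd.DataFrame({'org': ['A', 'B'], 'name': ['x', 'y'],
                     'items': [3, 4], 'spend': [30, 40]})
dfs = [df_a, df_b]

# rename() without inplace=True returns new frames, so dfs is not modified.
renamed = [with_suffix(df, 'df{}'.format(i)) for i, df in enumerate(dfs, start=1)]
merge = functools.partial(pd.merge, on=KEYS)
result = functools.reduce(merge, renamed)
print(result)
```

Because the suffix is applied to every non-key column, this version does not need to know the value column names in advance.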
For example,
import pandas as pd
import numpy as np
import functools
np.random.seed(2015)
N = 50
dfs = [pd.DataFrame(np.random.randint(5, size=(N,4)),
                    columns=['org', 'name', 'items', 'spend']) for i in range(9)]
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
merge = functools.partial(pd.merge, on=['org', 'name'])
result = functools.reduce(merge, dfs)
print(result.head())
yields
   org  name  items_df1  spend_df1  items_df2  spend_df2  items_df3  \
0    2     4          4          2          3          0          1
1    2     4          4          2          3          0          1
2    2     4          4          2          3          0          1
3    2     4          4          2          3          0          1
4    2     4          4          2          3          0          1

   spend_df3  items_df4  spend_df4  items_df5  spend_df5  items_df6  \
0          3          1          0          1          0          4
1          3          1          0          1          0          4
2          3          1          0          1          0          4
3          3          1          0          1          0          4
4          3          1          0          1          0          4

   spend_df6  items_df7  spend_df7  items_df8  spend_df8  items_df9  spend_df9
0          3          4          1          3          0          1          2
1          3          4          1          3          0          0          3
2          3          4          1          3          0          0          0
3          3          3          1          3          0          1          2
4          3          3          1          3          0          0          3
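In the random example above, (org, name) pairs repeat within each frame, so every merge pairs up all matching rows and the result grows quickly. pd.merge also defaults to an inner join, which drops key pairs missing from any one frame. If each pair should appear at most once per dataset, the how and validate parameters of pd.merge may help. The sketch below is one way to use them, not part of the original answer; the three small frames are hypothetical.

```python
import functools

import pandas as pd

# Hypothetical frames; value columns already carry their suffix.
df1 = pd.DataFrame({'org': ['A', 'B'], 'name': ['x', 'y'], 'items_df1': [1, 2]})
df2 = pd.DataFrame({'org': ['A', 'C'], 'name': ['x', 'z'], 'items_df2': [3, 4]})
df3 = pd.DataFrame({'org': ['A', 'A'], 'name': ['x', 'x'], 'items_df3': [5, 6]})

# how='outer' keeps key pairs missing from some frames (filled with NaN);
# validate='one_to_one' raises MergeError if a key pair repeats on either side.
merge = functools.partial(pd.merge, on=['org', 'name'],
                          how='outer', validate='one_to_one')

result = functools.reduce(merge, [df1, df2])
print(result)

# df3 repeats the pair ('A', 'x'), so the checked merge refuses it.
try:
    functools.reduce(merge, [df1, df2, df3])
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)
```

Whether an outer or inner join is appropriate depends on whether every organisation is expected in all nine datasets.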
Answered by Zachary Cross
Would doing a big pd.concat() and then renaming all the columns work for you? Something like:
desired_columns = ['items', 'spend']
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)
new_columns = ['org', 'name']
for i in range(num_dataframes):
    # same order as desired_columns, so each label matches its data
    new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])
big_df.columns = new_columns
This should give you columns like:
org, name, items_df0, spend_df0, items_df1, spend_df1, ..., items_df8, spend_df8
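Note that pd.concat with axis=1 aligns rows by index, not by the org and name values, so the snippet above relies on every frame listing the same rows in the same order. One possible variation, assuming each (org, name) pair is unique within a frame, is to index every frame by those two columns and let the keys argument of pd.concat label the pieces. The frames and the df0, df1 labels below are made up for illustration.

```python
import pandas as pd

keys = ['org', 'name']

# Hypothetical frames standing in for df1 ... dfN; note the different row order.
df1 = pd.DataFrame({'org': ['A', 'B'], 'name': ['x', 'y'],
                    'items': [1, 2], 'spend': [10, 20]})
df2 = pd.DataFrame({'org': ['B', 'A'], 'name': ['y', 'x'],
                    'items': [4, 3], 'spend': [40, 30]})
dfs = [df1, df2]

# Index each frame by the join columns so concat aligns rows by key.
indexed = [df.set_index(keys) for df in dfs]
labels = ['df{}'.format(i) for i in range(len(dfs))]
big_df = pd.concat(indexed, axis=1, keys=labels)

# Columns are now (label, column) pairs; flatten them to 'items_df0' etc.
big_df.columns = ['{}_{}'.format(col, label) for label, col in big_df.columns]
big_df = big_df.reset_index()
print(big_df)
```

With this layout the row order of the individual frames no longer matters, and key pairs missing from a frame show up as NaN rather than being misaligned.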
Answered by Cmdt.Ed
I've wanted this as well at times but have been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it):
- Create an empty dictionary, merge_dict.
- Loop through the index you want for each of your data frames and add the desired values to the dictionary with the index as the key.
- Generate a new index as sorted(merge_dict).
- Generate a new list of data for each column by looping through merge_dict.items().
- Create a new data frame with index=sorted(merge_dict) and columns created in the previous step.
Basically, this is somewhat like a hash join in SQL. Seems like the most efficient way I can think of and shouldn't take too long to code up.
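The steps listed above might be sketched roughly as follows. This is only one reading of the outline: it builds one dictionary of values per key rather than one list per column, the sample frames are hypothetical, and the items_dfN / spend_dfN names are borrowed from the question.

```python
import pandas as pd

# Hypothetical frames standing in for the real data.
df1 = pd.DataFrame({'org': ['A', 'B'], 'name': ['x', 'y'],
                    'items': [1, 2], 'spend': [10, 20]})
df2 = pd.DataFrame({'org': ['A', 'C'], 'name': ['x', 'z'],
                    'items': [3, 4], 'spend': [30, 40]})
dfs = [df1, df2]

# Steps 1-2: map each (org, name) key to its {new_column: value} entries.
# If a key repeats within one frame, the last row wins.
merge_dict = {}
for i, df in enumerate(dfs, start=1):
    for org, name, items, spend in zip(df['org'], df['name'],
                                       df['items'], df['spend']):
        values = merge_dict.setdefault((org, name), {})
        values['items_df{}'.format(i)] = items
        values['spend_df{}'.format(i)] = spend

# Step 3: the new index is the sorted list of keys.
new_index = sorted(merge_dict)

# Steps 4-5: one dict of values per key, then build the frame.
# Keys absent from a frame get NaN in that frame's columns.
index = pd.MultiIndex.from_tuples(new_index, names=['org', 'name'])
result = pd.DataFrame([merge_dict[key] for key in new_index], index=index)
print(result)
```

The result behaves like an outer join on (org, name); calling result.reset_index() would turn the keys back into ordinary columns.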
Good luck.