Original source: http://stackoverflow.com/questions/44327999/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python: pandas merge multiple dataframes
Asked by Vasco Ferreira
I have different dataframes and need to merge them together based on the date column. If I only had two dataframes, I could use df1.merge(df2, on='date'). To do it with three dataframes, I use df1.merge(df2.merge(df3, on='date'), on='date'). However, it becomes really complex and unreadable with more dataframes.
All dataframes have one column in common, date, but they don't have the same number of rows or columns, and I only need the rows whose date is common to every dataframe.
So, I'm trying to write a recursive function that returns a dataframe with all the data, but it didn't work. How should I merge multiple dataframes then?
I tried different ways and got errors like out of range, keyerror 0/1/2/3 and can not merge DataFrame with instance of type <class 'NoneType'>.
This is the script I wrote:
dfs = [df1, df2, df3] # list of dataframes
def mergefiles(dfs, countfiles, i=0):
    if i == (countfiles - 2): # it gets to the second to last and merges it with the last
        return
    dfm = dfs[i].merge(mergefiles(dfs[i+1], countfiles, i=i+1), on='date')
    return dfm
print(mergefiles(dfs, len(dfs)))
An example: df_1:
May 19, 2017;1,200.00;0.1%
May 18, 2017;1,100.00;0.1%
May 17, 2017;1,000.00;0.1%
May 15, 2017;1,901.00;0.1%
df_2:
May 20, 2017;2,200.00;1000000;0.2%
May 18, 2017;2,100.00;1590000;0.2%
May 16, 2017;2,000.00;1230000;0.2%
May 15, 2017;2,902.00;1000000;0.2%
df_3:
May 21, 2017;3,200.00;2000000;0.3%
May 17, 2017;3,100.00;2590000;0.3%
May 16, 2017;3,000.00;2230000;0.3%
May 15, 2017;3,903.00;2000000;0.3%
Expected merge result:
May 15, 2017; 1,901.00;0.1%; 2,902.00;1000000;0.2%; 3,903.00;2000000;0.3%
Answered by everestial007
Below is the cleanest, most comprehensible way of merging multiple dataframes if complex queries aren't involved.
Simply merge with DATE as the index, using the OUTER method (to get all the data).
import pandas as pd
from functools import reduce
df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')
Now, load all the files you have as dataframes into a list. Then merge them using the merge or reduce function.
# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]
Note: you can add as many dataframes as you like to the list above. This is the good part about this method: no complex queries are involved.
To keep the values that belong to the same date, you need to merge on DATE:
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames)
# if you want to fill the values that don't exist in the lines of merged dataframe simply fill with required strings as
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames).fillna('void')
- Now the output will have the values from the same date on the same line.
- You can fill the missing data from different frames for different columns using fillna().
Then write the merged data to a CSV file if desired.
df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)
This should give you
DATE VALUE1 VALUE2 VALUE3 ....
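As a rough illustration of the reduce-and-merge idea above, here is a minimal sketch on frames modelled on the question's sample data. The column names (date, price_1, price_2, price_3) and the helper merge_all are hypothetical, chosen only for this example. It contrasts how='outer', which the answer uses, with how='inner', which appears to be closer to what the question asks for (only dates present in every dataframe).

```python
from functools import reduce

import pandas as pd

# Hypothetical frames modelled on the question's sample data.
df1 = pd.DataFrame({
    "date": ["May 19, 2017", "May 18, 2017", "May 17, 2017", "May 15, 2017"],
    "price_1": [1200.0, 1100.0, 1000.0, 1901.0],
})
df2 = pd.DataFrame({
    "date": ["May 20, 2017", "May 18, 2017", "May 16, 2017", "May 15, 2017"],
    "price_2": [2200.0, 2100.0, 2000.0, 2902.0],
})
df3 = pd.DataFrame({
    "date": ["May 21, 2017", "May 17, 2017", "May 16, 2017", "May 15, 2017"],
    "price_3": [3200.0, 3100.0, 3000.0, 3903.0],
})
data_frames = [df1, df2, df3]


def merge_all(frames, how):
    """Fold pd.merge over the list, joining on the shared 'date' column."""
    return reduce(
        lambda left, right: pd.merge(left, right, on="date", how=how),
        frames,
    )


inner = merge_all(data_frames, "inner")  # only dates present in every frame
outer = merge_all(data_frames, "outer")  # every date, NaN where a frame has no row

print(inner)
print(len(outer))
```

With these sample frames, only May 15, 2017 appears in all three, so the inner merge keeps a single row while the outer merge keeps every distinct date.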
Answered by dannyeuu
Looks like the data has the same columns, so you can:
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
merged_df = pd.concat([df1, df2])
Answered by Ismail Hachimi
functools.reduce and pd.concat are both good solutions, but in terms of execution time pd.concat is the best.
from functools import reduce
import pandas as pd
dfs = [df1, df2, df3, ...]
nan_value = 0
# solution 1 (fast)
result_1 = pd.concat(dfs, join='outer', axis=1).fillna(nan_value)
# solution 2
result_2 = reduce(lambda df_left, df_right: pd.merge(df_left, df_right,
                                                    left_index=True, right_index=True,
                                                    how='outer'),
                  dfs).fillna(nan_value)
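Both solutions above align rows on the index, not on a date column, so with frames like those in the question the date column would presumably need to become the index first. The following is a small sketch under that assumption; the frames left and right and their columns are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames sharing a 'date' key column.
left = pd.DataFrame({"date": ["d1", "d2"], "a": [1, 2]})
right = pd.DataFrame({"date": ["d2", "d3"], "b": [3, 4]})

# Move the key into the index so that both approaches align on it.
indexed = [df.set_index("date") for df in (left, right)]

# Solution 1: column-wise concat, outer join on the index.
by_concat = pd.concat(indexed, axis=1, join="outer").fillna(0)

# Solution 2: fold an index-based outer merge over the list.
by_merge = reduce(
    lambda df_left, df_right: pd.merge(
        df_left, df_right, left_index=True, right_index=True, how="outer"
    ),
    indexed,
).fillna(0)

print(by_concat.sort_index())
print(by_merge.sort_index())
```

Without the set_index step, pd.concat(axis=1) would pair rows by their default integer position rather than by date.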
Answered by Allen Wang
@dannyeuu's answer is correct. pd.concat naturally does a join on index columns, if you set the axis option to 1. The default is an outer join, but you can specify inner join too. Here is an example:
x = pd.DataFrame({'a': [2,4,3,4,5,2,3,4,2,5], 'b':[2,3,4,1,6,6,5,2,4,2], 'val': [1,4,4,3,6,4,3,6,5,7], 'val2': [2,4,1,6,4,2,8,6,3,9]})
x.set_index(['a','b'], inplace=True)
x.sort_index(inplace=True)
y = x.copy()
y.loc[(14,14),:] = [3,1]
y['other']=range(0,11)
y.sort_values('val', inplace=True)
z = x.copy()
z.loc[(15,15),:] = [3,4]
z['another']=range(0,22,2)
z.sort_values('val2',inplace=True)
pd.concat([x,y,z],axis=1)
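To make the difference between the default outer join and the inner join mentioned above concrete, here is a smaller sketch with a plain single-level index. The frames and their labels are invented for this example and are not the x, y and z from the answer.

```python
import pandas as pd

# Two hypothetical frames whose indexes only partly overlap.
x = pd.DataFrame({"val": [1, 2, 3]}, index=["a", "b", "c"])
y = pd.DataFrame({"other": [10, 20, 30]}, index=["b", "c", "d"])

outer = pd.concat([x, y], axis=1)                # default join='outer'
inner = pd.concat([x, y], axis=1, join="inner")  # only labels present in both

print(outer)
print(inner)
```

The outer result keeps all four labels and fills the gaps with NaN, while the inner result keeps only the labels that both frames share.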
Answered by jezrael
There are two solutions for this, but they return all columns separately:
import functools
import numpy as np
import pandas as pd
dfs = [df1, df2, df3]
# solution 1: chain inner merges on 'date'
df_final = functools.reduce(lambda left,right: pd.merge(left,right,on='date'), dfs)
print (df_final)
          date     a_x   b_x       a_y      b_y   c_x         a        b   c_y
0  May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%
# solution 2: concat on the 'date' index, prefixing each column with its frame number
k = np.arange(len(dfs)).astype(str)
df = pd.concat([x.set_index('date') for x in dfs], axis=1, join='inner', keys=k)
df.columns = df.columns.map('_'.join)
print (df)
                0_a   0_b       1_a      1_b   1_c       2_a      2_b   2_c
date
May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%
Answered by Kaibo
See the related question "pandas three-way joining multiple dataframes on columns":
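filenames = ['fn1', 'fn2', 'fn3', 'fn4']  # ... and so on
# index_col is the shared key column to join on; it is not defined in the original snippet
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])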
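The join call above can be tried without any files. The sketch below builds three small hypothetical frames in memory, already indexed by date, and compares the default left join with how='inner'. As far as I know, passing a list to join only works when joining on the index and when the frames have no overlapping column names, so the columns here are deliberately distinct.

```python
import pandas as pd

# Hypothetical frames, each indexed by the shared 'date' key.
a = pd.DataFrame({"date": ["d1", "d2", "d3"], "a": [1, 2, 3]}).set_index("date")
b = pd.DataFrame({"date": ["d2", "d3", "d4"], "b": [4, 5, 6]}).set_index("date")
c = pd.DataFrame({"date": ["d3", "d4", "d5"], "c": [7, 8, 9]}).set_index("date")

dfs = [a, b, c]

joined_left = dfs[0].join(dfs[1:])                # default how='left': keeps a's dates
joined_inner = dfs[0].join(dfs[1:], how="inner")  # only dates found in every frame

print(joined_left)
print(joined_inner)
```

For the question's requirement (rows whose date is common to every dataframe), the how='inner' variant is likely the one to use.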
Answered by zipa
If you are filtering by common date this will return it:
dfs = [df1, df2, df3]
checker = dfs[-1]
check = set(checker.loc[:, 0])
for df in dfs[:-1]:
    check = check.intersection(set(df.loc[:, 0]))
print(checker[checker.loc[:, 0].isin(check)])
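The snippet above addresses the date column by the label 0, which fits frames loaded without a header row. A variant of the same set-intersection idea for frames with a named date column might look like the sketch below; the frames and column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical frames with a named 'date' column.
df1 = pd.DataFrame({"date": ["May 18, 2017", "May 15, 2017"], "a": [1, 2]})
df2 = pd.DataFrame(
    {"date": ["May 18, 2017", "May 16, 2017", "May 15, 2017"], "b": [3, 4, 5]}
)
df3 = pd.DataFrame({"date": ["May 17, 2017", "May 15, 2017"], "c": [6, 7]})
dfs = [df1, df2, df3]

# Dates that occur in every frame.
common = set.intersection(*(set(df["date"]) for df in dfs))

# Keep only the rows with a common date, frame by frame.
filtered = [df[df["date"].isin(common)] for df in dfs]

print(common)
for frame in filtered:
    print(frame)
```

This only filters the rows; the filtered frames would still need to be merged or concatenated to end up on one line per date.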
Answered by Vasco Ferreira
Thank you for your help @jezrael, @zipa and @everestial007, both answers are what I need. If I wanted to make it recursive, this would also work as intended:
def mergefiles(dfs=[], on=''):
    """Merge a list of files based on one column"""
    if len(dfs) == 1:
        return "List only have one element."
    elif len(dfs) == 2:
        df1 = dfs[0]
        df2 = dfs[1]
        df = df1.merge(df2, on=on)
        return df
    # Merge the first and second dataframes into new dataframe
    df1 = dfs[0]
    df2 = dfs[1]
    df = dfs[0].merge(dfs[1], on=on)
    # Create new list with merged dataframe
    dfl = []
    dfl.append(df)
    # Join lists
    dfl = dfl + dfs[2:]
    dfm = mergefiles(dfl, on)
    return dfm
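For comparison, a more compact recursive variant could be sketched as follows. This is not the author's code: the name merge_recursive is hypothetical, and returning the single dataframe itself (rather than a message string) when the list has one element is a design choice assumed here so that the function always returns a dataframe.

```python
import pandas as pd


def merge_recursive(dfs, on):
    """Recursively inner-merge a non-empty list of dataframes on one column."""
    if not dfs:
        raise ValueError("need at least one dataframe")
    if len(dfs) == 1:
        return dfs[0]
    # Merge the first two frames, then recurse on the shortened list.
    return merge_recursive([dfs[0].merge(dfs[1], on=on)] + dfs[2:], on)


# Hypothetical sample frames sharing a 'date' column.
frames = [
    pd.DataFrame({"date": ["d1", "d2"], "a": [1, 2]}),
    pd.DataFrame({"date": ["d2", "d3"], "b": [3, 4]}),
    pd.DataFrame({"date": ["d2", "d4"], "c": [5, 6]}),
]

merged = merge_recursive(frames, on="date")
print(merged)
```

Each call shortens the list by one, so the recursion depth grows with the number of dataframes; for long lists the functools.reduce approach from the other answers avoids that.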