Python: pandas merge multiple dataframes

Question

Asked by Vasco Ferreira

I have different dataframes and need to merge them together based on the date column. If I only had two dataframes, I could use df1.merge(df2, on='date'). To do it with three dataframes, I use df1.merge(df2.merge(df3, on='date'), on='date'). However, it becomes really complex and unreadable with more dataframes.

All dataframes have one column in common, date, but they don't have the same number of rows or columns, and I only need the rows whose date is common to every dataframe.

So I'm trying to write a recursive function that returns a dataframe with all the data, but it didn't work. How should I merge multiple dataframes, then?

I tried different ways and got errors like out of range, keyerror 0/1/2/3 and can not merge DataFrame with instance of type <class 'NoneType'>.

This is the script I wrote:

dfs = [df1, df2, df3] # list of dataframes

def mergefiles(dfs, countfiles, i=0):
    if i == (countfiles - 2): # it gets to the second to last and merges it with the last
        return

    dfm = dfs[i].merge(mergefiles(dfs[i+1], countfiles, i=i+1), on='date')
    return dfm

print(mergefiles(dfs, len(dfs)))

An example: df_1:

May 19, 2017;1,200.00;0.1%
May 18, 2017;1,100.00;0.1%
May 17, 2017;1,000.00;0.1%
May 15, 2017;1,901.00;0.1%

df_2:

May 20, 2017;2,200.00;1000000;0.2%
May 18, 2017;2,100.00;1590000;0.2%
May 16, 2017;2,000.00;1230000;0.2%
May 15, 2017;2,902.00;1000000;0.2%

df_3:

May 21, 2017;3,200.00;2000000;0.3%
May 17, 2017;3,100.00;2590000;0.3%
May 16, 2017;3,000.00;2230000;0.3%
May 15, 2017;3,903.00;2000000;0.3%

Expected merge result:

May 15, 2017;  1,901.00;0.1%;  2,902.00;1000000;0.2%;   3,903.00;2000000;0.3%

Answer 1

Answered by everestial007

Below is a clean, comprehensible way of merging multiple dataframes when complex queries aren't involved.

Simply merge with DATE as the key, using the OUTER method (to keep all the data).

import pandas as pd
from functools import reduce

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df3 = pd.read_csv('file3.csv')

Now put all the files you have loaded as dataframes into a list, then merge them using the merge and reduce functions.

# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]

Note: you can add as many dataframes as you like to the list above. This is the good part about this method: no complex queries are involved.

To keep the values that belong to the same date together, merge on the DATE column:

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames)

# to fill the values that are missing in the merged dataframe, chain fillna() with the string you want

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames).fillna('void')

Now the output will have the values from the same date on the same line.
You can use fillna() to fill in the values that are missing because a date does not appear in every dataframe.

Then write the merged data to a CSV file if desired.

df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)

This should give you

DATE VALUE1 VALUE2 VALUE3 ....
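
The question asks for only the dates that appear in every dataframe, which corresponds to how='inner' rather than how='outer'. The following is a minimal sketch of both variants, not the answer author's code; the frames are loosely based on the question's sample data, and the column names (date, price_1, price_2, price_3) are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Sample frames loosely based on the question's data; the column names
# ("date", "price_1", ...) are assumptions made for this sketch.
df1 = pd.DataFrame({"date": ["May 19, 2017", "May 18, 2017", "May 17, 2017", "May 15, 2017"],
                    "price_1": [1200.0, 1100.0, 1000.0, 1901.0]})
df2 = pd.DataFrame({"date": ["May 20, 2017", "May 18, 2017", "May 16, 2017", "May 15, 2017"],
                    "price_2": [2200.0, 2100.0, 2000.0, 2902.0]})
df3 = pd.DataFrame({"date": ["May 21, 2017", "May 17, 2017", "May 16, 2017", "May 15, 2017"],
                    "price_3": [3200.0, 3100.0, 3000.0, 3903.0]})
data_frames = [df1, df2, df3]

# how="inner" keeps only the dates present in every dataframe
df_inner = reduce(lambda left, right: pd.merge(left, right, on="date", how="inner"),
                  data_frames)

# how="outer" keeps every date and leaves NaN where a frame has no row for it
df_outer = reduce(lambda left, right: pd.merge(left, right, on="date", how="outer"),
                  data_frames)

print(df_inner)
print(df_outer)
```

With this sample data the inner merge leaves a single row for May 15, 2017, while the outer merge keeps one row per distinct date.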

Answer 2

Answered by dannyeuu

Looks like the data has the same columns, so you can:

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

merged_df = pd.concat([df1, df2])
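
Note that pd.concat with its default axis=0 stacks rows; it does not match rows on a key. A small sketch with two hypothetical frames that share the same columns (the names date and value are assumptions) shows the effect:

```python
import pandas as pd

# Hypothetical frames sharing the same columns
df1 = pd.DataFrame({"date": ["May 15, 2017", "May 17, 2017"], "value": [1.0, 2.0]})
df2 = pd.DataFrame({"date": ["May 15, 2017", "May 16, 2017"], "value": [3.0, 4.0]})

# Default axis=0: rows are appended one frame after the other,
# nothing is matched on "date"
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```

A date that occurs in both frames therefore appears twice in the result, so this approach suits appending records rather than combining columns per date.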

Answer 3

Answered by Ismail Hachimi

functools.reduce and pd.concat are both good solutions, but in terms of execution time pd.concat is the best.

from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, ...]
nan_value = 0

# solution 1 (fast)
result_1 = pd.concat(dfs, join='outer', axis=1).fillna(nan_value)

# solution 2
result_2 = reduce(lambda left, right: pd.merge(left, right,
                                               left_index=True, right_index=True,
                                               how='outer'),
                  dfs).fillna(nan_value)
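
Both solutions align on the index, so the date column has to be moved into the index first. Here is a minimal sketch under that assumption, with three small hypothetical frames (the labels d1, d2, d3 and the columns a, b, c are made up), checking that the two approaches give the same table:

```python
from functools import reduce

import pandas as pd

# Hypothetical frames; "date" is moved into the index so both approaches align on it
df1 = pd.DataFrame({"date": ["d1", "d2"], "a": [1, 2]}).set_index("date")
df2 = pd.DataFrame({"date": ["d2", "d3"], "b": [3, 4]}).set_index("date")
df3 = pd.DataFrame({"date": ["d1", "d3"], "c": [5, 6]}).set_index("date")
dfs = [df1, df2, df3]
nan_value = 0

# solution 1: a single column-wise concat on the index
result_1 = pd.concat(dfs, join="outer", axis=1).fillna(nan_value)

# solution 2: pairwise index merges folded over the list
result_2 = reduce(lambda left, right: pd.merge(left, right,
                                               left_index=True, right_index=True,
                                               how="outer"),
                  dfs).fillna(nan_value)

# Sort the index before comparing, since row order may differ between the two
print(result_1.sort_index().equals(result_2.sort_index()))
```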

Answer 4

Answered by Allen Wang

@dannyeuu's answer is correct. pd.concat naturally does a join on index columns, if you set the axis option to 1. The default is an outer join, but you can specify inner join too. Here is an example:

import pandas as pd

x = pd.DataFrame({'a': [2,4,3,4,5,2,3,4,2,5], 'b': [2,3,4,1,6,6,5,2,4,2], 'val': [1,4,4,3,6,4,3,6,5,7], 'val2': [2,4,1,6,4,2,8,6,3,9]})
x.set_index(['a','b'], inplace=True)
x.sort_index(inplace=True)

# y: a copy of x with one extra row and one extra column
y = x.copy()
y.loc[(14,14),:] = [3,1]
y['other'] = range(0,11)
y.sort_values('val', inplace=True)

# z: a copy of x with a different extra row and column
z = x.copy()
z.loc[(15,15),:] = [3,4]
z['another'] = range(0,22,2)
z.sort_values('val2', inplace=True)

# joins on the (a, b) index; rows missing from a frame are filled with NaN
pd.concat([x,y,z], axis=1)

Answer 5

Answered by jezrael

There are two solutions for this, but both return all the columns separately:

import functools

import numpy as np
import pandas as pd

dfs = [df1, df2, df3]

df_final = functools.reduce(lambda left,right: pd.merge(left,right,on='date'), dfs)
print (df_final)
          date     a_x   b_x       a_y      b_y   c_x         a        b   c_y
0  May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%

k = np.arange(len(dfs)).astype(str)
df = pd.concat([x.set_index('date') for x in dfs], axis=1, join='inner', keys=k)
df.columns = df.columns.map('_'.join)
print (df)
                0_a   0_b       1_a      1_b   1_c       2_a      2_b   2_c
date                                                                       
May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%

Answer 6

Answered by Kaibo

See the related question "pandas three-way joining multiple dataframes on columns":

filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
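
DataFrame.join accepts a list of dataframes and joins them on the index; its default is how='left', so an inner join has to be requested explicitly. This is a sketch with hypothetical in-memory frames instead of files (the date labels and column names are assumptions), and it assumes the column names do not overlap:

```python
import pandas as pd

# Hypothetical frames indexed by date, with non-overlapping column names
dfs = [
    pd.DataFrame({"a": [1, 2]}, index=pd.Index(["d1", "d2"], name="date")),
    pd.DataFrame({"b": [3, 4]}, index=pd.Index(["d2", "d1"], name="date")),
    pd.DataFrame({"c": [5, 6]}, index=pd.Index(["d1", "d3"], name="date")),
]

# The default how="left" keeps every row of dfs[0]
left_joined = dfs[0].join(dfs[1:])

# how="inner" keeps only the index values present in every frame
inner_joined = dfs[0].join(dfs[1:], how="inner")

print(left_joined)
print(inner_joined)
```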

Answer 7

Answered by zipa

If you are filtering by common date, this will return the matching rows:

dfs = [df1, df2, df3]
checker = dfs[-1]
check = set(checker.loc[:, 0])

for df in dfs[:-1]:
    check = check.intersection(set(df.loc[:, 0]))

print(checker[checker.loc[:, 0].isin(check)])
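
The code above selects the date with .loc[:, 0], which assumes the column is labelled 0 (as it would be for a file read without a header), and it prints only the matching rows of the last dataframe. A variant of the same idea for a named column might look like the sketch below; the frames and the column names (date, a, b, c) are made up for illustration.

```python
import pandas as pd

# Hypothetical frames with a named "date" column
df1 = pd.DataFrame({"date": ["May 18, 2017", "May 15, 2017"], "a": [1, 2]})
df2 = pd.DataFrame({"date": ["May 16, 2017", "May 15, 2017"], "b": [3, 4]})
df3 = pd.DataFrame({"date": ["May 17, 2017", "May 15, 2017"], "c": [5, 6]})
dfs = [df1, df2, df3]

# Intersect the sets of dates across all frames
common = set.intersection(*(set(df["date"]) for df in dfs))

# Keep only the rows with a common date, in every frame
filtered = [df[df["date"].isin(common)] for df in dfs]

for df in filtered:
    print(df)
```

This filters each frame but does not combine their columns; a merge or concat is still needed for that.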

Answer 8

Answered by Vasco Ferreira

Thank you for your help @jezrael, @zipa and @everestial007, your answers are what I need. If I wanted to make it recursive, this would also work as intended:

def mergefiles(dfs=[], on=''):
    """Merge a list of files based on one column"""
    if len(dfs) == 1:
        return "List only has one element."

    elif len(dfs) == 2:
        df1 = dfs[0]
        df2 = dfs[1]
        df = df1.merge(df2, on=on)
        return df

    # Merge the first and second dataframes into a new dataframe
    df = dfs[0].merge(dfs[1], on=on)

    # Create a new list with the merged dataframe
    dfl = []
    dfl.append(df)

    # Join lists
    dfl = dfl + dfs[2:]
    dfm = mergefiles(dfl, on)
    return dfm
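
The same recursion can be written more compactly. The sketch below is one possible variant, not the code above: merge_all is a hypothetical helper name, the sample frames are made up, and a single-element list simply returns that dataframe instead of a message. It is compared against functools.reduce to show that both give the same result.

```python
from functools import reduce

import pandas as pd


def merge_all(dfs, on):
    """Recursively merge a non-empty list of dataframes on one column (hypothetical helper)."""
    if not dfs:
        raise ValueError("need at least one dataframe")
    if len(dfs) == 1:
        return dfs[0]
    # Merge the first two frames, then recurse on the shortened list
    return merge_all([dfs[0].merge(dfs[1], on=on)] + dfs[2:], on)


# Made-up sample frames for illustration
df1 = pd.DataFrame({"date": ["May 18, 2017", "May 15, 2017"], "a": [1, 2]})
df2 = pd.DataFrame({"date": ["May 16, 2017", "May 15, 2017"], "b": [3, 4]})
df3 = pd.DataFrame({"date": ["May 17, 2017", "May 15, 2017"], "c": [5, 6]})
dfs = [df1, df2, df3]

result = merge_all(dfs, on="date")
expected = reduce(lambda left, right: left.merge(right, on="date"), dfs)

print(result)
print(result.equals(expected))
```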
