Python pandas: three-way joining multiple dataframes on columns

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/23668427/

Time: 2020-08-19 03:17:52  Source: igfitidea

pandas three-way joining multiple dataframes on columns

python, pandas, join, merge

Asked by lollercoaster

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?

The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

Accepted answer by Kit

Assumed imports:

import pandas as pd

John Galt's answer is basically a reduce operation. If I have more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):

dfs = [df0, df1, df2, dfN]

Assuming they have some common column, like name in your example, I'd do the following:

df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)

That way, your code should work with whatever number of dataframes you want to merge.

Edit August 1, 2016: For those using Python 3: reduce has been moved into functools. So to use this function, you'll first need to import it from that module:

from functools import reduce
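
Putting the pieces of this answer together, a minimal end-to-end sketch might look like the following. The three in-memory CSV strings and their column names (age, city, score) are made up for illustration, and how='outer' is an assumption added here so that names missing from some files are still kept; drop it to get the default inner join.

```python
import io
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the three CSV files; the first column is the person's name.
csv_a = "name,age\nalice,30\nbob,25\n"
csv_b = "name,city\nalice,Paris\ncarol,Oslo\n"
csv_c = "name,score\nbob,7\ncarol,9\n"

dfs = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b, csv_c)]

# Fold the list pairwise: merge(merge(dfs[0], dfs[1]), dfs[2]).
# how='outer' keeps every name that appears in at least one file.
df_final = reduce(
    lambda left, right: pd.merge(left, right, on="name", how="outer"),
    dfs,
)

# Write the combined table back out as a single CSV (here to a string).
csv_out = df_final.to_csv(index=False)
```

With real files, the list comprehension would call pd.read_csv on each path and the last line would pass a filename to to_csv.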

Answer by Zero

You could try this if you have 3 dataframes:

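# Merge multiple dataframes
import numpy as np
import pandas as pd

# Note: np.array() on rows mixing str and int converts every value to a string,
# so the attr columns below hold strings rather than numbers.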
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')

Alternatively, as mentioned by cwharland:

df1.merge(df2,on='name').merge(df3,on='name')
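
Both forms above use the default how='inner', which drops any name that is absent from one of the frames. A small sketch of the difference, using made-up frames (left_df and right_df are illustrative names, not from the answer):

```python
import pandas as pd

left_df = pd.DataFrame({"name": ["a", "b", "c"], "attr11": [5, 4, 24]})
right_df = pd.DataFrame({"name": ["b", "c", "d"], "attr21": [14, 4, 7]})

# Default inner join: only names present in both frames survive.
inner = left_df.merge(right_df, on="name")

# Outer join: every name is kept; missing attributes become NaN.
outer = left_df.merge(right_df, on="name", how="outer")
```

Since the question asks for one row per unique name, how='outer' is probably the closer match whenever the three files do not list exactly the same people.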

Answer by Guillaume Jacquenot

One does not need a multiindex to perform join operations. One just needs to correctly set the index column on which to perform the join (with df.set_index('Name'), for example).

The join operation is performed on the index by default. In your case, you just have to specify that the Name column corresponds to your index. Below is an example.

A tutorial may be useful.

# Simple example where each dataframe's index holds the names on which to perform the join
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'],         index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'],     index=name)
df = df1.join(df2)
df = df.join(df3)

# If you have a 'Name' column that is not the index of your dataframe, you can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name']=df1.index
# 2) Set the index from column 'Name'
df1=df1.set_index('Name')

# If the indexes differ, you may have to adjust the `how` parameter
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))

gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
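
Because the frames above are filled with random numbers, the effect of how='outer' is easier to see on fixed data. The following sketch reuses the same overlapping index ranges with constant values (the constants are arbitrary and only there to make the NaN gaps visible):

```python
import pandas as pd

gf1 = pd.DataFrame({"A": 1.0}, index=range(8))        # index 0..7
gf2 = pd.DataFrame({"D": 2.0}, index=range(2, 10))    # index 2..9
gf3 = pd.DataFrame({"E": 3.0}, index=range(4, 12))    # index 4..11

# Default (how='left') keeps only the calling frame's index labels.
left_joined = gf1.join(gf2).join(gf3)

# how='outer' keeps the union of all index labels.
outer_joined = gf1.join(gf2, how="outer").join(gf3, how="outer")
```

The left join keeps 8 rows and leaves D and E empty where gf2 and gf3 have no matching label; the outer join grows to 12 rows and leaves A empty for labels 8 to 11.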

Answer by AlexG

This can also be done as follows for a list of dataframes df_list:

df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')

or if the dataframes are in a generator object (e.g. to reduce memory consumption):

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
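
For instance, the generator could read one file at a time. A sketch, where csv_texts, read_frames and the column name 'name' are hypothetical and in-memory strings stand in for files:

```python
import io

import pandas as pd

# Hypothetical CSV contents; in practice these would be file paths.
csv_texts = [
    "name,x\na,1\nb,2\n",
    "name,y\na,3\nb,4\n",
    "name,z\na,5\nb,6\n",
]


def read_frames(texts):
    """Yield one DataFrame at a time instead of building a full list."""
    for text in texts:
        yield pd.read_csv(io.StringIO(text))


frames = read_frames(csv_texts)

# next() consumes the first frame; the loop then continues with the rest.
df = next(frames)
for df_ in frames:
    df = df.merge(df_, on="name")
```

Note that next() only works on an iterator, so a plain list would need to be wrapped in iter() first.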

Answer by rz1317

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. It also fills in missing values if needed.

This is the function to merge a dict of data frames:

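def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
  # dict.keys() is not subscriptable in Python 3, so convert it to a list first
  keys = list(dfDict.keys())
  for i in range(len(keys)):
    key = keys[i]
    df0 = dfDict[key]
    cols = list(df0.columns)
    # every column that is not a join key is a value column
    valueCols = list(filter(lambda x: x not in (onCols), cols))
    df0 = df0[onCols + valueCols]
    # suffix each value column with its dict key, e.g. col1 -> col1_US
    df0.columns = onCols + [(s + '_' + key) for s in valueCols]

    if (i == 0):
      outDf = df0
    else:
      outDf = pd.merge(outDf, df0, how=how, on=onCols)

  # optionally replace the NaNs produced by the merge
  if (naFill is not None):
    outDf = outDf.fillna(naFill)

  return(outDf)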

OK, let's generate data and test this:

def GenDf(size):
  df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                      'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True), 
                      'col1':np.random.uniform(low=0.0, high=100.0, size=size), 
                      'col2':np.random.uniform(low=0.0, high=100.0, size=size)
                      })
  df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
  return(df)


size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}   
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)

Answer by Ted Petrou

This is an ideal situation for the join method

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.

The code would look something like this:

filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
# index_col is a placeholder for the name (or position) of the join column, e.g. 'name'
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])

With @zero's data, you could do this:

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])

     attr11 attr12 attr21 attr22 attr31 attr32
name                                          
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9
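
When the frames do not all contain the same names, join also accepts a how argument together with a list. A sketch with made-up data (df_a, df_b, df_c and their columns are hypothetical; the default how='left' would keep only the names in the calling frame):

```python
import pandas as pd

df_a = pd.DataFrame({"name": ["a", "b"], "attr1": [1, 2]}).set_index("name")
df_b = pd.DataFrame({"name": ["b", "c"], "attr2": [3, 4]}).set_index("name")
df_c = pd.DataFrame({"name": ["a", "c"], "attr3": [5, 6]}).set_index("name")

# Join a list of frames on their shared index, keeping every name.
joined = df_a.join([df_b, df_c], how="outer")
```

Each name then appears once in the index, with NaN wherever a frame had no row for it.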

Answer by Sylhare

There is another solution from the pandas documentation (that I don't see here), using .append:

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df2
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index values available in the source one.

If there are different column names, NaN will be introduced.
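
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the call above raises AttributeError. The same row-wise stacking can be written with pd.concat, as sketched below. This stacks rows rather than joining on a key column, so it suits frames that describe different rows, not different attributes of the same person. The frame df3 with column C is made up to show the NaN behaviour.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"))

# Equivalent of df.append(df2, ignore_index=True) on pandas >= 2.0.
stacked = pd.concat([df, df2], ignore_index=True)

# A frame with a column the others lack: the gaps are filled with NaN.
df3 = pd.DataFrame([[9, 10]], columns=list("AC"))
stacked_nan = pd.concat([df, df3], ignore_index=True)
```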

Answer by Igor Fobia

In Python 3.6.3 with pandas 0.22.0 you can also use concat, as long as you set the columns you want to join on as the index:

pd.concat(
    (iDF.set_index('name') for iDF in [df1, df2, df3]),
    axis=1, join='inner'
).reset_index()

where df1, df2, and df3 are defined as in John Galt's answer:

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)

Answer by Gil Baggio

Simple Solution:

If the column names are the same:

df1.merge(df2,on='col_name').merge(df3,on='col_name')

If the column names are different:

(df1.merge(df2, left_on='col_name1', right_on='col_name2')
    .merge(df3, left_on='col_name1', right_on='col_name3')
    .drop(columns=['col_name2', 'col_name3'])
    .rename(columns={'col_name1': 'col_name'}))
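
A runnable sketch of the second case, with made-up frames whose key columns are called person, full_name and who (all hypothetical names):

```python
import pandas as pd

df1 = pd.DataFrame({"person": ["a", "b"], "attr1": [1, 2]})
df2 = pd.DataFrame({"full_name": ["a", "b"], "attr2": [3, 4]})
df3 = pd.DataFrame({"who": ["a", "b"], "attr3": [5, 6]})

merged = (
    df1.merge(df2, left_on="person", right_on="full_name")
       .merge(df3, left_on="person", right_on="who")
       # Both key columns are kept after each merge, so drop the redundant ones.
       .drop(columns=["full_name", "who"])
       .rename(columns={"person": "name"})
)
```

Renaming the key columns up front, for example df2.rename(columns={'full_name': 'person'}), and then merging with on= would avoid the drop step.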

Answer by decision_scientist_noah

The three dataframes are: [images not preserved]

Let's merge these frames using nested pd.merge: [image not preserved]

Here we go, we have our merged dataframe.

Happy Analysis!!!
