Original URL: http://stackoverflow.com/questions/23668427/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
pandas three-way joining multiple dataframes on columns
Asked by lollercoaster
I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?
The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
Accepted answer by Kit
Assumed imports:
import pandas as pd
John Galt's answer is basically a reduce operation. If I have more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, dfN]
Assuming they have some common column, like name in your example, I'd do the following:
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
Edit August 1, 2016: For those using Python 3: reduce has been moved into functools. So to use this function, you'll first need to import that module:
from functools import reduce
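Putting the pieces of this answer together, a minimal self-contained sketch might look like the following. The three small dataframes and their attr_x, attr_y, attr_z columns are made up for illustration; only reduce, pd.merge and the shared name column come from the answer itself.

```python
from functools import reduce

import pandas as pd

# Hypothetical sample data: every frame shares the 'name' column
df0 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr_x': [1, 2, 3]})
df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr_y': [4, 5, 6]})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'attr_z': [7, 8, 9]})
dfs = [df0, df1, df2]

# Fold the list pairwise: ((df0 merge df1) merge df2)
df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)

# pd.merge defaults to an inner join, so only names found in every frame remain
print(df_final)
```

Passing how='outer' to pd.merge inside the lambda would keep 'c' and 'd' as well, with NaN in the columns those rows lack.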
Answer by Zero
You could try this if you have 3 dataframes:
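import numpy as np
import pandas as pd

# Merge multiple dataframes
# Note: np.array coerces the mixed str/int rows to strings, so every column holds strings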
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32'])
pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')
Alternatively, as mentioned by cwharland:
df1.merge(df2,on='name').merge(df3,on='name')
Answer by Guillaume Jacquenot
One does not need a multiindex to perform join operations. One just needs to correctly set the index column on which to perform the join (with the command df.set_index('Name'), for example).
The join operation is by default performed on the index. In your case, you just have to specify that the Name column corresponds to your index. Below is an example.
A tutorial may be useful.
# Simple example where dataframes index are the name on which to perform the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe, you can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name']=df1.index
# 2) Set the index from column 'Name'
df1=df1.set_index('Name')
# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
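To make the effect of the how parameter concrete, here is a small sketch with deterministic data. The frames and their column names are hypothetical; the calls are the same DataFrame.join used above.

```python
import pandas as pd

# Hypothetical frames with partially overlapping integer indexes
gf1 = pd.DataFrame({'A': range(4)}, index=range(0, 4))
gf2 = pd.DataFrame({'D': range(4)}, index=range(2, 6))

left = gf1.join(gf2)                 # default how='left': keeps gf1's index only
inner = gf1.join(gf2, how='inner')   # keeps labels present in both frames
outer = gf1.join(gf2, how='outer')   # keeps the union of labels, gaps become NaN

print(len(left), len(inner), len(outer))
```

With how='outer', labels 4 and 5 exist only in gf2, so column A is NaN on those two rows.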
Answer by AlexG
This can also be done as follows for a list of dataframes df_list:
df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')
or if the dataframes are in a generator object (e.g. to reduce memory consumption):
df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
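A runnable sketch of the generator variant, assuming a made-up generator function frames() and a shared name column (neither is part of the original answer):

```python
import pandas as pd

def frames():
    # Hypothetical generator; in practice each frame might come from pd.read_csv
    for suffix in ['x', 'y', 'z']:
        yield pd.DataFrame({'name': ['a', 'b'], 'attr_' + suffix: [1, 2]})

df_iter = frames()
df = next(df_iter)        # take the first frame as the starting point
for df_ in df_iter:       # the loop resumes from the second frame
    df = df.merge(df_, on='name')

print(df)
```

Note that next() only works on an iterator; for a plain list, call iter(df_list) first.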
Answer by rz1317
Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:
This is the function to merge a dict of data frames:
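def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    # In Python 3, dict.keys() returns a view that cannot be indexed, so copy it to a list
    keys = list(dfDict.keys())
    for i in range(len(keys)):
        key = keys[i]
        df0 = dfDict[key]
        cols = list(df0.columns)
        # Every column that is not a join key is a value column
        valueCols = list(filter(lambda x: x not in (onCols), cols))
        df0 = df0[onCols + valueCols]
        # Suffix value columns with the dict key so they stay distinguishable after merging
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]
        if i == 0:
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)
    if naFill is not None:
        outDf = outDf.fillna(naFill)
    return outDf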
OK, let's generate data and test this:
def GenDf(size):
    df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                       'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True),
                       'col1':np.random.uniform(low=0.0, high=100.0, size=size),
                       'col2':np.random.uniform(low=0.0, high=100.0, size=size)
                       })
    df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
    return(df)

size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
Answer by Ted Petrou
This is an ideal situation for the join method
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
With @zero's data, you could do this:
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32'])
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])
     attr11 attr12 attr21 attr22 attr31 attr32
name
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9
Answer by Sylhare
There is another solution from the pandas documentation (that I don't see here), using .append:
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df2
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.
If there are different column names, NaN will be introduced.
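DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same row-wise stacking is usually written with pd.concat. The sketch below reuses the two frames from this answer. Keep in mind that this stacks rows on top of each other; it does not match rows by a key column the way merge or join does.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Row-wise stacking with a fresh 0..n-1 index,
# the counterpart of df.append(df2, ignore_index=True)
stacked = pd.concat([df, df2], ignore_index=True)
print(stacked)
```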
Answer by Igor Fobia
In python 3.6.3 with pandas 0.22.0 you can also use concat as long as you set as index the columns you want to use for the joining:
pd.concat(
    (iDF.set_index('name') for iDF in [df1, df2, df3]),
    axis=1, join='inner'
).reset_index()
where df1, df2, and df3 are defined as in John Galt's answer:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32']
)
Answer by Gil Baggio
Simple Solution:
If the join column has the same name in every dataframe:
df1.merge(df2,on='col_name').merge(df3,on='col_name')
If the column names are different:
df1.merge(df2,left_on='col_name1', right_on='col_name2').merge(df3,left_on='col_name1', right_on='col_name3').drop(columns=['col_name2', 'col_name3']).rename(columns={'col_name1':'col_name'})
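A worked sketch of the different-names case, using tiny made-up frames (the x, y, z value columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames whose key columns carry different names
df1 = pd.DataFrame({'col_name1': ['a', 'b'], 'x': [1, 2]})
df2 = pd.DataFrame({'col_name2': ['a', 'b'], 'y': [3, 4]})
df3 = pd.DataFrame({'col_name3': ['a', 'b'], 'z': [5, 6]})

merged = (
    df1.merge(df2, left_on='col_name1', right_on='col_name2')
       .merge(df3, left_on='col_name1', right_on='col_name3')
       .drop(columns=['col_name2', 'col_name3'])    # redundant copies of the key
       .rename(columns={'col_name1': 'col_name'})
)
print(merged)
```

When left_on and right_on differ, merge keeps both key columns in the result, which is why the drop step is needed.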