Original source: http://stackoverflow.com/questions/44545921/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
In pandas, how to concatenate horizontally and then remove the redundant columns
Asked by Jun Jang
Say I have two dataframes.
DF1: col1, col2, col3
DF2: col2, col4, col5
How do I concatenate the two dataframes horizontally so that the result has col1, col2, col3, col4, and col5? Right now I am doing pd.concat([DF1, DF2], axis = 1), but the result ends up with two col2 columns. Assuming all the values in the two col2 columns are the same, I want to keep only one of them.
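The problem described above can be reproduced with a small sketch. The sample values are hypothetical; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
DF1 = pd.DataFrame({"col1": [1, 2], "col2": [10, 20], "col3": [5, 6]})
DF2 = pd.DataFrame({"col2": [10, 20], "col4": [7, 8], "col5": [9, 0]})

# Horizontal concatenation keeps every column from both frames,
# so the shared name col2 appears twice in the result.
combined = pd.concat([DF1, DF2], axis=1)
print(list(combined.columns))
```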
Answered by Allen
Dropping duplicates should work. Because drop_duplicates compares rows rather than columns, we need to transpose the DataFrame, drop the duplicate rows, and transpose it back. Note that this removes columns whose values are identical, regardless of their names.
pd.concat([DF1, DF2], axis = 1).T.drop_duplicates().T
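As a minimal sketch, the transpose approach can be compared with a name-based alternative. The alternative, columns.duplicated(), is not part of this answer and is shown only for contrast; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
DF1 = pd.DataFrame({"col1": [1, 2], "col2": [10, 20], "col3": [5, 6]})
DF2 = pd.DataFrame({"col2": [10, 20], "col4": [7, 8], "col5": [9, 0]})
combined = pd.concat([DF1, DF2], axis=1)

# Approach from the answer: compare column *values* via a double transpose.
by_value = combined.T.drop_duplicates().T

# Alternative (not from the answer): keep the first column for each *name*.
by_name = combined.loc[:, ~combined.columns.duplicated()]

print(list(by_value.columns))
print(list(by_name.columns))
```

The two results agree here because the repeated col2 holds the same values in both frames. They can differ when two differently named columns happen to contain identical values, since the transpose approach would drop one of them.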
Answered by jezrael
Use difference to get the columns of DF2 that are not in DF1, then simply select them with []:
DF1 = pd.DataFrame(columns=['col1', 'col2', 'col3'])
DF2 = pd.DataFrame(columns=['col2', 'col4', 'col5'])
DF2 = DF2[DF2.columns.difference(DF1.columns)]
print (DF2)
Empty DataFrame
Columns: [col4, col5]
Index: []
print (pd.concat([DF1, DF2], axis = 1))
Empty DataFrame
Columns: [col1, col2, col3, col4, col5]
Index: []
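The empty frames above show only the resulting column names. A small sketch with hypothetical values illustrates the same steps on actual data:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer.
DF1 = pd.DataFrame({"col1": [1, 2], "col2": [10, 20], "col3": [5, 6]})
DF2 = pd.DataFrame({"col2": [10, 20], "col4": [7, 8], "col5": [9, 0]})

# Columns of DF2 that DF1 does not already have.
# Index.difference returns the labels in sorted order by default.
extra_cols = DF2.columns.difference(DF1.columns)

# Select only those columns from DF2, then concatenate side by side.
result = pd.concat([DF1, DF2[extra_cols]], axis=1)
print(result)
```

If the original column order of DF2 matters, passing sort=False to difference should preserve it rather than sorting the labels.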
Timings:
np.random.seed(123)
N = 1000
DF1 = pd.DataFrame(np.random.rand(N,3), columns=['col1', 'col2', 'col3'])
DF2 = pd.DataFrame(np.random.rand(N,3), columns=['col2', 'col4', 'col5'])
DF2['col2'] = DF1['col2']
In [408]: %timeit (pd.concat([DF1, DF2], axis = 1).T.drop_duplicates().T)
10 loops, best of 3: 122 ms per loop
In [409]: %timeit (pd.concat([DF1, DF2[DF2.columns.difference(DF1.columns)]], axis = 1))
1000 loops, best of 3: 979 μs per loop
N = 10000:
In [411]: %timeit (pd.concat([DF1, DF2], axis = 1).T.drop_duplicates().T)
1 loop, best of 3: 1.4 s per loop
In [412]: %timeit (pd.concat([DF1, DF2[DF2.columns.difference(DF1.columns)]], axis = 1))
1000 loops, best of 3: 1.12 ms per loop
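The %timeit lines above require IPython. A rough equivalent using the standard timeit module might look like the sketch below; the helper names via_transpose and via_difference are made up for illustration, and absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000
DF1 = pd.DataFrame(np.random.rand(N, 3), columns=["col1", "col2", "col3"])
DF2 = pd.DataFrame(np.random.rand(N, 3), columns=["col2", "col4", "col5"])
DF2["col2"] = DF1["col2"]


def via_transpose():
    # Hypothetical helper wrapping the transpose/drop_duplicates approach.
    return pd.concat([DF1, DF2], axis=1).T.drop_duplicates().T


def via_difference():
    # Hypothetical helper wrapping the columns.difference approach.
    return pd.concat([DF1, DF2[DF2.columns.difference(DF1.columns)]], axis=1)


# Both approaches should produce the same frame for this all-float data.
pd.testing.assert_frame_equal(via_transpose(), via_difference())

# Average over a few calls; absolute numbers depend on the environment.
t_transpose = timeit.timeit(via_transpose, number=5) / 5
t_difference = timeit.timeit(via_difference, number=5) / 5
print(f"transpose:  {t_transpose * 1e3:.2f} ms per call")
print(f"difference: {t_difference * 1e3:.2f} ms per call")
```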
Answered by YOBEN_S
DF2.drop(DF2.columns[DF2.columns.isin(DF1.columns)],axis=1,inplace=True)
Then,
pd.concat([DF1, DF2], axis = 1)
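Because inplace=True modifies DF2 itself, a non-mutating variant of the same idea may be preferable when DF2 is needed later. This is a sketch with hypothetical sample values, and the names overlap and DF2_unique are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
DF1 = pd.DataFrame({"col1": [1, 2], "col2": [10, 20], "col3": [5, 6]})
DF2 = pd.DataFrame({"col2": [10, 20], "col4": [7, 8], "col5": [9, 0]})

# Boolean mask: True for DF2 columns whose name already appears in DF1.
overlap = DF2.columns.isin(DF1.columns)

# Keep only the non-overlapping columns, leaving DF2 itself unchanged.
DF2_unique = DF2.loc[:, ~overlap]

result = pd.concat([DF1, DF2_unique], axis=1)
print(list(result.columns))
```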
Answered by maria_g
To avoid duplicated columns when joining two data frames, use the ignore_index argument.
pd.concat([df1, df2], ignore_index=True, sort=False)
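Note that this call uses the default axis=0, so it stacks the rows of df2 below df1 and aligns columns by name rather than placing the frames side by side. Use it only if you wish to append the frames and ignore the fact that they may have overlapping indexes.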
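A short sketch with hypothetical values shows what this call produces: the rows are appended and renumbered, and cells for columns that a frame lacks are filled with NaN.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df1 = pd.DataFrame({"col1": [1, 2], "col2": [10, 20], "col3": [5, 6]})
df2 = pd.DataFrame({"col2": [10, 20], "col4": [7, 8], "col5": [9, 0]})

# Default axis=0: rows are appended and columns are aligned by name.
# ignore_index=True discards the original row labels and renumbers from 0.
stacked = pd.concat([df1, df2], ignore_index=True, sort=False)

print(stacked.shape)
print(list(stacked.columns))
print(stacked)
```

The result has four rows rather than two, so this is a different operation from the horizontal concatenation asked about in the question.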