pandas: concatenating DataFrames with different column order

Question

Asked by Santi Peñate-Vera

I am parsing data from Excel files, and the columns of the resulting DataFrame may or may not align with those of a base DataFrame onto which I want to stack several parsed DataFrames.

Let's call the DataFrame I parse from the data A, and the base DataFrame df_A.

I read an Excel sheet, resulting in A =

Index                    AGUB  AGUG   MUEB   MUEB    SIL    SIL   SILB   SILB
2012-01-01 00:00:00      0.00     0   0.00  50.78   0.00   0.00   0.00   0.00
2012-01-01 01:00:00      0.00     0   0.00  53.15   0.00  53.15   0.00   0.00
2012-01-01 02:00:00      0.00     0   0.00   0.00  53.15  53.15  53.15  53.15
2012-01-01 03:00:00      0.00     0   0.00   0.00   0.00  55.16   0.00   0.00
2012-01-01 04:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 05:00:00     48.96     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 06:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 07:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 08:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 09:00:00     52.28     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 10:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 11:00:00     36.93     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 12:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 13:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00  50.00
2012-01-01 14:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00  34.01
2012-01-01 15:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 16:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 17:00:00     53.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 18:00:00      0.00    75   0.00  75.00   0.00  75.00   0.00   0.00
2012-01-01 19:00:00      0.00    70   0.00  70.00   0.00   0.00   0.00   0.00
2012-01-01 20:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 21:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 22:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 23:00:00      0.00     0  53.45  53.45   0.00   0.00   0.00   0.00

I create the base dataframe:

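import pandas as pd

units = ['MUE', 'MUEB', 'SIL', 'SILB', 'AGUG', 'AGUB', 'MUEBP', 'MUELP']
df_A = pd.DataFrame(columns=units)   # empty base frame that defines the column order
df_A = pd.concat([df_A, A], axis=0)  # stack the parsed frame A under the base frame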

Usually with concat, if A has fewer columns than df_A it works fine, but in this case the only difference in the columns is the order. The concatenation leads to the following error:

ValueError: Plan shapes are not aligned

I'd like to know how to concatenate the two DataFrames with the column order given by df_A.

Answer 1

Answered by Thomas Kimber

I've tried this, and it doesn't matter whether there are more columns in the source or in the target DataFrame - either way, the result is a DataFrame that consists of the union of all supplied columns (columns specified in the target but not populated by the source are filled with NaN).

Where I have been able to reproduce your error is where the column names in either the source or target dataframe include a duplicate name (or empty column names).

In your example, various columns appear more than once in your source file. I don't think concat copes very well with these kinds of duplicate columns.

import pandas as pd
s1 = [0,1,2,3,4,5]
s2 = [0,0,0,0,1,1]
A = pd.DataFrame([s2,s1],columns=['A','B','C','D','E','F'])

Resulting in:

A B C D E F
-----------
0 0 0 0 1 1 
0 1 2 3 4 5

Take a subset of columns and use them to create a new dataframe called B

B = A[['A','C','E']]

 

A C E
-----
0 0 1 
0 2 4

Create a new empty target dataframe

col_names = ['D','A','C','B']
Z = pd.DataFrame(columns=col_names)

D A C B
-------

And concatenate the two:

Z = pd.concat([B,Z],axis=0)

A  B    C  D    E
0  NaN  0  NaN  1
0  NaN  2  NaN  4

Works fine!
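
If the result should also follow the target's column order (which is what the question asks for), one option is to reindex the source to the target's columns before stacking. This is only a minimal sketch, assuming the column names are unique and that source columns absent from the target (E here) may be dropped; the name B_aligned is made up for illustration:

```python
import pandas as pd

A = pd.DataFrame([[0, 0, 0, 0, 1, 1], [0, 1, 2, 3, 4, 5]],
                 columns=['A', 'B', 'C', 'D', 'E', 'F'])
B = A[['A', 'C', 'E']]

col_names = ['D', 'A', 'C', 'B']

# reindex keeps only the target columns, in the target order;
# target columns missing from B (D and B) are filled with NaN
B_aligned = B.reindex(columns=col_names)

# frames aligned this way can be stacked and keep the target order
stacked = pd.concat([B_aligned, B_aligned], axis=0, ignore_index=True)

print(list(stacked.columns))
```

Because reindex needs unique column labels, this does not help with the duplicate-name case described next.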

But if I recreate the empty dataframe with a duplicated column name, like so:

col_names = ['D','A','C','D']
Z = pd.DataFrame(columns=col_names)

    D A C D

And try to concatenate:

Z = pd.concat([B,Z],axis=0)

Then I get the error you describe.
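
One way to check for this condition before concatenating is to look for repeated names in the column index. A minimal sketch using `Index.duplicated()`; the helper name `duplicate_columns` is hypothetical, not part of pandas:

```python
import pandas as pd

def duplicate_columns(df):
    """Return the column names that occur more than once in df."""
    cols = df.columns
    return list(cols[cols.duplicated()].unique())

Z_ok = pd.DataFrame(columns=['D', 'A', 'C', 'B'])
Z_bad = pd.DataFrame(columns=['D', 'A', 'C', 'D'])

print(duplicate_columns(Z_ok))   # []
print(duplicate_columns(Z_bad))  # ['D']
```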

Answer 2

Answered by Def_Os

It's because of the duplicate columns in the data (MUEB, SIL and SILB each appear twice). See: Pandas concat gives error ValueError: Plan shapes are not aligned
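
Once the duplicated names are dealt with, the parsed frame can be aligned to the base column order. A rough sketch, assuming that keeping the first of each duplicated column is acceptable. Whether to keep the first, sum them, or rename them depends on what the duplicates mean in the Excel file, which the question does not say. The small frame below is made-up sample data with a duplicated SIL column, and the names A_unique and A_aligned are illustrative:

```python
import pandas as pd

units = ['MUE', 'MUEB', 'SIL', 'SILB', 'AGUG', 'AGUB', 'MUEBP', 'MUELP']

# Made-up stand-in for the parsed sheet, with SIL appearing twice
A = pd.DataFrame([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
                 columns=['AGUB', 'SIL', 'SIL', 'MUEB'])

# Assumption: keep only the first occurrence of each duplicated column name
A_unique = A.loc[:, ~A.columns.duplicated()]

# Reorder to the base column order; units missing from A become NaN
A_aligned = A_unique.reindex(columns=units)

print(list(A_aligned.columns) == units)  # True
```

Frames aligned this way share identical, unique columns, so `pd.concat([...], axis=0)` can stack them in the order given by units.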
