Pandas concat failing

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/35137952/

Time: 2020-09-14 00:36:06  Source: igfitidea

python, pandas

Asked by user308827

I am trying to concat dataframes based on the following two CSV files:

df_a: https://www.dropbox.com/s/slcu7o7yyottujl/df_current.csv?dl=0

df_b: https://www.dropbox.com/s/laveuldraurdpu1/df_climatology.csv?dl=0

Both of these have the same number and names of columns. However, when I do this:

pandas.concat([df_a, df_b])

I get the error:

AssertionError: Number of manager items must equal union of block items
# manager items: 20, # tot_items: 21

How to fix this?

Answered by phil_20686

I believe that this error occurs if the following two conditions are met:

  1. The data frames have different columns (i.e. df1.columns.equals(df2.columns) is False).
  2. At least one of them has a repeated column name.

Basically, if you concat dataframes with columns [A,B,C] and [B,C,D], pandas can work out that it should make one series for each distinct column name. But if I then try to join a third dataframe with columns [B,B,C], it does not know which column to append to and ends up with fewer distinct columns than it thinks it needs.

If your dataframes have identical columns in the same order, the concat will work anyway. So you can join [B,B,C] to [B,B,C], but not to [C,B,B]; when the columns are identical it probably just uses the integer positions or something similar.
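To check whether a set of dataframes meets the two conditions above before calling concat, something like the following sketch may help. The helper name concat_risk_report and the tiny example frames are made up for illustration; only Index.duplicated and Index.equals come from pandas.

```python
import pandas as pd


def concat_risk_report(frames):
    """Return (duplicated column names per frame, whether all frames share identical columns)."""
    # Index.duplicated() marks every repeat of a column name after its first occurrence.
    duplicated = [
        sorted(set(df.columns[df.columns.duplicated()])) for df in frames
    ]
    first = frames[0].columns
    # Index.equals() is True only when names and order both match.
    identical = all(first.equals(df.columns) for df in frames[1:])
    return duplicated, identical


# Hypothetical frames mirroring the [A,B,C], [B,C,D] and [B,B,C] cases above.
df1 = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "C"])
df2 = pd.DataFrame([[4, 5, 6]], columns=["B", "C", "D"])
df3 = pd.DataFrame([[7, 8, 9]], columns=["B", "B", "C"])

dups, same = concat_risk_report([df1, df2, df3])
print(dups)  # which names are repeated in each frame
print(same)  # whether every frame has the same columns in the same order
```

If a frame reports a repeated name and the frames do not all share identical columns, that is the combination this answer describes as failing.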

Answered by kmader

You can get around this issue with a 'manual' concatenation. In this case your list of dataframes is

list_of_dfs = [df_a, df_b]

And instead of running

giant_concat_df = pd.concat(list_of_dfs, axis=0)

you can turn each dataframe into a list of dictionaries (one dictionary per row) and then build a new data frame from these lists, merged with itertools.chain:

from itertools import chain
import pandas as pd

# Each row of each dataframe becomes one dict keyed by column name
list_of_dicts = [cur_df.T.to_dict().values() for cur_df in list_of_dfs]
giant_concat_df = pd.DataFrame(list(chain(*list_of_dicts)))
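For reference, here is a small self-contained sketch of the same row-dictionary idea. It uses to_dict("records") in place of .T.to_dict().values(), which is my substitution rather than what the answer uses, and df_a / df_b are hypothetical stand-ins because the original CSV files are not available.

```python
from itertools import chain

import pandas as pd

# Hypothetical stand-ins for df_a and df_b (the original CSV files are not available).
df_a = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})
df_b = pd.DataFrame({"y": ["c"], "z": [3.0]})
list_of_dfs = [df_a, df_b]

# to_dict("records") yields one {column: value} dict per row.
list_of_dicts = [cur_df.to_dict("records") for cur_df in list_of_dfs]

# Flatten the per-frame lists into one list of rows and rebuild a single frame.
giant_concat_df = pd.DataFrame(list(chain.from_iterable(list_of_dicts)))

print(giant_concat_df.shape)
print(sorted(giant_concat_df.columns))
```

Two things to keep in mind with this approach: the original row index is discarded in favour of a fresh 0..n-1 index, and because a dictionary holds only one value per key, a column name that is repeated within a frame keeps just one of its columns.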

Answered by Karatheodory

Unfortunately, the source files are no longer available, so I can't check my solution against your case. In my case the error occurred when:

  1. The data frames have two columns with the same name (I had ID and id columns, which I then converted to lower case, so they became the same)
  2. The value types of the same-named columns are different

Here is an example which gives me the error in question:

df1 = pd.DataFrame(data=[
    ['a', 'b', 'id', 1],
    ['a', 'b', 'id', 2]
], columns=['A', 'B', 'id', 'id'])

df2 = pd.DataFrame(data=[
    ['b', 'c', 'id', 1],
    ['b', 'c', 'id', 2]
], columns=['B', 'C', 'id', 'id'])
pd.concat([df1, df2])
>>> AssertionError: Number of manager items must equal union of block items
 # manager items: 4, # tot_items: 5

Removing / renaming one of the columns makes this code work.
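One possible way to do the renaming automatically is sketched below. The helper make_columns_unique and its numeric suffix scheme (id, id_1, ...) are my own illustration, not part of pandas, and the sketch assumes the suffixed names do not collide with columns that already exist.

```python
import pandas as pd


def make_columns_unique(df):
    """Return a copy of df whose repeated column names get a numeric suffix."""
    seen = {}
    new_names = []
    for name in df.columns:
        count = seen.get(name, 0)
        # The first occurrence keeps its name; later ones become name_1, name_2, ...
        new_names.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    out = df.copy()
    out.columns = new_names
    return out


df1 = pd.DataFrame(
    [["a", "b", "id", 1], ["a", "b", "id", 2]],
    columns=["A", "B", "id", "id"],
)
df2 = pd.DataFrame(
    [["b", "c", "id", 1], ["b", "c", "id", 2]],
    columns=["B", "C", "id", "id"],
)

# With unique names in every frame, concat can align the columns by label.
result = pd.concat(
    [make_columns_unique(df1), make_columns_unique(df2)],
    ignore_index=True,
    sort=False,
)
print(sorted(result.columns))
```

Renaming by position like this only makes sense when the repeated columns sit in the same order in every frame; otherwise id_1 in one frame may not mean the same thing as id_1 in another.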
