Disclaimer: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/32488417/

Time: 2020-09-13 23:52:07  Source: igfitidea

pandas AssertionError: invalid dtype determination in get_concat_dtype when concatenating a list of DataFrames

Tags: python, csv, pandas

Asked by ahlusar1989

I have a list of Dataframes that I am attempting to combine using the concatenation function.

dataframe_lists = [df1, df2, df3]

result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)

The full traceback is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
      2 check(dataframe_lists)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    753                        verify_integrity=verify_integrity,
    754                        copy=copy)
--> 755     return op.get_result()
    756 
    757 

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
    924 
    925             new_data = concatenate_block_managers(
--> 926                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
    927             if not self.copy:
    928                 new_data._consolidate_inplace()

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
   4150         raise AssertionError("Concatenating join units along axis0")
   4151 
-> 4152     empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
   4153 
   4154     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
   4139         return np.dtype('m8[ns]'), tslib.iNaT
   4140     else:  # pragma
-> 4141         raise AssertionError("invalid dtype determination in get_concat_dtype")
   4142 
   4143 

AssertionError: invalid dtype determination in get_concat_dtype

I believe the error arises because one of the data frames is empty. I used the simple function check to verify this and return just the headers of the empty dataframe:

def check(list_of_df):
    """Return the column headers of every empty dataframe in the list."""
    headers = []
    # iterate over the argument, not the global dataframe_lists
    for df in list_of_df:
        if df.empty:
            headers.append(df.columns)

    return headers

I am wondering whether this function can be used so that, when a dataframe is empty, only its headers are returned and appended to the concatenated dataframe. The output would be a single row of headers, with a repeated column name appearing only once (as the concatenation function already does). I have two sample data sources, one and two, both non-empty data sets. Here is an empty dataframe.

I would like the concatenated result to have the column headers...

 'AT','AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country','DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT','PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'

and to have an empty dataframe's headers appended to this row (if they are new):

 'A', 'AT','AccountNum', 'AcctType', 'Amount', 'B', 'C', 'City', 'Comment', 'Country', 'D', 'DuplicateAddressFlag', 'E', 'F', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'G', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'

I welcome feedback on the best method to do this.

As the answer below details, this is a rather unexpected result:

Unfortunately, due to the sensitivity of this material, I cannot share the actual data. Leading up to what is presented in the gist is the following:

A = data[data['RRT'] == 'A']  # select just the rows where RRT == 'A' from the dataframe "data"
B = data[data['RRT'] == 'B']
C = data[data['RRT'] == 'C']
D = data[data['RRT'] == 'D']

For each of the new data frames I then apply this logic:

for column_name, column in A.transpose().iterrows():
    AColumns= A[['ANum','RTID', 'Description','Type','Status', 'AD', 'CD', 'OD', 'RCD']]  #get select columns indexed with dataframe, "A"

When I reference the bound method count (without calling it) on the columns selected from an empty dataframe A:

AColumns.count

This is the output:

<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>

Finally, I imported the CSV with the following:

data=pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines = False, iterator=True,  chunksize=1000)
data=pd.concat([chunk for chunk in data], ignore_index=True)

I am not certain what else I can provide. The concatenation method works with all other data frames that are needed to meet a requirement. I have also looked at the Pandas internals.py and the full trace. Either I have too many columns with NaN, duplicate column names or mixed dtypes (the latter being the least likely culprit).

Thank you again for your guidance.

Answered by remi

During one of our projects we experienced the same error. After debugging we found the problem. One of our dataframes had 2 columns with the same name. After renaming one of the columns our problem was solved.
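
The renaming step described above could look something like the following. This is only a minimal sketch: the helper name make_columns_unique and the numeric suffix scheme are assumptions for illustration, not part of the original answer, and it assumes that a suffixed name such as RTID_1 does not already exist in the dataframe.

```python
import pandas as pd

def make_columns_unique(df):
    """Return a copy of df whose repeated column names get a numeric suffix."""
    seen = {}  # column name -> how many times it has appeared so far
    new_names = []
    for name in df.columns:
        count = seen.get(name, 0)
        # keep the first occurrence as-is, suffix every later one
        new_names.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    renamed = df.copy()
    renamed.columns = new_names
    return renamed

# hypothetical frame with the column RTID appearing twice
df = pd.DataFrame([[1, 2, 3]], columns=["RTID", "State", "RTID"])
unique_df = make_columns_unique(df)
print(list(unique_df.columns))  # ['RTID', 'State', 'RTID_1']
```

Running each dataframe through such a helper before pd.concat is one way to rule out duplicate names as the cause.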

Answered by Abramodj

This often means that you have two columns with the same names in one of the dataframes.

You can check if this is the case by looking at the output of

len(df.columns) > len(np.unique(df.columns))

for each dataframe df that you are trying to concatenate.

You can identify the culprit columns by using Counter, for example:

from collections import Counter
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
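
When several dataframes are involved, the same check can be run over the whole list in one pass. The sketch below uses pandas' own Index.duplicated instead of Counter; the helper name find_duplicate_columns and the sample frames are assumptions made for illustration.

```python
import pandas as pd

def find_duplicate_columns(frames):
    """Map each frame's position in the list to its repeated column names."""
    report = {}
    for position, frame in enumerate(frames):
        # duplicated() flags every occurrence of a name after the first
        repeated = frame.columns[frame.columns.duplicated()].unique().tolist()
        if repeated:
            report[position] = repeated
    return report

# hypothetical inputs: the second frame has the column A twice
df1 = pd.DataFrame([[1, 2]], columns=["A", "B"])
df2 = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "A"])
report = find_duplicate_columns([df1, df2])
print(report)  # {1: ['A']}
```

An empty result suggests duplicate column names are not the cause and another explanation should be considered.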

Answered by Vincent Claes

I have noticed that this error can occur when concatenating or appending an empty dataframe. Try the following example:

    my_headers = ['A', 'B', 'C']

I have a DataFrame df_input with values and where the headers are not necessarily the same as my_headers.

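    dictionary = {element: None for element in my_headers}
    df = pd.DataFrame(dictionary, index=[0])
    # append the two dataframes (DataFrame.append was removed in pandas 2.0;
    # pd.concat([df_input, df]) is the replacement there)
    df_final = df_input.append(df)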
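
For the goal stated in the question (keeping the headers of an empty dataframe in the combined result), one possible approach is to concatenate only the non-empty frames and then reindex the columns. This is a sketch under assumptions, not the answer author's method: the helper name concat_keeping_all_headers and the sample data are invented for illustration, and it assumes column names are unique within each frame, since reindex does not accept duplicated column labels.

```python
import pandas as pd

def concat_keeping_all_headers(frames):
    """Concatenate the non-empty frames, then add the headers of the empty ones."""
    # collect every column name once, in first-seen order
    all_columns = []
    for frame in frames:
        for name in frame.columns:
            if name not in all_columns:
                all_columns.append(name)

    non_empty = [frame for frame in frames if not frame.empty]
    if non_empty:
        combined = pd.concat(non_empty, ignore_index=True)
    else:
        combined = pd.DataFrame()
    # reindex adds any missing headers as all-NaN columns
    return combined.reindex(columns=all_columns)

# hypothetical inputs: one frame with data, one with headers only
df_input = pd.DataFrame({"AT": [1, 2], "Zip": ["10001", "94105"]})
df_empty = pd.DataFrame(columns=["A", "AT", "B"])
result = concat_keeping_all_headers([df_input, df_empty])
print(list(result.columns))  # ['AT', 'Zip', 'A', 'B']
```

Columns that exist only in the empty frame show up filled with NaN, and no placeholder row of None values is needed.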

Answered by maxymoo

I can't reproduce your error; it works fine for me:

df1 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/42708e6a3ca0aed9b79b/raw/f37738994c3285e1b670d3926e716ae027dc30bc/sample_data.csv')
df2 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/26eb4ce1578e0844eb82/raw/23d9063dad7793d87a2fed2275857c85b59d56bb/sample2.csv')
df3 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/0721bd8b71416b54eccd/raw/b7ecae63beff88bd076a93d83500eb5fa67e1278/empty_df.csv')
pd.concat([df1,df2,df3], keys = ['one', 'two','three'], ignore_index=True).head()

Out[68]: 
   'B'  'C'  'D'  'E'  'F'  'G'  'A'  AT  AccountNum  AcctType ...   
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
2  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
3  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    

   ToAccountNum  ToAccountT  TransferAmount  TransferMade  TransferTimestamp  
0           NaN         NaN               4          True      1/7/2000 0:00   
1           NaN         NaN               4          True      1/8/2000 0:00   
2           NaN         NaN               6          True      1/9/2000 0:00   
3           NaN         NaN               6          True     1/10/2000 0:00   
4           NaN         NaN               0         False     1/11/2000 0:00   

   Ttype  Unnamed: 0  WA   WC  Zip  
0      D           4 NaN  NaN  NaN  
1      D           5 NaN  NaN  NaN  
2      D          13 NaN  NaN  NaN  
3      D          14 NaN  NaN  NaN  
4      T          25 NaN  NaN  NaN  

[5 rows x 41 columns]