Python Concat DataFrame Reindexing 仅对唯一值的 Index 对象有效

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/35084071/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 15:58:27  来源:igfitidea点击:

Concat DataFrame Reindexing only valid with uniquely valued Index objects

pythonnumpypandas

提问by noidea

I am trying to concat the following:

我正在尝试连接以下内容:

df1

df1

    price   side    timestamp
timestamp           
2016-01-04 00:01:15.631331072   0.7286  2   1451865675631331
2016-01-04 00:01:15.631399936   0.7286  2   1451865675631400
2016-01-04 00:01:15.631860992   0.7286  2   1451865675631861
2016-01-04 00:01:15.631866112   0.7286  2   1451865675631866

and

df2

df2

    bid bid_size    offer   offer_size
timestamp               
2016-01-04 00:00:31.331441920   0.7284  4000000 0.7285  1000000
2016-01-04 00:00:53.631324928   0.7284  4000000 0.7290  4000000
2016-01-04 00:01:03.131234048   0.7284  5000000 0.7286  4000000
2016-01-04 00:01:12.131444992   0.7285  1000000 0.7286  4000000
2016-01-04 00:01:15.631364096   0.7285  4000000 0.7290  4000000

With

 data = pd.concat([df1,df2], axis=1)  

But I get the follwing output:

但我得到以下输出:

InvalidIndexError                         Traceback (most recent call last)
<ipython-input-38-2e88458f01d7> in <module>()
----> 1 data = pd.concat([df1,df2], axis=1)
      2 data = data.fillna(method='pad')
      3 data = data.fillna(method='bfill')
      4 data['timestamp'] =  data.index.values#converting to datetime
      5 data['timestamp'] = pd.to_datetime(data['timestamp'])#converting to datetime

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    810                        keys=keys, levels=levels, names=names,
    811                        verify_integrity=verify_integrity,
--> 812                        copy=copy)
    813     return op.get_result()
    814 

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    947         self.copy = copy
    948 
--> 949         self.new_axes = self._get_new_axes()
    950 
    951     def get_result(self):

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_new_axes(self)
   1013                 if i == self.axis:
   1014                     continue
-> 1015                 new_axes[i] = self._get_comb_axis(i)
   1016         else:
   1017             if len(self.join_axes) != ndim - 1:

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_comb_axis(self, i)
   1039                 raise TypeError("Cannot concatenate list of %s" % types)
   1040 
-> 1041         return _get_combined_index(all_indexes, intersect=self.intersect)
   1042 
   1043     def _get_concat_axis(self):

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in _get_combined_index(indexes, intersect)
   6120             index = index.intersection(other)
   6121         return index
-> 6122     union = _union_indexes(indexes)
   6123     return _ensure_index(union)
   6124 

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in _union_indexes(indexes)
   6149 
   6150         if hasattr(result, 'union_many'):
-> 6151             return result.union_many(indexes[1:])
   6152         else:
   6153             for other in indexes[1:]:

/usr/local/lib/python2.7/site-packages/pandas/tseries/index.pyc in union_many(self, others)
    959             else:
    960                 tz = this.tz
--> 961                 this = Index.union(this, other)
    962                 if isinstance(this, DatetimeIndex):
    963                     this.tz = tz

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in union(self, other)
   1553                 result.extend([x for x in other._values if x not in value_set])
   1554         else:
-> 1555             indexer = self.get_indexer(other)
   1556             indexer, = (indexer == -1).nonzero()
   1557 

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
   1890 
   1891         if not self.is_unique:
-> 1892             raise InvalidIndexError('Reindexing only valid with uniquely'
   1893                                     ' valued Index objects')
   1894 

InvalidIndexError: Reindexing only valid with uniquely valued Index objects  

I have removed additional columns and removed duplicates and NA where there could be a conflict - but I simply do not know whats wrong.

我已经删除了额外的列并删除了可能存在冲突的重复项和 NA - 但我根本不知道出了什么问题。

Please help Thanks

请帮忙谢谢

采纳答案by unutbu

pd.concatrequires that the indicesbe unique. To remove rows with duplicate indices, use

pd.concat要求索引是唯一的。要删除具有重复索引的行,请使用

df = df.loc[~df.index.duplicated(keep='first')]


import pandas as pd
from pandas import Timestamp

df1 = pd.DataFrame(
    {'price': [0.7286, 0.7286, 0.7286, 0.7286],
     'side': [2, 2, 2, 2],
     'timestamp': [1451865675631331, 1451865675631400,
                  1451865675631861, 1451865675631866]},
    index=pd.DatetimeIndex(['2000-1-1', '2000-1-1', '2001-1-1', '2002-1-1']))


df2 = pd.DataFrame(
    {'bid': [0.7284, 0.7284, 0.7284, 0.7285, 0.7285],
     'bid_size': [4000000, 4000000, 5000000, 1000000, 4000000],
     'offer': [0.7285, 0.729, 0.7286, 0.7286, 0.729],
     'offer_size': [1000000, 4000000, 4000000, 4000000, 4000000]},
    index=pd.DatetimeIndex(['2000-1-1', '2001-1-1', '2002-1-1', '2003-1-1', '2004-1-1']))


df1 = df1.loc[~df1.index.duplicated(keep='first')]
#              price  side         timestamp
# 2000-01-01  0.7286     2  1451865675631331
# 2001-01-01  0.7286     2  1451865675631861
# 2002-01-01  0.7286     2  1451865675631866

df2 = df2.loc[~df2.index.duplicated(keep='first')]
#                bid  bid_size   offer  offer_size
# 2000-01-01  0.7284   4000000  0.7285     1000000
# 2001-01-01  0.7284   4000000  0.7290     4000000
# 2002-01-01  0.7284   5000000  0.7286     4000000
# 2003-01-01  0.7285   1000000  0.7286     4000000
# 2004-01-01  0.7285   4000000  0.7290     4000000

result = pd.concat([df1, df2], axis=0)
print(result)
               bid  bid_size   offer  offer_size   price  side     timestamp
2000-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
2001-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
2002-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
2000-01-01  0.7284   4000000  0.7285     1000000     NaN   NaN           NaN
2001-01-01  0.7284   4000000  0.7290     4000000     NaN   NaN           NaN
2002-01-01  0.7284   5000000  0.7286     4000000     NaN   NaN           NaN
2003-01-01  0.7285   1000000  0.7286     4000000     NaN   NaN           NaN
2004-01-01  0.7285   4000000  0.7290     4000000     NaN   NaN           NaN


Note there is also pd.join, which can join DataFrames based on their indices, and handle non-unique indices based on the howparameter. Rows with duplicate index are not removed.

请注意,还有pd.join,它可以根据索引加入数据帧,并根据how参数处理非唯一索引。不删除具有重复索引的行。

In [94]: df1.join(df2)
Out[94]: 
             price  side         timestamp     bid  bid_size   offer  \
2000-01-01  0.7286     2  1451865675631331  0.7284   4000000  0.7285   
2000-01-01  0.7286     2  1451865675631400  0.7284   4000000  0.7285   
2001-01-01  0.7286     2  1451865675631861  0.7284   4000000  0.7290   
2002-01-01  0.7286     2  1451865675631866  0.7284   5000000  0.7286   

            offer_size  
2000-01-01     1000000  
2000-01-01     1000000  
2001-01-01     4000000  
2002-01-01     4000000  

In [95]: df1.join(df2, how='outer')
Out[95]: 
             price  side     timestamp     bid  bid_size   offer  offer_size
2000-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7285     1000000
2000-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7285     1000000
2001-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7290     4000000
2002-01-01  0.7286     2  1.451866e+15  0.7284   5000000  0.7286     4000000
2003-01-01     NaN   NaN           NaN  0.7285   1000000  0.7286     4000000
2004-01-01     NaN   NaN           NaN  0.7285   4000000  0.7290     4000000

回答by Nicholas Morley

You can mitigate this error without having to change your data or remove duplicates. Just create a new index with DataFrame.reset_index:

您无需更改数据或删除重复项即可缓解此错误。只需使用DataFrame.reset_index创建一个新索引:

df = df.reset_index()

The old index is kept as a column in your dataframe, but if you don't need it you can do:

旧索引在数据框中保留为一列,但如果您不需要它,您可以执行以下操作:

df = df.reset_index(drop=True)

Some prefer:

有些人更喜欢:

df.reset_index(inplace=True, drop=True)

回答by May Pilijay El

Another thing that might throw this type of errors is when you have a column with a unique value inside (entropy of 0). In this case, you can either inject small Gaussian noise in that column or remove it completely.

可能引发此类错误的另一件事是,当您有一列内部具有唯一值(熵为 0)时。在这种情况下,您可以在该列中注入小的高斯噪声或将其完全去除。

回答by Yapi

best solution from this page: https://pandas.pydata.org/pandas-docs/version/0.20/merging.html

此页面的最佳解决方案:https: //pandas.pydata.org/pandas-docs/version/0.20/merging.html

df = pd.concat([df1, df2], axis=1, join_axes=[df1.index])
df = pd.concat([df1, df2], axis=1, join_axes=[df1.index])