Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow original URL: http://stackoverflow.com/questions/14180615/

Time: 2020-09-13 20:34:13  Source: igfitidea

problems with reindexing dataframes: Reindexing only valid with uniquely valued Index objects

dataframe pandas reindex

Asked by mspadaccino

I am seeing really strange behaviour when trying to reindex a DataFrame in pandas. My pandas version is 0.10.0 and I use Python 2.7. Basically, when I load a DataFrame:

eurusd = pd.DataFrame.load('EUR_USD_30Min.df').drop_duplicates().dropna()

eurusd

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 119710 entries, 2003-02-02 17:30:00 to 2012-12-28 17:00:00
Data columns:
open     119710  non-null values
high     119710  non-null values
low      119710  non-null values
close    119710  non-null values
dtypes: float64(4)

and then I try to reindex inside a larger date range:

newindex  = pd.DateRange(datetime.datetime(2002,1,1), datetime.datetime(2012,12,31), offset=pd.datetools.Minute(30))

newindex

<class 'pandas.tseries.index.DatetimeIndex'>
[2002-01-01 00:00:00, ..., 2012-12-31 00:00:00]
Length: 192817, Freq: 30T, Timezone: None
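For readers on a recent pandas release: `pd.DateRange` and `pd.datetools` come from the 0.10-era API used in the question and, as far as I know, are no longer available. A minimal sketch of the same index built with `pd.date_range` follows, assuming the `"30min"` frequency alias is the intended equivalent of `Minute(30)`:

```python
import datetime

import pandas as pd

# Same start, end and 30-minute step as in the question,
# built with the current date_range constructor.
newindex = pd.date_range(
    start=datetime.datetime(2002, 1, 1),
    end=datetime.datetime(2012, 12, 31),
    freq="30min",
)

print(len(newindex))
print(newindex[0], newindex[-1])
```

The length should match the 192817 entries shown in the output above.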

I get strange behaviour when trying to reindex the DataFrame. If I reindex a larger slice of the dataset, I get this error:

eurusd[29558:29560].reindex(index=newindex)

Exception: Reindexing only valid with uniquely valued Index objects

But, if I do the same for two subsets of the data above, I don't get the error:

Here's the first subset, with no problems,

eurusd[29558:29559].reindex(index=newindex)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 192817 entries, 2002-01-01 00:00:00 to 2012-12-31 00:00:00
Freq: 30T
Data columns:
open     1  non-null values
high     1  non-null values
low      1  non-null values
close    1  non-null values
dtypes: float64(4)

and here's the second subset, still no problems,

eurusd[29559:29560].reindex(index=newindex)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 192817 entries, 2002-01-01 00:00:00 to 2012-12-31 00:00:00
Freq: 30T
Data columns:
open     1  non-null values
high     1  non-null values
low      1  non-null values
close    1  non-null values
dtypes: float64(4)

This is really driving me crazy, and I cannot understand the reason for it. The DataFrame seems to be 'clean' of duplicate rows and duplicated index values... I can provide the pickle file for the DataFrame if you want.
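One explanation that fits these symptoms is that rows 29558 and 29559 carry the same timestamp but different prices. `drop_duplicates()` compares column values only, so it would leave both rows in place and the index would still hold a repeated label. The sketch below reproduces that pattern on a made-up two-row frame. The names `sample` and `target` and all of the values are hypothetical, not taken from the original data, and the exception type is the `ValueError` that recent pandas versions raise rather than the plain `Exception` shown above.

```python
import pandas as pd

# Hypothetical data: two rows with the same timestamp but different prices.
idx = pd.DatetimeIndex(["2003-02-02 17:30", "2003-02-02 17:30"])
sample = pd.DataFrame({"open": [1.0, 2.0], "close": [1.5, 2.5]}, index=idx)

# drop_duplicates() looks at column values only, so both rows survive.
deduped = sample.drop_duplicates()

target = pd.date_range("2003-02-02 17:00", periods=4, freq="30min")

# Each row on its own has a unique index, so reindexing works.
first_ok = deduped.iloc[0:1].reindex(target)
second_ok = deduped.iloc[1:2].reindex(target)

# Both rows together share one label, so reindexing is rejected.
try:
    deduped.reindex(target)
    raised = False
except ValueError:
    raised = True

print(len(deduped), deduped.index.is_unique, raised)
```

Checking `eurusd.index.is_unique` on the real frame would confirm or rule out this explanation.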

我真的对此很疯狂,无法理解其中的原因。数据框似乎从重复项和重复索引中“干净”了......如果你愿意,我可以为数据框提供pickle文件。

Answered by Andy Hayden

You could group by the index and take the first entry (see docs):

df.groupby(level=0).first()

Example:

In [1]: df = pd.DataFrame([[1], [2]], index=[1, 1])

In [2]: df
Out[2]: 
   0
1  1
1  2

In [3]: df.groupby(level=0).first()
Out[3]: 
   0
1  1
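As a complement to the groupby approach, more recent pandas versions also offer `Index.duplicated`, which can be used to keep the first row for each index label before reindexing. This is a sketch of an alternative, not part of the original answer. It reuses the two-row example above, and the target labels `[0, 1, 2]` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1], [2]], index=[1, 1])

# True for every repeat of a label after its first occurrence.
dup_mask = df.index.duplicated(keep="first")

# Keep only the first row per index label.
unique_df = df[~dup_mask]

# The groupby form from the answer, for comparison.
via_groupby = df.groupby(level=0).first()

# With a unique index, reindexing no longer raises.
result = unique_df.reindex([0, 1, 2])

print(unique_df)
print(result)
```

The two forms agree on this example, but they are not always interchangeable: `groupby(...).first()` takes the first non-null value per column within each group, while the boolean mask keeps the first row as a whole.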