Original URL: http://stackoverflow.com/questions/39792933/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
Resampling Error : cannot reindex a non-unique index with a method or limit
Asked by Arij SEDIRI
I am using Pandas to structure and process data.
I have a DataFrame with dates as the index, plus Id and bitrate columns. I want to group my data by Id and, at the same time, resample the timestamps belonging to each Id, keeping the bitrate value with each row.
For example, given:
import pandas as pd

df = pd.DataFrame(
    {'Id': ['CODI126640013.ts', 'CODI126622312.ts'],
     'beginning_time': ['2016-07-08 02:17:42', '2016-07-08 02:05:35'],
     'end_time': ['2016-07-08 02:17:55', '2016-07-08 02:26:11'],
     'bitrate': ['3750000', '3750000'],
     'type': ['vod', 'catchup'],
     'unique_id': ['f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30', 'f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22']})
which gives:
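Printing this frame shows something like the following (column order may vary with the pandas version):

print (df)

                 Id       beginning_time  bitrate             end_time  \
0  CODI126640013.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:55
1  CODI126622312.ts  2016-07-08 02:05:35  3750000  2016-07-08 02:26:11

      type                             unique_id
0      vod  f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30
1  catchup  f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22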
This is my code to melt the two date columns into a single dates column, keeping the Id and the bitrate alongside every timestamp:
df = df.drop(['type', 'unique_id'], axis=1)            # keep only Id, dates and bitrate
df.beginning_time = pd.to_datetime(df.beginning_time)
df.end_time = pd.to_datetime(df.end_time)
# stack beginning_time and end_time into one 'dates' column
df = pd.melt(df, id_vars=['Id', 'bitrate'], value_name='dates').drop('variable', axis=1)
df.set_index('dates', inplace=True)
which gives:
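print(df) now shows roughly:

                                   Id  bitrate
dates
2016-07-08 02:17:42  CODI126640013.ts  3750000
2016-07-08 02:05:35  CODI126622312.ts  3750000
2016-07-08 02:17:55  CODI126640013.ts  3750000
2016-07-08 02:26:11  CODI126622312.ts  3750000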
And now, time to resample! This is my code:
print (df.groupby('Id').resample('1S').ffill())
And this is the result:
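One row per second and per Id, spanning each Id's first to last timestamp; abbreviated, the output looks something like:

                                                    Id  bitrate
Id               dates
CODI126622312.ts 2016-07-08 02:05:35  CODI126622312.ts  3750000
                 2016-07-08 02:05:36  CODI126622312.ts  3750000
                 ...
                 2016-07-08 02:26:11  CODI126622312.ts  3750000
CODI126640013.ts 2016-07-08 02:17:42  CODI126640013.ts  3750000
                 ...
                 2016-07-08 02:17:55  CODI126640013.ts  3750000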
This is exactly what I want! But I have 38279 logs with the same columns, and when I do the same thing on them I get an error message. The first part (the melt) works perfectly, but df.groupby('Id').resample('1S').ffill() gives this error message:
ValueError: cannot reindex a non-unique index with a method or limit
Any ideas? Thanks!
Answered by jezrael
It seems there is a problem with duplicate values in the beginning_time and end_time columns. I'll try to simulate it:
df = pd.DataFrame(
    {'Id': ['CODI126640013.ts', 'CODI126622312.ts', 'a'],
     'beginning_time': ['2016-07-08 02:17:42', '2016-07-08 02:17:42', '2016-07-08 02:17:45'],
     'end_time': ['2016-07-08 02:17:42', '2016-07-08 02:17:42', '2016-07-08 02:17:42'],
     'bitrate': ['3750000', '3750000', '444'],
     'type': ['vod', 'catchup', 's'],
     'unique_id': ['f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30', 'f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22', 'w']})
print (df)
                 Id       beginning_time  bitrate             end_time  \
0  CODI126640013.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:42
1  CODI126622312.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:42
2                 a  2016-07-08 02:17:45      444  2016-07-08 02:17:42

      type                             unique_id
0      vod  f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30
1  catchup  f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22
2        s                                     w
df = df.drop(['type', 'unique_id'], axis=1)
df.beginning_time = pd.to_datetime(df.beginning_time)
df.end_time = pd.to_datetime(df.end_time)
df = pd.melt(df, id_vars=['Id','bitrate'], value_name='dates').drop('variable', axis=1)
df.set_index('dates', inplace=True)
print (df)
                                   Id  bitrate
dates
2016-07-08 02:17:42  CODI126640013.ts  3750000
2016-07-08 02:17:42  CODI126622312.ts  3750000
2016-07-08 02:17:45                 a      444
2016-07-08 02:17:42  CODI126640013.ts  3750000
2016-07-08 02:17:42  CODI126622312.ts  3750000
2016-07-08 02:17:42                 a      444
print (df.groupby('Id').resample('1S').ffill())
ValueError: cannot reindex a non-unique index with a method or limit
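To see which groups are affected, a quick check (a minimal sketch, not part of the original answer) flags every Id whose timestamp index contains duplicates:

print (df.groupby('Id').apply(lambda g: g.index.duplicated().any()))

which would print something like:

Id
CODI126622312.ts     True
CODI126640013.ts     True
a                   False
dtype: bool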
One possible solution is to add drop_duplicates and use the old way of resampling with groupby:
df = df.drop(['type', 'unique_id'], axis=1)
df.beginning_time = pd.to_datetime(df.beginning_time)
df.end_time = pd.to_datetime(df.end_time)
df = pd.melt(df, id_vars=['Id','bitrate'], value_name='dates').drop('variable', axis=1)
print (df.groupby('Id').apply(lambda x: x.drop_duplicates('dates')
                                         .set_index('dates')
                                         .resample('1S')
                                         .ffill()))
                                                    Id  bitrate
Id               dates
CODI126622312.ts 2016-07-08 02:17:42  CODI126622312.ts  3750000
CODI126640013.ts 2016-07-08 02:17:42  CODI126640013.ts  3750000
a                2016-07-08 02:17:42                 a      444
                 2016-07-08 02:17:43                 a      444
                 2016-07-08 02:17:44                 a      444
                 2016-07-08 02:17:45                 a      444
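Alternatively (a sketch along the same lines, not taken verbatim from the answer), you can drop the duplicate (Id, dates) pairs once up front; the original groupby/resample call then works because each group's index is unique:

df = df.drop_duplicates(subset=['Id', 'dates'])  # keep one row per (Id, timestamp) pair
df = df.set_index('dates')
print (df.groupby('Id').resample('1S').ffill())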
You can also check for those duplicates with boolean indexing (on the frame before the melt):
print (df[df.beginning_time == df.end_time])

                 Id       beginning_time  bitrate             end_time  \
0  CODI126640013.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:42
1  CODI126622312.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:42

      type                             unique_id
0      vod  f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30
1  catchup  f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22
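Rows where beginning_time equals end_time yield the same timestamp twice after the melt, and within a single Id that is exactly what makes the dates index non-unique and triggers the error above.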