Original URL: http://stackoverflow.com/questions/39792933/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
Resampling Error : cannot reindex a non-unique index with a method or limit
Asked by Arij SEDIRI
I am using Pandas to structure and process data.
I have a DataFrame with dates as the index, plus Id and bitrate columns. I want to group my data by Id and, at the same time, resample the timestamps belonging to each Id, keeping the bitrate value with each row.
For example, given:
import pandas as pd

df = pd.DataFrame(
    {'Id': ['CODI126640013.ts', 'CODI126622312.ts'],
     'beginning_time': ['2016-07-08 02:17:42', '2016-07-08 02:05:35'],
     'end_time': ['2016-07-08 02:17:55', '2016-07-08 02:26:11'],
     'bitrate': ['3750000', '3750000'],
     'type': ['vod', 'catchup'],
     'unique_id': ['f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30', 'f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22']})
which gives:
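Printing this frame shows something like the following (column order may vary with the pandas version):

print (df)

                 Id       beginning_time  bitrate             end_time  \
0  CODI126640013.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:55
1  CODI126622312.ts  2016-07-08 02:05:35  3750000  2016-07-08 02:26:11

      type                             unique_id
0      vod  f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30
1  catchup  f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22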
This is my code to melt the two date columns into a single dates column, keeping the Id and the bitrate alongside every timestamp:
df = df.drop(['type', 'unique_id'], axis=1)            # keep only Id, dates and bitrate
df.beginning_time = pd.to_datetime(df.beginning_time)
df.end_time = pd.to_datetime(df.end_time)
# stack beginning_time and end_time into one 'dates' column
df = pd.melt(df, id_vars=['Id', 'bitrate'], value_name='dates').drop('variable', axis=1)
df.set_index('dates', inplace=True)
which gives:
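print(df) now shows roughly:

                                   Id  bitrate
dates
2016-07-08 02:17:42  CODI126640013.ts  3750000
2016-07-08 02:05:35  CODI126622312.ts  3750000
2016-07-08 02:17:55  CODI126640013.ts  3750000
2016-07-08 02:26:11  CODI126622312.ts  3750000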
And now, time to resample! This is my code:
print (df.groupby('Id').resample('1S').ffill())
And this is the result:
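One row per second and per Id, spanning each Id's first to last timestamp; abbreviated, the output looks something like:

                                                    Id  bitrate
Id               dates
CODI126622312.ts 2016-07-08 02:05:35  CODI126622312.ts  3750000
                 2016-07-08 02:05:36  CODI126622312.ts  3750000
                 ...
                 2016-07-08 02:26:11  CODI126622312.ts  3750000
CODI126640013.ts 2016-07-08 02:17:42  CODI126640013.ts  3750000
                 ...
                 2016-07-08 02:17:55  CODI126640013.ts  3750000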
This is exactly what I want! But I have 38279 logs with the same columns, and when I do the same thing on them I get an error message. The first part (the melt) works perfectly, but df.groupby('Id').resample('1S').ffill() gives this error message:
ValueError: cannot reindex a non-unique index with a method or limit
Any ideas? Thanks!
Answered by jezrael
It seems there is a problem with duplicate values in the beginning_time and end_time columns. I'll try to simulate it:
df = pd.DataFrame(
    {'Id': ['CODI126640013.ts', 'CODI126622312.ts', 'a'],
     'beginning_time': ['2016-07-08 02:17:42', '2016-07-08 02:17:42', '2016-07-08 02:17:45'],
     'end_time': ['2016-07-08 02:17:42', '2016-07-08 02:17:42', '2016-07-08 02:17:42'],
     'bitrate': ['3750000', '3750000', '444'],
     'type': ['vod', 'catchup', 's'],
     'unique_id': ['f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30', 'f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22', 'w']})
print (df)
                 Id       beginning_time  bitrate             end_time  \
0  CODI126640013.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:42
1  CODI126622312.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:42
2                 a  2016-07-08 02:17:45      444  2016-07-08 02:17:42

      type                             unique_id
0      vod  f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30
1  catchup  f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22
2        s                                     w
df = df.drop(['type', 'unique_id'], axis=1)
df.beginning_time = pd.to_datetime(df.beginning_time)
df.end_time = pd.to_datetime(df.end_time)
df = pd.melt(df, id_vars=['Id','bitrate'], value_name='dates').drop('variable', axis=1)
df.set_index('dates', inplace=True)
print (df)
                                   Id  bitrate
dates
2016-07-08 02:17:42  CODI126640013.ts  3750000
2016-07-08 02:17:42  CODI126622312.ts  3750000
2016-07-08 02:17:45                 a      444
2016-07-08 02:17:42  CODI126640013.ts  3750000
2016-07-08 02:17:42  CODI126622312.ts  3750000
2016-07-08 02:17:42                 a      444
print (df.groupby('Id').resample('1S').ffill())
ValueError: cannot reindex a non-unique index with a method or limit
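To see which groups are affected, a quick check (a minimal sketch, not part of the original answer) flags every Id whose timestamp index contains duplicates:

print (df.groupby('Id').apply(lambda g: g.index.duplicated().any()))

which would print something like:

Id
CODI126622312.ts     True
CODI126640013.ts     True
a                   False
dtype: bool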
One possible solution is to add drop_duplicates and use the old way of resampling with groupby:
df = df.drop(['type', 'unique_id'], axis=1)
df.beginning_time = pd.to_datetime(df.beginning_time)
df.end_time = pd.to_datetime(df.end_time)
df = pd.melt(df, id_vars=['Id','bitrate'], value_name='dates').drop('variable', axis=1)
print (df.groupby('Id').apply(lambda x: x.drop_duplicates('dates')
                                         .set_index('dates')
                                         .resample('1S')
                                         .ffill()))
                                                    Id  bitrate
Id               dates
CODI126622312.ts 2016-07-08 02:17:42  CODI126622312.ts  3750000
CODI126640013.ts 2016-07-08 02:17:42  CODI126640013.ts  3750000
a                2016-07-08 02:17:42                 a      444
                 2016-07-08 02:17:43                 a      444
                 2016-07-08 02:17:44                 a      444
                 2016-07-08 02:17:45                 a      444
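Alternatively (a sketch along the same lines, not taken verbatim from the answer), you can drop the duplicate (Id, dates) pairs once up front; the original groupby/resample call then works because each group's index is unique:

df = df.drop_duplicates(subset=['Id', 'dates'])  # keep one row per (Id, timestamp) pair
df = df.set_index('dates')
print (df.groupby('Id').resample('1S').ffill())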
You can also check for those duplicates with boolean indexing (on the frame before the melt):
print (df[df.beginning_time == df.end_time])

                 Id       beginning_time  bitrate             end_time  \
0  CODI126640013.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:42
1  CODI126622312.ts  2016-07-08 02:17:42  3750000  2016-07-08 02:17:42

      type                             unique_id
0      vod  f2514f6b-ce7e-4e1a-8f6a-3ac5d524be30
1  catchup  f2514f6b-ce7e-4e1a-8f6a-3ac5d524bb22
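Rows where beginning_time equals end_time yield the same timestamp twice after the melt, and within a single Id that is exactly what makes the dates index non-unique and triggers the error above.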