How to drop extra copy of duplicate index of Pandas Series?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14395678/
Asked by bigbug
I have a Series s with a duplicate index:
>>> s
STK_ID  RPT_Date
600809  20061231    demo_str
        20070331    demo_str
        20070630    demo_str
        20070930    demo_str
        20071231    demo_str
        20060331    demo_str
        20060630    demo_str
        20060930    demo_str
        20061231    demo_str
        20070331    demo_str
        20070630    demo_str
Name: STK_Name, Length: 11
I want to keep the unique rows and only one copy of each duplicated row, using:
s[s.index.unique()]
Pandas 0.10.1.dev-f7f7e13 gives the following error message:
>>> s[s.index.unique()]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "d:\Python27\lib\site-packages\pandas\core\series.py", line 515, in __getitem__
return self._get_with(key)
File "d:\Python27\lib\site-packages\pandas\core\series.py", line 558, in _get_with
return self.reindex(key)
File "d:\Python27\lib\site-packages\pandas\core\series.py", line 2361, in reindex
level=level, limit=limit)
File "d:\Python27\lib\site-packages\pandas\core\index.py", line 2063, in reindex
limit=limit)
File "d:\Python27\lib\site-packages\pandas\core\index.py", line 2021, in get_indexer
raise Exception('Reindexing only valid with uniquely valued Index '
Exception: Reindexing only valid with uniquely valued Index objects
>>>
So how can I efficiently drop the extra duplicate rows of a Series, keeping the unique rows and only one copy of each duplicated row? (Preferably in one line.)
Answered by Zelazny7
You can group by the index and apply a function that returns one value per index group. Here, I take the first value:
In [1]: s = Series(range(10), index=[1,2,2,2,5,6,7,7,7,8])
In [2]: s
Out[2]:
1 0
2 1
2 2
2 3
5 4
6 5
7 6
7 7
7 8
8 9
In [3]: s.groupby(s.index).first()
Out[3]:
1 0
2 1
5 4
6 5
7 6
8 9
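As a self-contained sketch of the same idea, assuming a reasonably recent pandas imported as pd: grouping with level=0 is an alternative spelling of grouping by s.index, and last() is shown only as a variation on first().

```python
import pandas as pd

# Same data as in the answer above: index labels 2 and 7 are repeated.
s = pd.Series(range(10), index=[1, 2, 2, 2, 5, 6, 7, 7, 7, 8])

# Group rows by their index label and keep one value per label.
first_per_label = s.groupby(level=0).first()
last_per_label = s.groupby(level=0).last()

print(first_per_label.tolist())  # [0, 1, 4, 5, 6, 9]
print(last_per_label.tolist())   # [0, 3, 4, 5, 8, 9]
```

Note that groupby sorts the group keys by default, so the result is ordered by label rather than by first appearance.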
UPDATE
Addressing BigBug's comment about crashing when passing a MultiIndex to Series.groupby():
In [1]: s
Out[1]:
STK_ID  RPT_Date
600809  20061231    demo
        20070331    demo
        20070630    demo
        20070331    demo
In [2]: s.reset_index().groupby(s.index.names).first()
Out[2]:
                     0
STK_ID  RPT_Date
600809  20061231  demo
        20070331  demo
        20070630  demo
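On current pandas versions the reset_index round trip may not be needed, since groupby accepts index level names directly. The sketch below assumes a hypothetical reconstruction of the four-row Series above; the values and the name STK_Name are borrowed from the question for illustration only.

```python
import pandas as pd

# Hypothetical reconstruction of the four-row Series shown above.
index = pd.MultiIndex.from_tuples(
    [
        (600809, 20061231),
        (600809, 20070331),
        (600809, 20070630),
        (600809, 20070331),  # repeated index entry
    ],
    names=["STK_ID", "RPT_Date"],
)
s = pd.Series(["demo"] * 4, index=index, name="STK_Name")

# Group on every index level by name and keep the first value per group.
# The result is still a Series, not a one-column DataFrame.
deduped = s.groupby(level=list(s.index.names)).first()

print(len(deduped))  # 3
```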
Answered by Anton Protopopov
You could subset your data using duplicated on the index (it keeps the first occurrence by default). With @Zelazny7's example:
s = pd.Series(range(10), index=[1,2,2,2,5,6,7,7,7,8])
In [130]: s[~s.index.duplicated()]
Out[130]:
1 0
2 1
5 4
6 5
7 6
8 9
dtype: int64
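A short sketch of the keep parameter, plus the same mask applied to a MultiIndex. The three-row MultiIndex Series here is a hypothetical stand-in for the data in the question, not the asker's actual data.

```python
import pandas as pd

s = pd.Series(range(10), index=[1, 2, 2, 2, 5, 6, 7, 7, 7, 8])

# keep="first" (the default) flags every repeat after the first occurrence.
keep_first = s[~s.index.duplicated(keep="first")]
# keep="last" flags every repeat except the last occurrence.
keep_last = s[~s.index.duplicated(keep="last")]

print(keep_first.tolist())  # [0, 1, 4, 5, 6, 9]
print(keep_last.tolist())   # [0, 3, 4, 5, 8, 9]

# Hypothetical MultiIndex Series with one repeated (STK_ID, RPT_Date) pair.
mi = pd.MultiIndex.from_tuples(
    [(600809, 20061231), (600809, 20070331), (600809, 20061231)],
    names=["STK_ID", "RPT_Date"],
)
m = pd.Series(["demo_str"] * 3, index=mi)
m_unique = m[~m.index.duplicated()]

print(len(m_unique))  # 2
```

Unlike the groupby approach, boolean masking leaves the surviving rows in their original order.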
Answered by bmu
One way would be using drop and index.get_duplicates. Note that, as the output below shows, this removes every row whose index value is duplicated rather than keeping one copy:
In [43]: df
Out[43]:
                       String
STK_ID RPT_Date
600809 20061231  demo_string
       20070331  demo_string
       20070630  demo_string
       20070930  demo_string
       20071231  demo_string
       20060331  demo_string
       20060630  demo_string
       20060930  demo_string
       20061231  demo_string
       20070331  demo_string
       20070630  demo_string
In [44]: df.drop(df.index.get_duplicates())
Out[44]:
                       String
STK_ID RPT_Date
600809 20070930  demo_string
       20071231  demo_string
       20060331  demo_string
       20060630  demo_string
       20060930  demo_string
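As far as I know, Index.get_duplicates was deprecated in pandas 0.23 and removed in 1.0, so the call above may fail on a current install. A rough equivalent can be sketched with index.duplicated(keep=False). The four-row frame below is a hypothetical, shortened version of the one above.

```python
import pandas as pd

# Hypothetical small frame modelled on the one in this answer.
index = pd.MultiIndex.from_tuples(
    [
        (600809, 20061231),
        (600809, 20070331),
        (600809, 20070930),
        (600809, 20061231),  # repeated index entry
    ],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame({"String": ["demo_string"] * 4}, index=index)

# keep=False marks every row whose index value occurs more than once,
# so negating the mask drops all copies (the behavior shown above).
all_copies_dropped = df[~df.index.duplicated(keep=False)]

# keep="first" leaves one copy of each duplicated row instead,
# which is what the question asks for.
one_copy_kept = df[~df.index.duplicated(keep="first")]

print(len(all_copies_dropped))  # 2
print(len(one_copy_kept))       # 3
```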

