Python Pandas: Get duplicated indexes

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/20199129/

Pandas: Get duplicated indexes

Tags: python, indexing, pandas

Asked by Olga Botvinnik

Given a dataframe, I want to get the duplicated indexes, which do not have duplicate values in the columns, and see which values are different.

Specifically, I have this dataframe:

import pandas as pd
!wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed  # shell download; the "!" escape works in IPython/Jupyter
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)

In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False

And some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element in this location), and I want to know what are the different types of repetitive elements for individual locations (each index = a genome location).

I'm guessing this will require some kind of groupby, and hopefully some groupby ninja can help me out.

To simplify even further, if we only have the index and the repeat type,

genome_location1    MIR3
genome_location1    AluJb
genome_location2    Tigger1
genome_location3    AT_rich

So in the output I'd like to see all duplicate indexes and their repeat types, as such:

genome_location1    MIR3
genome_location1    AluJb

EDIT: added toy example
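
For reference, a minimal runnable version of that toy example (a sketch; the column name 'type' is an assumption, taken from the accepted answer below):

import pandas as pd

# Toy frame: index = genome location, one 'type' column of repeat elements
df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'],
)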

Accepted answer by Dan Allan

df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

We added the filter method for this kind of operation. You can also use masking and transform for equivalent results, but this is faster and a little more readable, too.

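Applied to the toy frame built above, both approaches pick out the same rows (a sketch; it assumes that toy df and, as noted below, pandas 0.13+ for filter on a nonunique index):

# filter: keep only groups (index labels) that occur more than once
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

# masking/transform equivalent mentioned above
df[df.groupby(level=0)['type'].transform(len) > 1]['type']

# both return:
# genome_location1     MIR3
# genome_location1    AluJb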

Important:

The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue -- and a related issue with transform on Series -- was fixed for version 0.13, which should be released any day now.

Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try that on a Series with a nonunique index, it too will fail.

There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.

Answered by Roman Pekar

>>> df[df.groupby(level=0).transform(len)['type'] > 1]
                   type
genome_location1   MIR3
genome_location1  AluJb

Answered by Fil

More succinctly:

df[df.groupby(level=0).type.count() > 1]

FYI, for a multi-index:

df[df.groupby(level=[0,1]).type.count() > 1]
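
A quick sketch of the multi-index case (the second index level here, 'strand', and its values are made up purely for illustration):

import pandas as pd

# Hypothetical two-level index: (genome location, strand)
idx = pd.MultiIndex.from_tuples(
    [('genome_location1', '+'), ('genome_location1', '+'),
     ('genome_location2', '-')],
    names=['location', 'strand'],
)
mdf = pd.DataFrame({'type': ['MIR3', 'AluJb', 'Tigger1']}, index=idx)

# rows whose (location, strand) pair appears more than once
mdf[mdf.groupby(level=[0, 1]).type.count() > 1]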

Answered by Matthew

Also useful and very succinct:

df[df.index.duplicated()]

Note that this only returns one of the duplicated rows, so to see all the duplicated rows you'll want this:

df[df.index.duplicated(keep=False)]

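On the toy frame from the question, the two calls differ roughly like this (expected output shown as comments, assuming the df sketched earlier):

df[df.index.duplicated()]
#                    type
# genome_location1  AluJb   <- only the later occurrence(s)

df[df.index.duplicated(keep=False)]
#                    type
# genome_location1   MIR3
# genome_location1  AluJb   <- every duplicated row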

Answered by feetwet

Even faster and better:

df.index.get_duplicates()

Answered by Jwok

As of 9/21/18, pandas raises FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release, and instead suggests the following:

df.index[df.index.duplicated()].unique()
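
A short sketch of how that can be used, again on the toy df from above: it returns the duplicated labels themselves, and .loc can then pull out all the corresponding rows:

dup_labels = df.index[df.index.duplicated()].unique()
# Index(['genome_location1'], dtype='object')

df.loc[dup_labels]
#                    type
# genome_location1   MIR3
# genome_location1  AluJb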