Python: Find element's index in pandas Series

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/18327624/

Time: 2020-08-19 10:29:40  Source: igfitidea

Find element's index in pandas Series

python, pandas

Asked by sashkello

I know this is a very basic question but for some reason I can't find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)

I.e., I'd like something like:

import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # hypothetical method; should output 3

Certainly, it is possible to define such a method with a loop:

def find(s, el):
    # Return the first index label whose value equals el, or None.
    for i in s.index:
        if s[i] == el:
            return i
    return None

print(find(myseries, 7))

but I assume there should be a better way. Is there?

Accepted answer by Viktor Kerkez

>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3

I admit there should be a better way to do this, but it at least avoids iterating over the object in Python and moves the work to the C level.
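
As a minimal sketch of the boolean-mask idea, the lookup can be wrapped so that a missing value gives None instead of an IndexError. The helper name first_index_of is made up for illustration and is not part of pandas.

```python
import pandas as pd

def first_index_of(series, value):
    """Return the index label of the first element equal to value, or None."""
    # The comparison builds a boolean mask; indexing the Index keeps matching labels.
    matches = series.index[series == value]
    return matches[0] if len(matches) > 0 else None

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
found = first_index_of(myseries, 7)     # label of the first 7
missing = first_index_of(myseries, 10)  # 10 is absent, so None
print(found, missing)
```

The whole Series is still scanned, so this is no faster than the one-liner above; it only makes the "not found" case explicit.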

Answer by Jeff

After converting the Series to an Index, you can use get_loc:

In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])

In [3]: pd.Index(myseries).get_loc(7)
Out[3]: 3

In [4]: pd.Index(myseries).get_loc(10)
KeyError: 10

Duplicate handling

In [5]: pd.Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)

If the duplicates are not contiguous, get_loc returns a boolean array instead of a slice:

In [6]: pd.Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False,  True, False, False,  True, False], dtype=bool)
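
Because get_loc can return an integer, a slice, or a boolean array depending on the duplicates, calling code usually has to normalize the result. The following is a rough sketch of one way to do that; positions_of is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

def positions_of(series, value):
    """Return a list of integer positions where series equals value."""
    loc = pd.Index(series).get_loc(value)  # raises KeyError if value is absent
    if isinstance(loc, slice):
        # Contiguous duplicates: expand the slice into explicit positions.
        return list(range(*loc.indices(len(series))))
    if isinstance(loc, np.ndarray):
        # Scattered duplicates: the boolean mask marks the matching positions.
        return np.flatnonzero(loc).tolist()
    return [int(loc)]  # unique value: a single integer position

contiguous = positions_of(pd.Series([1, 1, 2, 2, 3, 4]), 2)
scattered = positions_of(pd.Series([1, 1, 2, 1, 3, 2, 4]), 2)
unique = positions_of(pd.Series([1, 4, 0, 7, 5]), 7)
print(contiguous, scattered, unique)
```

These are positions, not index labels; with a non-default index they would still need to be mapped through series.index.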

It uses a hashtable internally, so lookups are fast:

In [7]: s = pd.Series(np.random.randint(0,10,10000))  # assumes numpy is imported as np

In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 μs per loop

In [12]: i = pd.Index(s)

In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 μs per loop

As Viktor points out, there is a one-time overhead to creating an index (it's incurred when you actually DO something with the index, e.g. checking is_unique):

In [2]: s = pd.Series(np.random.randint(0,10,10000))  # assumes numpy is imported as np

In [3]: %timeit pd.Index(s)
100000 loops, best of 3: 9.6 μs per loop

In [4]: %timeit pd.Index(s).is_unique
10000 loops, best of 3: 140 μs per loop
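
Given that one-time cost, it seems reasonable to build the Index once and reuse it when many values have to be looked up. Here is a small sketch under that assumption; the data is random but unique, and the names lookup and wanted are purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.permutation(10_000))  # unique values 0..9999 in random order

lookup = pd.Index(s)  # built once; the hashtable is created on first use
wanted = (5, 42, 9_999)
# Each get_loc call reuses the same hashtable instead of rescanning the Series.
positions = [lookup.get_loc(v) for v in wanted]

for v, p in zip(wanted, positions):
    print(v, "is at position", p)
```

With unique values get_loc returns a plain integer position, so the results can be fed straight to s.iloc.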

Answer by Alex Spangher

Another way to do this, although equally unsatisfying, is:

s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])

list(s).index(7)

returns: 3

On time tests using a current dataset I'm working with (consider it random):

In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 μs per loop

In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 μs per loop


In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 μs per loop

Answer by Alon

In [92]: (myseries==7).argmax()
Out[92]: 3

This works if you know 7 is there in advance. You can check this with (myseries==7).any()
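
The reason for the check is that argmax on an all-False mask still returns 0, which looks like a valid hit. Below is a minimal sketch of a guarded version. It works on the underlying NumPy array so the result is always an integer position, and the helper name is made up for illustration.

```python
import pandas as pd

def first_position_or_none(series, value):
    """Return the integer position of the first match, or None if absent."""
    mask = (series == value).to_numpy()
    if not mask.any():
        return None  # without this guard, argmax would silently return 0
    return int(mask.argmax())  # argmax of a boolean array is the first True

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
hit = first_position_or_none(myseries, 7)
miss = first_position_or_none(myseries, 10)
unguarded = int((myseries == 10).to_numpy().argmax())  # misleading 0
print(hit, miss, unguarded)
```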

Another approach (very similar to the first answer) that also accounts for multiple 7s (or none) is:

In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']

Answer by Alex

If you use numpy, you can get an array of the indices where your value is found:

import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)

This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:

(array([3], dtype=int64),)
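
Note that np.where reports integer positions, which match the index labels here only because the index is 0..4. A short sketch of mapping positions back to labels, assuming a string index chosen just for illustration:

```python
import numpy as np
import pandas as pd

myseries = pd.Series([1, 7, 0, 7, 5], index=['a', 'b', 'c', 'd', 'e'])

# np.where on the boolean array gives zero-based positions of the matches.
positions = np.where((myseries == 7).to_numpy())[0]
# Indexing the Index with those positions recovers the labels.
labels = myseries.index[positions].tolist()

print(positions.tolist(), labels)
```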

Answer by Raki Gade

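You can use Series.idxmax(). Note that it returns the index label of the maximum value, so it finds 7 in this example only because 7 is the largest element.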

>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
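
To search for an arbitrary value rather than the maximum, idxmax can be applied to a boolean mask instead, since the first True is the first maximum. This is only a sketch; first_label is a hypothetical helper and the string index is an assumption made to keep labels and positions distinct.

```python
import pandas as pd

def first_label(series, value):
    """Return the index label of the first element equal to value, or None."""
    mask = series == value
    # Guard with any(): on an all-False mask idxmax would still return a label.
    return mask.idxmax() if mask.any() else None

s = pd.Series([1, 4, 0, 7, 5], index=['a', 'b', 'c', 'd', 'e'])
label_of_max = s.idxmax()       # label of the largest value (7)
label_of_4 = first_label(s, 4)  # label of a value that is not the maximum
label_of_10 = first_label(s, 10)
print(label_of_max, label_of_4, label_of_10)
```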

Answer by Ulf Aslak

Often your value occurs at multiple indices:

>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')

Answer by Bill

I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.

Here are the speed tests on a 2013 MacBook Pro in Python 3.7 with Pandas version 0.25.3.

In [1]: import pandas as pd                                                

In [2]: import numpy as np                                                 

In [3]: data = [406400, 203200, 101600,  76100,  50800,  25400,  19050,  12700, 
   ...:          9500,   6700,   4750,   3350,   2360,   1700,   1180,    850, 
   ...:           600,    425,    300,    212,    150,    106,     75,     53, 
   ...:            38]                                                                               

In [4]: myseries = pd.Series(data, index=range(1,26))                                                

In [5]: myseries[21]                                                                                 
Out[5]: 150

In [7]: %timeit myseries[myseries == 150].index[0]                                                   
416 μs ± 5.05 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit myseries[myseries == 150].first_valid_index()                                        
585 μs ± 32.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit myseries.where(myseries == 150).first_valid_index()                                  
652 μs ± 23.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit myseries.index[np.where(myseries == 150)[0][0]]                                     
195 μs ± 1.18 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit pd.Series(myseries.index, index=myseries)[150]                 
178 μs ± 9.35 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]                                    
77.4 μs ± 1.41 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit myseries.index[list(myseries).index(150)]
12.7 μs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit myseries.index[myseries.tolist().index(150)]                   
9.46 μs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
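
For readers who want to rerun part of this comparison outside IPython, here is a rough sketch using the standard timeit module on the same data. It covers only four of the methods, the dictionary names are illustrative, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
        9500, 6700, 4750, 3350, 2360, 1700, 1180, 850,
        600, 425, 300, 212, 150, 106, 75, 53, 38]
myseries = pd.Series(data, index=range(1, 26))

# Each candidate returns the index label whose value is 150.
candidates = {
    "boolean mask": lambda: myseries[myseries == 150].index[0],
    "np.where": lambda: myseries.index[np.where(myseries == 150)[0][0]],
    "Index.get_loc": lambda: myseries.index[pd.Index(myseries).get_loc(150)],
    "tolist().index": lambda: myseries.index[myseries.tolist().index(150)],
}

# Check that the methods agree before comparing their speed.
results = {name: fn() for name, fn in candidates.items()}

runs = 200
timings = {name: timeit.timeit(fn, number=runs) / runs
           for name, fn in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>15}: {seconds * 1e6:8.1f} μs per call")
```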

@Jeff's answer seems to be the fastest - although it doesn't handle duplicates.

Correction: Sorry, I missed one, @Alex Spangher's solution using the list index method is by far the fastest.

Update: Added @EliadL's answer.

Hope this helps.

Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.

Answer by rmutalik

Another way to do it that hasn't been mentioned yet is the tolist method:

myseries.tolist().index(7)

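This returns the zero-based position of the first match, which equals the index label only when the Series has a default integer index; it raises ValueError if the value is not in the Series.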
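
A small sketch of how the list-based lookup might be made safe for a missing value and for a non-default index follows. The function name first_label_via_list and the string index are assumptions for illustration only.

```python
import pandas as pd

def first_label_via_list(series, value):
    """Return the index label of the first match using list.index, or None."""
    try:
        position = series.tolist().index(value)  # zero-based position
    except ValueError:
        return None  # list.index raises ValueError when the value is absent
    return series.index[position]  # translate the position into a label

myseries = pd.Series([1, 7, 0, 7, 5], index=['a', 'b', 'c', 'd', 'e'])
present = first_label_via_list(myseries, 7)
absent = first_label_via_list(myseries, 10)
print(present, absent)
```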

Answer by EliadL

This is the most native and scalable approach I could find:

>>> myindex = pd.Series(myseries.index, index=myseries)

>>> myindex[7]
3

>>> myindex[[7, 5, 7]]
7    3
5    4
7    3
dtype: int64
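
As a self-contained sketch of this reverse-lookup idea, the snippet below uses .loc to make the label-based lookup explicit and also shows what happens when the searched value occurs more than once. The variable names other than myseries and myindex are illustrative.

```python
import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])

# Swap roles: the original values become the index, the original labels the data.
myindex = pd.Series(myseries.index, index=myseries)
single = myindex.loc[7]                       # one label for a unique value
several = myindex.loc[[7, 5, 7]].tolist()     # several lookups in one call

# With duplicates, a scalar lookup returns every matching label as a Series.
dup_series = pd.Series([1, 7, 0, 7, 5], index=[0, 1, 2, 3, 4])
dup_index = pd.Series(dup_series.index, index=dup_series)
all_sevens = dup_index.loc[7].tolist()

print(single, several, all_sevens)
```

A lookup for a value that is not present raises KeyError, so callers that expect misses would need to catch it.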