Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must keep the same CC BY-SA license and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/30112202/

Time: 2020-08-19 07:58:30  Source: igfitidea

How do I find the closest values in a Pandas series to an input number?

Tags: python, pandas, dataframe, ranking

Asked by Steve

I have seen:

These relate to vanilla python and not pandas.

If I have the series:

ix   num  
0    1
1    6
2    4
3    5
4    2

And I input 3, how can I (efficiently) find:

  1. The index of 3 if it is found in the series
  2. The indexes of the values just below and just above 3 if it is not found in the series.

I.e., with the above series {1,6,4,5,2} and input 3, I should get values (4,2) with indexes (2,4).

Answered by Zero

You could use argsort(), like this:

Say, input = 3

In [198]: input = 3

In [199]: df.iloc[(df['num']-input).abs().argsort()[:2]]
Out[199]:
   num
2    4
4    2

df_sort is the dataframe with the 2 closest values.

In [200]: df_sort = df.iloc[(df['num']-input).abs().argsort()[:2]]

For index,

In [201]: df_sort.index.tolist()
Out[201]: [2, 4]

For values,

In [202]: df_sort['num'].tolist()
Out[202]: [4, 2]


For reference, df in the above solution was:

In [197]: df
Out[197]:
   num
0    1
1    6
2    4
3    5
4    2
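
The same idea can be wrapped in a small helper. This is only a sketch: the function name closest_rows and its parameters are made up for illustration and are not part of pandas. It converts the distances to a NumPy array before calling argsort, so iloc receives plain positions, and it asks for a stable sort so that ties keep their original row order.

```python
import pandas as pd

def closest_rows(df, col, target, k=2):
    """Return the k rows of df whose df[col] is closest to target (hypothetical helper)."""
    # Absolute distance of every value to the target
    distance = (df[col] - target).abs()
    # Positions ordered by distance; a stable sort keeps row order among ties
    order = distance.to_numpy().argsort(kind="stable")[:k]
    return df.iloc[order]

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
nearest = closest_rows(df, "num", 3)
print(nearest.index.tolist())   # [2, 4]
print(nearest["num"].tolist())  # [4, 2]
```

Note that this still sorts all N distances, so it costs roughly N log(N) per lookup.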

Answered by Zero

In addition to John Galt's answer, I recommend using iloc, since it works even with an unsorted integer index, whereas .ix first looks at the index labels.

df.iloc[(df['num']-input).abs().argsort()[:2]]
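
To see why positional indexing matters, here is a small sketch with a shuffled integer index. The index labels [10, 3, 7, 0, 5] are invented for this example. argsort yields positions, so they have to be consumed by iloc; a label-based lookup with the same numbers would search for index labels 2 and 4, which do not exist here.

```python
import pandas as pd

# Same values as in the question, but with a shuffled integer index
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]}, index=[10, 3, 7, 0, 5])
target = 3

# Positions (not labels) of the two closest values
positions = (df["num"] - target).abs().to_numpy().argsort(kind="stable")[:2]

by_position = df.iloc[positions]
print(by_position.index.tolist())   # [7, 5]
print(by_position["num"].tolist())  # [4, 2]
```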

Answered by kztd

If your series is already sorted, you could use something like this.

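import pandas as pd

def closest(df, col, val, direction):
    # In a sorted column, the count of values <= val is the position
    # of the first value greater than val
    n = len(df[df[col] <= val])
    if direction < 0:
        n -= 1  # step back to the last value <= val
    if n < 0 or n >= len(df):
        print('err - value outside range')
        return None
    # .ix was removed in pandas 1.0, so index by position with .iloc
    return df[col].iloc[n]

df = pd.DataFrame({'num': range(0, 10, 2)})
for find in range(-1, 2):
    lc = closest(df, 'num', find, -1)
    hc = closest(df, 'num', find, 1)
    print('Closest to {} is {}, lower and {}, higher.'.format(find, lc, hc))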


df:     num
    0   0
    1   2
    2   4
    3   6
    4   8
err - value outside range
Closest to -1 is None, lower and 0, higher.
Closest to 0 is 0, lower and 2, higher.
Closest to 1 is 0, lower and 2, higher.

Answered by Ivo Merchiers

Apart from not completely answering the question, an extra disadvantage of the other algorithms discussed here is that they have to sort the entire list. This results in a complexity of ~N log(N).

However, it is possible to achieve the same results in ~N. This approach separates the dataframe into two subsets, one smaller and one larger than the desired value. The lower neighbour is then the largest value in the smaller dataframe, and vice versa for the upper neighbour.

This gives the following code snippet:

def find_neighbours(df, value):
    # Rows whose value equals the input exactly
    exactmatch = df[df.num == value]
    if not exactmatch.empty:
        return exactmatch.index
    else:
        # Largest value below and smallest value above the input
        lowerneighbour_ind = df[df.num < value].num.idxmax()
        upperneighbour_ind = df[df.num > value].num.idxmin()
        return [lowerneighbour_ind, upperneighbour_ind]

This approach is similar to using partition in pandas, which can be really useful when dealing with large datasets where complexity becomes an issue.
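
As a self-contained illustration of the partitioning idea, the sketch below also covers the case where the input lies outside the range of the data, which makes idxmax or idxmin fail on an empty subset. The name find_neighbours_safe and the col parameter are assumptions added for this example, not part of the original answer.

```python
import pandas as pd

def find_neighbours_safe(df, col, value):
    """Index labels of exact matches, else (lower, upper) neighbour labels.

    A neighbour that does not exist is reported as None.
    """
    exact = df[df[col] == value]
    if not exact.empty:
        return exact.index.tolist()
    lower = df.loc[df[col] < value, col]
    upper = df.loc[df[col] > value, col]
    # idxmax/idxmin raise on an empty Series, so check for emptiness first
    lower_idx = lower.idxmax() if not lower.empty else None
    upper_idx = upper.idxmin() if not upper.empty else None
    return lower_idx, upper_idx

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
lower_idx, upper_idx = find_neighbours_safe(df, "num", 3)  # labels 4 and 2
matches = find_neighbours_safe(df, "num", 4)               # exact match at label 2
below_range = find_neighbours_safe(df, "num", 0)           # no lower neighbour
```

Each lookup makes a few linear passes over the column and never sorts it.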

Comparing both strategies shows that for large N, the partitioning strategy is indeed faster. For small N, the sorting strategy will be more efficient, as it is implemented at a much lower level. It is also a one-liner, which might increase code readability.

[Figure: Comparison of partitioning vs sorting]

The code to replicate this plot can be seen below:

from matplotlib import pyplot as plt
import pandas
import numpy
import timeit

value=3
sizes=numpy.logspace(2, 5, num=50, dtype=int)

sort_results, partition_results=[],[]
for size in sizes:
    df=pandas.DataFrame({"num":100*numpy.random.random(size)})

    sort_results.append(timeit.Timer("df.iloc[(df['num']-value).abs().argsort()[:2]].index",
                                         globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())
    partition_results.append(timeit.Timer('find_neighbours(df,value)',
                                          globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())

sort_time=[time/amount for amount,time in sort_results]
partition_time=[time/amount for amount,time in partition_results]

plt.plot(sizes, sort_time)
plt.plot(sizes, partition_time)
plt.legend(['Sorting','Partitioning'])
plt.title('Comparison of strategies')
plt.xlabel('Size of Dataframe')
plt.ylabel('Time in s')
plt.savefig('speed_comparison.png')

Answered by Maxence Bouvier

If the series is already sorted, an efficient method of finding the indexes is to use the bisect functions. An example:

idx = bisect_left(df['num'].values, 3)

Let's consider that the column col of the dataframe df is sorted.

  • In the case where the value val is in the column, bisect_left will return the precise index of the value in the list and bisect_right will return the index of the next position.
  • In the case where the value is not in the list, both bisect_left and bisect_right will return the same index: the one where the value should be inserted to keep the list sorted.

Hence, to answer the question, the following code gives the index of val in col if it is found, and the indexes of the closest values otherwise. This solution works even when the values in the list are not unique.

from bisect import bisect_left, bisect_right

def get_closests(df, col, val):
    lower_idx = bisect_left(df[col].values, val)
    higher_idx = bisect_right(df[col].values, val)
    if higher_idx == lower_idx:      # val is not in the list
        return lower_idx - 1, lower_idx
    else:                            # val is in the list
        return lower_idx

Bisect algorithms are very efficient at finding the index of the specific value "val" in the dataframe column "col", or of its closest neighbours, but they require the list to be sorted.
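
A runnable sketch of the bisect approach on a sorted column with a repeated value follows. The function name closest_positions is an assumption for this example. Unlike the function above, it returns every position of a repeated value and drops neighbours that would fall outside the column.

```python
from bisect import bisect_left, bisect_right

import pandas as pd

def closest_positions(df, col, val):
    """Positions of val in the sorted column, or of its existing neighbours."""
    values = df[col].to_numpy()
    lo = bisect_left(values, val)
    hi = bisect_right(values, val)
    if lo != hi:
        # val is present and occupies positions lo .. hi - 1
        return list(range(lo, hi))
    # val is absent: keep only the neighbours that actually exist
    return [p for p in (lo - 1, lo) if 0 <= p < len(values)]

df = pd.DataFrame({"num": [1, 2, 4, 4, 5, 6]})
print(closest_positions(df, "num", 3))  # [1, 2]
print(closest_positions(df, "num", 4))  # [2, 3]
print(closest_positions(df, "num", 0))  # [0]
```

The returned numbers are positions, so they should be used with iloc rather than with index labels.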

Answered by Jim Hunziker

You can use numpy.searchsorted. If your search column is not already sorted, you can make a DataFrame that is sorted and remember the mapping between the two with argsort (for example Series.argsort). (This is better than the above methods if you plan on finding the closest value more than once.)

Once it's sorted, find the closest values for your inputs like this:

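import numpy as np

# df['column'] must already be sorted; results are positions in 0..len(df)
indLeft = np.searchsorted(df['column'], input, side='left')
indRight = np.searchsorted(df['column'], input, side='right')

# Index by position with .iloc; this raises IndexError when a position
# equals len(df), i.e. when input is larger than every value
valLeft = df['column'].iloc[indLeft]
valRight = df['column'].iloc[indRight]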
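
The sort-once, query-many idea for an unsorted column could be sketched as follows. The names order, sorted_values and neighbours are assumptions for this example. The column is sorted a single time, and the argsort result maps each sorted position back to its original position, so every later lookup is a binary search.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column": [1, 6, 4, 5, 2]})
values = df["column"].to_numpy()

# order[i] is the original position of the i-th smallest value
order = np.argsort(values, kind="stable")
sorted_values = values[order]

def neighbours(target):
    """Original positions of the values strictly below and strictly above target."""
    left = np.searchsorted(sorted_values, target, side="left")
    right = np.searchsorted(sorted_values, target, side="right")
    below = int(order[left - 1]) if left > 0 else None
    above = int(order[right]) if right < len(order) else None
    return below, above

print(neighbours(3))  # (4, 2)
print(neighbours(7))  # (1, None)
```

Here the dataframe has a default 0..n-1 index, so positions and index labels coincide; with another index, the positions would need to go through iloc.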