Python: How do I find the closest values in a Pandas series to an input number?
Original question: http://stackoverflow.com/questions/30112202/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How do I find the closest values in a Pandas series to an input number?
Asked by Steve
I have seen:
- how do I find the closest value to a given number in an array?
- How do I find the closest array element to an arbitrary (non-member) number?
These relate to vanilla Python and not pandas.
If I have the series:
ix num
0 1
1 6
2 4
3 5
4 2
And I input 3, how can I (efficiently) find:
- The index of 3 if it is found in the series
- The index of the value below and above 3 if it is not found in the series.
I.e., with the above series {1,6,4,5,2} and input 3, I should get values (4,2) with indexes (2,4).
Answered by Zero
You could use argsort(), like this.
Say, input = 3:
In [198]: input = 3
In [199]: df.iloc[(df['num']-input).abs().argsort()[:2]]
Out[199]:
num
2 4
4 2
df_sort is the dataframe with the 2 closest values:
In [200]: df_sort = df.iloc[(df['num']-input).abs().argsort()[:2]]
For index,
In [201]: df_sort.index.tolist()
Out[201]: [2, 4]
For values,
In [202]: df_sort['num'].tolist()
Out[202]: [4, 2]
For reference, the df used in the above solution was:
In [197]: df
Out[197]:
num
0 1
1 6
2 4
3 5
4 2
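The session above can be reproduced as a standalone script. This is a minimal sketch: the names `target`, `order` and `closest` are chosen here for illustration (avoiding the built-in name `input`), and a stable sort is requested so that ties in distance keep their original row order.

```python
import pandas as pd

# The series from the question, as a one-column DataFrame.
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
target = 3

# Absolute distance of every row to the target, then the positions
# that would sort those distances (stable, so ties keep row order).
distances = (df["num"] - target).abs().to_numpy()
order = distances.argsort(kind="stable")

# The two rows with the smallest distance.
closest = df.iloc[order[:2]]
print(closest.index.tolist())   # [2, 4]
print(closest["num"].tolist())  # [4, 2]
```

Note that this selects the two values with the smallest absolute distance, which need not be one value below and one above the target.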
Answered by kztd
If your series is already sorted, you could use something like this.
import pandas as pd

def closest(df, col, val, direction):
    # Assumes df[col] is sorted ascending; n counts the values <= val.
    n = len(df[df[col] <= val])
    if direction < 0:
        n -= 1  # step back to the last value <= val
    if n < 0 or n >= len(df):
        print('err - value outside range')
        return None
    return df[col].iloc[n]  # positional lookup (.ix was removed in pandas 1.0)

df = pd.DataFrame({'num': range(0, 10, 2)})
for find in range(-1, 2):
    lc = closest(df, 'num', find, -1)
    hc = closest(df, 'num', find, 1)
    print('Closest to {} is {}, lower and {}, higher.'.format(find, lc, hc))
df: num
0 0
1 2
2 4
3 6
4 8
err - value outside range
Closest to -1 is None, lower and 0, higher.
Closest to 0 is 0, lower and 2, higher.
Closest to 1 is 0, lower and 2, higher.
Answered by Ivo Merchiers
Apart from not completely answering the question, an extra disadvantage of the other algorithms discussed here is that they have to sort the entire list. This results in a complexity of ~N log(N).
However, it is possible to achieve the same results in ~N. This approach separates the dataframe into two subsets, one smaller and one larger than the desired value. The lower neighbour is then the largest value in the smaller dataframe, and vice versa for the upper neighbour.
This gives the following code snippet:
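def find_neighbours(df, value):
    # df is passed in explicitly so the function does not rely on a global.
    exactmatch = df[df.num == value]
    if not exactmatch.empty:
        return exactmatch.index
    else:
        # Largest value below and smallest value above the target.
        lowerneighbour_ind = df[df.num < value].num.idxmax()
        upperneighbour_ind = df[df.num > value].num.idxmin()
        return [lowerneighbour_ind, upperneighbour_ind]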
This approach is similar to using partition in pandas, which can be really useful when dealing with large datasets and complexity becomes an issue.
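As a usage illustration of the linear-time idea, here is a hedged sketch on the question's data. The function name `find_neighbours_safe`, the `col` parameter and the `None` placeholders for a missing neighbour are assumptions added here, not part of the answer; they guard the case where the target lies outside the range of the column.

```python
import pandas as pd

def find_neighbours_safe(df, value, col="num"):
    """Return index labels of exact matches, else [lower, upper] neighbours."""
    exact = df.index[df[col] == value]
    if len(exact) > 0:
        return exact.tolist()
    lower = df.loc[df[col] < value, col]   # all values below the target
    upper = df.loc[df[col] > value, col]   # all values above the target
    # idxmax/idxmin fail on an empty selection, so fall back to None.
    lower_ind = lower.idxmax() if not lower.empty else None
    upper_ind = upper.idxmin() if not upper.empty else None
    return [lower_ind, upper_ind]

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
print(find_neighbours_safe(df, 3))   # [4, 2]  (values 2 and 4)
print(find_neighbours_safe(df, 4))   # [2]     (exact match)
print(find_neighbours_safe(df, 0))   # [None, 0]
```

Each call makes a constant number of passes over the column, so no sort is needed.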
Comparing both strategies shows that for large N, the partitioning strategy is indeed faster. For small N, the sorting strategy will be more efficient, as it is implemented at a much lower level. It is also a one-liner, which might increase code readability.
The code to replicate this plot can be seen below:
from matplotlib import pyplot as plt
import pandas
import numpy
import timeit
value=3
sizes=numpy.logspace(2, 5, num=50, dtype=int)
sort_results, partition_results=[],[]
for size in sizes:
    df=pandas.DataFrame({"num":100*numpy.random.random(size)})
    sort_results.append(timeit.Timer("df.iloc[(df['num']-value).abs().argsort()[:2]].index",
                                     globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())
    partition_results.append(timeit.Timer('find_neighbours(df,value)',
                                          globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())
sort_time=[time/amount for amount,time in sort_results]
partition_time=[time/amount for amount,time in partition_results]
plt.plot(sizes, sort_time)
plt.plot(sizes, partition_time)
plt.legend(['Sorting','Partitioning'])
plt.title('Comparison of strategies')
plt.xlabel('Size of Dataframe')
plt.ylabel('Time in s')
plt.savefig('speed_comparison.png')
Answered by Maxence Bouvier
If the series is already sorted, an efficient method of finding the indexes is by using bisect functions. An example:
idx = bisect_left(df['num'].values, 3)
Let's consider that the column col of the dataframe df is sorted.
- In the case where the value val is in the column, bisect_left will return the precise index of the value in the list and bisect_right will return the index of the next position.
- In the case where the value is not in the list, both bisect_left and bisect_right will return the same index: the one where to insert the value to keep the list sorted.
Hence, to answer the question, the following code gives the index of val in col if it is found, and the indexes of the closest values otherwise. This solution works even when the values in the list are not unique.
from bisect import bisect_left, bisect_right

def get_closests(df, col, val):
    lower_idx = bisect_left(df[col].values, val)
    higher_idx = bisect_right(df[col].values, val)
    if higher_idx == lower_idx:  # val is not in the list
        return lower_idx - 1, lower_idx
    else:  # val is in the list
        return lower_idx
Bisect algorithms are very efficient at finding the index of the specific value "val" in the dataframe column "col", or its closest neighbours, but they require the list to be sorted.
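To make the behaviour of bisect_left and bisect_right concrete, here is a small sketch on a sorted copy of the question's data. The helper name `neighbour_positions` and the `None` placeholders at the boundaries are assumptions introduced for illustration; the results are positions in the sorted column, not labels of the original unsorted frame.

```python
from bisect import bisect_left, bisect_right
import pandas as pd

# The column must already be sorted for bisect to be valid.
df = pd.DataFrame({"num": [1, 2, 4, 5, 6]})
values = df["num"].to_numpy()

def neighbour_positions(values, val):
    lo = bisect_left(values, val)    # first position with values[pos] >= val
    hi = bisect_right(values, val)   # first position with values[pos] > val
    if lo != hi:
        # val is present: positions lo .. hi-1 all hold val.
        return list(range(lo, hi))
    # val is absent: neighbours sit on either side of the insertion point.
    below = lo - 1 if lo > 0 else None
    above = lo if lo < len(values) else None
    return [below, above]

print(neighbour_positions(values, 3))  # [1, 2]  (values 2 and 4)
print(neighbour_positions(values, 4))  # [2]
print(neighbour_positions(values, 0))  # [None, 0]
```

Each lookup is a binary search, so it costs on the order of log N comparisons once the data is sorted.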
Answered by Jim Hunziker
You can use numpy.searchsorted. If your search column is not already sorted, you can make a DataFrame that is sorted and remember the mapping between them with pandas' argsort (Series.argsort). (This is better than the above methods if you plan on finding the closest value more than once.)
Once it's sorted, find the closest values for your inputs like this:
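# Position of the first value >= input (equals len(df) if every value is smaller)
indLeft = np.searchsorted(df['column'], input, side='left')
# Position of the first value > input (equals len(df) if no value is larger)
indRight = np.searchsorted(df['column'], input, side='right')
# Positional lookups; the nearest value below input, if any, sits at indLeft - 1
valLeft = df['column'].iloc[indLeft]
valRight = df['column'].iloc[indRight]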
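The sort-once, search-many idea can be sketched end to end as follows. This is an illustration under assumptions, not the answer's exact code: the names `order`, `sorted_vals` and `closest_labels` are invented here, and the positions are clamped so that a target outside the data range returns only the single nearest end.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})

# Sort once: `order` maps each sorted position back to its original row position.
raw = df["num"].to_numpy()
order = raw.argsort(kind="stable")
sorted_vals = raw[order]

def closest_labels(val):
    left = np.searchsorted(sorted_vals, val, side="left")
    right = np.searchsorted(sorted_vals, val, side="right")
    if left != right:
        pos = order[left:right]               # exact matches
    else:
        # Neighbours around the insertion point, clamped to the array bounds.
        pos = order[max(left - 1, 0):min(left + 1, len(order))]
    return df.index[pos].tolist()

print(closest_labels(3))   # [4, 2]
print(closest_labels(4))   # [2]
```

After the single sort, each query is a binary search, which is where the benefit for repeated lookups comes from.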