快速Haversine近似（Python/Pandas）

Question

提问by Nyxynyx

Each row in a Pandas dataframe contains lat/lng coordinates of 2 points. Using the Python code below, calculating the distances between these 2 points for many (millions) of rows takes a very long time!

Pandas 数据框中的每一行都包含 2 个点的经纬度坐标。使用下面的 Python 代码，为许多（数百万）行计算这 2 个点之间的距离需要很长时间！

Considering that the 2 points are under 50 miles apart and accuracy is not very important, is it possible to make the calculation faster?

考虑到2个点相距不到50英里，精度不是很重要，是否可以使计算速度更快？

from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km


for index, row in df.iterrows():
    df.loc[index, 'distance'] = haversine(row['a_longitude'], row['a_latitude'], row['b_longitude'], row['b_latitude'])

Answer 1

回答by derricw

Here is a vectorized numpy version of the same function:

这是相同函数的矢量化 numpy 版本：

import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

The inputs are all arrays of values, and it should be able to do millions of points instantly. The requirement is that the inputs are ndarrays but the columns of your pandas table will work.

输入都是值的数组，它应该能够立即处理数百万个点。要求是输入是 ndarrays，但您的 Pandas 表的列将起作用。

For example, with randomly generated values:

例如，使用随机生成的值：

>>> import numpy as np
>>> import pandas
>>> lon1, lon2, lat1, lat2 = np.random.randn(4, 1000000)
>>> df = pandas.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2})
>>> km = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])

Or if you want to create another column:

或者，如果您想创建另一列：

>>> df['distance'] = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])

Looping through arrays of data is very slow in python. Numpy provides functions that operate on entire arrays of data, which lets you avoid looping and drastically improve performance.

在 python 中循环数据数组非常慢。Numpy 提供了对整个数据数组进行操作的函数，可让您避免循环并显着提高性能。

This is an example of vectorization.

这是矢量化的一个例子。

Answer 2

回答by ely

Purely for the sake of an illustrative example, I took the numpyversion in the answer from @ballsdotballs and also made a companion C implementation to be called via ctypes. Since numpyis such a highly optimized tool, there is little chance that my C code will be as efficient, but it should be somewhat close. The big advantage here is that by running through an example with C types, it can help you see how you can connect up your own personal C functions to Python without too much overhead. This is especially nice when you just want to optimize a small piece of a bigger computation by writing that small piece in some C source rather than Python. Simply using numpywill solve the problem most of the time, but for those cases when you don't really need all of numpyand you don't want to add the coupling to require use of numpydata types throughout some code, it's very handy to know how to drop down to the built-in ctypeslibrary and do it yourself.

纯粹为了说明性示例，我numpy从@ballsdotballs 的答案中获取了版本，并且还制作了一个配套的 C 实现以通过调用ctypes。既然numpy是这样一个高度优化的工具，我的 C 代码几乎不可能有同样的效率，但应该有点接近。这里的一大优势是，通过运行 C 类型的示例，它可以帮助您了解如何将您自己的个人 C 函数连接到 Python，而无需太多开销。当您只想通过在某些 C 源代码而不是 Python 中编写一小部分来优化较大计算的一小部分时，这尤其好。numpy大多数情况下，简单地使用将解决问题，但对于那些您并不真正需要所有numpy并且您不想numpy在某些代码中添加耦合以要求使用数据类型，知道如何下拉到内置ctypes库并自己完成是非常方便的。

First let's create our C source file, called haversine.c:

首先让我们创建我们的 C 源文件，称为haversine.c：

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int haversine(size_t n, 
              double *lon1, 
              double *lat1, 
              double *lon2, 
              double *lat2,
              double *kms){

    if (   lon1 == NULL 
        || lon2 == NULL 
        || lat1 == NULL 
        || lat2 == NULL
        || kms == NULL){
        return -1;
    }

    double km, dlon, dlat;
    double iter_lon1, iter_lon2, iter_lat1, iter_lat2;

    double km_conversion = 2.0 * 6367.0; 
    double degrees2radians = 3.14159/180.0;

    int i;
    for(i=0; i < n; i++){
        iter_lon1 = lon1[i] * degrees2radians;
        iter_lat1 = lat1[i] * degrees2radians;
        iter_lon2 = lon2[i] * degrees2radians;
        iter_lat2 = lat2[i] * degrees2radians;

        dlon = iter_lon2 - iter_lon1;
        dlat = iter_lat2 - iter_lat1;

        km = pow(sin(dlat/2.0), 2.0) 
           + cos(iter_lat1) * cos(iter_lat2) * pow(sin(dlon/2.0), 2.0);

        kms[i] = km_conversion * asin(sqrt(km));
    }

    return 0;
}

// main function for testing
int main(void) {
    double lat1[2] = {16.8, 27.4};
    double lon1[2] = {8.44, 1.23};
    double lat2[2] = {33.5, 20.07};
    double lon2[2] = {14.88, 3.05};
    double kms[2]  = {0.0, 0.0};
    size_t arr_size = 2;

    int res;
    res = haversine(arr_size, lon1, lat1, lon2, lat2, kms);
    printf("%d\n", res);

    int i;
    for (i=0; i < arr_size; i++){
        printf("%3.3f, ", kms[i]);
    }
    printf("\n");
}

Note that we're trying to keep with C conventions. Explicitly passing data arguments by reference, using size_tfor a size variable, and expecting our haversinefunction to work by mutating one of the passed inputs such that it will contain the expected data on exit. The function actually returns an integer, which is a success/failure flag that could be used by other C-level consumers of the function.

请注意，我们试图保持 C 的约定。通过引用显式传递数据参数，size_t用于大小变量，并期望我们的haversine函数通过改变传递的输入之一来工作，以便它在退出时包含预期的数据。该函数实际上返回一个整数，这是一个成功/失败标志，该函数的其他 C 级使用者可以使用该标志。

We're going to need to find a way to handle all of these little C-specific issues inside of Python.

我们需要找到一种方法来处理 Python 内部所有这些与 C 相关的小问题。

Next let's put our numpyversion of the function along with some imports and some test data into a file called haversine.py:

接下来让我们将我们numpy的函数版本以及一些导入和一些测试数据放入一个名为的文件中haversine.py：

import time
import ctypes
import numpy as np
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = (np.sin(dlat/2)**2 
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)
    c = 2 * np.arcsin(np.sqrt(a)) 
    km = 6367 * c
    return km

if __name__ == "__main__":
    lat1 = 50.0 * np.random.rand(1000000)
    lon1 = 50.0 * np.random.rand(1000000)
    lat2 = 50.0 * np.random.rand(1000000)
    lon2 = 50.0 * np.random.rand(1000000)

    t0 = time.time()
    r1 = haversine(lon1, lat1, lon2, lat2)
    t1 = time.time()
    print t1-t0, r1

I chose to make lats and lons (in degrees) that are randomly chosen between 0 and 50, but it doesn't matter too much for this explanation.

我选择制作在 0 到 50 之间随机选择的 lats 和 lons（以度为单位），但对于这个解释来说并不重要。

The next thing we need to do is to compile our C module in such a way that it can be dynamically loaded by Python. I'm using a Linux system (you can find examples for other systems very easily on Google), so my goal is to compile haversine.cinto a shared object, like so:

接下来我们需要做的是编译我们的 C 模块，使其可以被 Python 动态加载。我使用的是 Linux 系统（您可以在 Google 上很容易地找到其他系统的示例），所以我的目标是编译haversine.c成一个共享对象，如下所示：

gcc -shared -o haversine.so -fPIC haversine.c -lm

We can also compile to an executable and run it to see what the C program's mainfunction displays:

我们也可以编译成一个可执行文件并运行它来查看 C 程序的main函数显示的内容：

> gcc haversine.c -o haversine -lm
> ./haversine
0
1964.322, 835.278,

Now that we have compiled the shared object haversine.so, we can use ctypesto load it in Python and we need to supply the path to the file to do so:

现在我们已经编译了共享对象haversine.so，我们可以使用ctypesPython 加载它，我们需要提供文件的路径来执行此操作：

lib_path = "/path/to/haversine.so" # Obviously use your real path here.
haversine_lib = ctypes.CDLL(lib_path)

Now haversine_lib.haversineacts pretty much just like a Python function, except that we might need to do some manual type marshaling to make sure the inputs and outputs are interpreted correctly.

现在的haversine_lib.haversine行为与 Python 函数非常相似，只是我们可能需要进行一些手动类型编组以确保正确解释输入和输出。

numpyactually provides some nice tools for this and the one I'll use here is numpy.ctypeslib. We're going to build a pointer typethat will allow us to pass around numpy.ndarraysto these ctypes-loaded functions as through they were pointers. Here's the code:

numpy实际上为此提供了一些不错的工具，我将在这里使用的是numpy.ctypeslib. 我们将构建一个指针类型，它允许我们像通过指针一样传递numpy.ndarrays给这些ctypes加载的函数。这是代码：

arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double, 
                                       ndim=1, 
                                       flags='CONTIGUOUS')

haversine_lib.haversine.restype = ctypes.c_int
haversine_lib.haversine.argtypes = [ctypes.c_size_t,
                                    arr_1d_double, 
                                    arr_1d_double,
                                    arr_1d_double,
                                    arr_1d_double,
                                    arr_1d_double]

Notice that we tell the haversine_lib.haversinefunction proxy to interpret its arguments according to the types we want.

请注意，我们告诉haversine_lib.haversine函数代理根据我们想要的类型解释它的参数。

Now, to test it out from Pythonwhat remains is to just make a size variable, and an array that will be mutated (just like in the C code) to contain the result data, then we can call it:

现在，要从 Python 中测试它，剩下的就是创建一个大小变量和一个将被改变的数组（就像在 C 代码中一样）以包含结果数据，然后我们可以调用它：

size = len(lat1)
output = np.empty(size, dtype=np.double)
print "====="
print output
t2 = time.time()
res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output)
t3 = time.time()
print t3 - t2, res
print type(output), output

Putting it all together in the __main__block of haversine.py, the whole file now looks like this:

将它们全部放在__main__块中haversine.py，整个文件现在看起来像这样：

import time
import ctypes
import numpy as np
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = (np.sin(dlat/2)**2 
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)
    c = 2 * np.arcsin(np.sqrt(a)) 
    km = 6367 * c
    return km

if __name__ == "__main__":
    lat1 = 50.0 * np.random.rand(1000000)
    lon1 = 50.0 * np.random.rand(1000000)
    lat2 = 50.0 * np.random.rand(1000000)
    lon2 = 50.0 * np.random.rand(1000000)

    t0 = time.time()
    r1 = haversine(lon1, lat1, lon2, lat2)
    t1 = time.time()
    print t1-t0, r1

    lib_path = "/home/ely/programming/python/numpy_ctypes/haversine.so"
    haversine_lib = ctypes.CDLL(lib_path)
    arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double, 
                                           ndim=1, 
                                           flags='CONTIGUOUS')

    haversine_lib.haversine.restype = ctypes.c_int
    haversine_lib.haversine.argtypes = [ctypes.c_size_t,
                                        arr_1d_double, 
                                        arr_1d_double,
                                        arr_1d_double,
                                        arr_1d_double,
                                        arr_1d_double]

    size = len(lat1)
    output = np.empty(size, dtype=np.double)
    print "====="
    print output
    t2 = time.time()
    res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output)
    t3 = time.time()
    print t3 - t2, res
    print type(output), output

To run it, which will run and time the Python and ctypesversions separately and print some results, we can just do

要运行它，它将分别运行和计时 Python 和ctypes版本并打印一些结果，我们可以这样做

python haversine.py

which displays:

其中显示：

0.111340045929 [  231.53695005  3042.84915093   169.5158946  ...,  1359.2656769
  2686.87895954  3728.54788207]
=====
[  6.92017600e-310   2.97780954e-316   2.97780954e-316 ...,
   3.20676686e-001   1.31978329e-001   5.15819721e-001]
0.148446083069 0
<type 'numpy.ndarray'> [  231.53675618  3042.84723579   169.51575588 ...,  1359.26453029
  2686.87709456  3728.54493339]

As expected, the numpyversion is slightly faster (0.11 seconds for vectors with length of 1 million) but our quick and dirty ctypesversion is no slouch: a respectable 0.148 seconds on the same data.

正如预期的那样，该numpy版本稍快（对于长度为 100 万的向量为 0.11 秒），但我们快速而肮脏的ctypes版本并没有懈怠：在相同数据上的时间为可观的 0.148 秒。

Let's compare this to a naive for-loop solution in Python:

让我们将其与 Python 中的简单 for 循环解决方案进行比较：

from math import radians, cos, sin, asin, sqrt

def slow_haversine(lon1, lat1, lon2, lat2):
    n = len(lon1)
    kms = np.empty(n, dtype=np.double)
    for i in range(n):
       lon1_v, lat1_v, lon2_v, lat2_v = map(
           radians, 
           [lon1[i], lat1[i], lon2[i], lat2[i]]
       )

       dlon = lon2_v - lon1_v 
       dlat = lat2_v - lat1_v 
       a = (sin(dlat/2)**2 
            + cos(lat1_v) * cos(lat2_v) * sin(dlon/2)**2)
       c = 2 * asin(sqrt(a)) 
       kms[i] = 6367 * c
    return kms

When I put this into the same Python file as the others and time it on the same million-element data, I consistently see a time of around 2.65 seconds on my machine.

当我将其放入与其他文件相同的 Python 文件中并在相同的百万元素数据上计时时，我始终在我的机器上看到大约 2.65 秒的时间。

So by quickly switching to ctypeswe improve the speed by a factor of about 18. For many calculations that can benefit from access to bare, contiguous data, you often see gains much higher even than this.

因此，通过快速切换到ctypes我们将速度提高了大约 18 倍。对于可以从访问裸露的连续数据中受益的许多计算，您经常会看到比这更高的收益。

Just to be super clear, I am not at all endorsing this as a better option than just using numpy. This is precisely the problem that numpywas built to solve, and so homebrewing your own ctypescode whenever it both (a) makes sense to incorporate numpydata types in your application and (b) there exists an easy way to map your code into a numpyequivalent, is not very efficient.

只是要非常清楚，我并不认为这是一个比仅使用numpy. 这正是numpy旨在解决的问题，因此，ctypes只要 (a) 将numpy数据类型合并到您的应用程序中是有意义的，并且 (b) 存在一种将您的代码映射到numpy等效代码的简单方法，就可以自制您自己的代码，是效率不高。

But it's still very helpful to know how to do this for those occasions when you prefer to write something in C yet call it in Python, or situations where a dependence on numpyis not practical (in an embedded system where numpycannot be installed, for example).

但是，当您更喜欢用 C 编写某些东西而在 Python 中调用它时，或者在依赖numpy不切实际的情况下（numpy例如，在无法安装的嵌入式系统中），知道如何执行此操作仍然非常有帮助。.

Answer 3

回答by Kraviz

In case that using scikit-learn is allowed, I would give the following a chance:

如果允许使用 scikit-learn，我会给以下一个机会：

from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')

# example data
lat1, lon1 = 36.4256345, -5.1510261
lat2, lon2 = 40.4165, -3.7026
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

X = [[lat1, lon1],
     [lat2, lon2]]
kms = 6367
print(kms * dist.pairwise(X))

Answer 4

回答by jpp

A trivial extension to @derricw's vectorised solution, you can use numbato improve performance by ~2x with virtually no change to your code. For pure numerical calculations, this should probably be used for benchmarking / testing versus possibly more efficient solutions.

@derricw向量化解决方案的一个微不足道的扩展，您可以使用它将 numba性能提高约 2倍，而几乎不更改您的代码。对于纯数值计算，这可能应该用于基准测试/测试，而不是可能更有效的解决方案。

from numba import njit

@njit
def haversine_nb(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = np.radians(lon1), np.radians(lat1), np.radians(lon2), np.radians(lat2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

Benchmarking versus the Pandas function:

基准测试与 Pandas 函数：

%timeit haversine_pd(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
# 1 loop, best of 3: 1.81 s per loop

%timeit haversine_nb(df['lon1'].values, df['lat1'].values, df['lon2'].values, df['lat2'].values)
# 1 loop, best of 3: 921 ms per loop

Full benchmarking code:

完整的基准测试代码：

import pandas as pd, numpy as np
from numba import njit

def haversine_pd(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

@njit
def haversine_nb(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = np.radians(lon1), np.radians(lat1), np.radians(lon2), np.radians(lat2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

np.random.seed(0)
lon1, lon2, lat1, lat2 = np.random.randn(4, 10**7)
df = pd.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2})
km = haversine_pd(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
km_nb = haversine_nb(df['lon1'].values, df['lat1'].values, df['lon2'].values, df['lat2'].values)

assert np.isclose(km.values, km_nb).all()

%timeit haversine_pd(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
# 1 loop, best of 3: 1.81 s per loop

%timeit haversine_nb(df['lon1'].values, df['lat1'].values, df['lon2'].values, df['lat2'].values)
# 1 loop, best of 3: 921 ms per loop

Answer 5

回答by chris

The vectorized function specifies that "All args must be of equal length". By extending the bounds of the "larger" dataset, according to this, one can efficiently find the distance all i,j pairs of elements.

向量化函数指定“所有参数的长度必须相等”。通过扩展“更大”数据集的边界，根据this，可以有效地找到所有 i,j 对元素的距离。

from random import uniform
import numpy as np

def new_haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1[:,None]

    dlat = lat2 - lat1[:,None]

    a = np.sin(dlat/2.0)**2 + np.cos(lat1[:,None]) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

lon1 = [uniform(-180,180) for n in range(6)]
lat1 = [uniform(-90, 90) for n in range(6)]
lon2 = [uniform(-180,180) for n in range(4)]
lat2 = [uniform(-90, 90) for n in range(4)]

new = new_haversine_np(lon1, lat1, lon2, lat2)

for i in range(6):
    for j in range(4):
        print(i,j,round(new[i,j],2))

Answer 6

回答by Clay

Some of these answers are "rounding" the radius of the earth. If you check these against other distance calculators (such as geopy), these functions will be off.

其中一些答案是“四舍五入”地球的半径。如果您对照其他距离计算器（例如geopy）检查这些，这些功能将被关闭。

You can switch out R=3959.87433for the conversion constant below if you want the answer in miles.

R=3959.87433如果您想要以英里为单位的答案，您可以切换到下面的转换常数。

If you want kilometers, use R= 6372.8.

如果您想要公里，请使用R= 6372.8.

lon1 = -103.548851
lat1 = 32.0004311
lon2 = -103.6041946
lat2 = 33.374939


def haversine(lat1, lon1, lat2, lon2):

      R = 3959.87433 # this is in miles.  For Earth radius in kilometers use 6372.8 km

      dLat = radians(lat2 - lat1)
      dLon = radians(lon2 - lon1)
      lat1 = radians(lat1)
      lat2 = radians(lat2)

      a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
      c = 2*asin(sqrt(a))

      return R * c

print(haversine(lat1, lon1, lat2, lon2))

快速Haversine近似（Python/Pandas）

提问by Nyxynyx

回答by derricw

回答by ely

回答by Kraviz

回答by jpp

回答by chris

回答by Clay

相关推荐

最近更新

标签

快速Haversine近似（Python/Pandas）

提问by Nyxynyx

回答by derricw

回答by ely

回答by Kraviz

回答by jpp

回答by chris

回答by Clay

相关推荐

Pandas：使用 groupby 获取每个数据类别的平均值

我需要你关于 python pandas 中 read_fwf 的帮助

Pandas 和 Matplotlib - fill_between() 与 datetime64

Pandas：按行号和列号访问数据框中的数据

相关推荐

最近更新

标签