在 Pandas 中向量化函数

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/27575854/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:46:43  来源:igfitidea点击:

Vectorizing a function in pandas

pythonpandasdataframe

提问by DataSwede

I have a dataframe that contains a list of lat/lon coordinates:

我有一个包含纬度/经度坐标列表的数据框:

d = {'Provider ID': {0: '10001',
  1: '10005',
  2: '10006',
  3: '10007',
  4: '10008',
  5: '10011',
  6: '10012',
  7: '10016',
  8: '10018',
  9: '10019'},
 'latitude': {0: '31.215379379000467',
  1: '34.22133455500045',
  2: '34.795039606000444',
  3: '31.292159523000464',
  4: '31.69311635000048',
  5: '33.595265517000485',
  6: '34.44060759100046',
  7: '33.254429322000476',
  8: '33.50314015000049',
  9: '34.74643089500046'},
 'longitude': {0: ' -85.36146587999968',
  1: ' -86.15937514799964',
  2: ' -87.68507485299966',
  3: ' -86.25539902199966',
  4: ' -86.26549483099967',
  5: ' -86.66531866799966',
  6: ' -85.75726760699968',
  7: ' -86.81407933399964',
  8: ' -86.80242858299965',
  9: ' -87.69893502799965'}}
df = pd.DataFrame(d)

My goal is to use the haversine function to figure out the distances between every item in KM:

我的目标是使用 hasrsine 函数来计算 KM 中每个项目之间的距离:

from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """

    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 

    # 6367 km is the radius of the Earth
    km = 6367 * c
    return km

My goal is to get a dataframe that looks like the result_df below where the values are the distance between each provider id:

我的目标是获得一个类似于下面的 result_df 的数据框,其中的值是每个提供者 ID 之间的距离:

 result_df = pd.DataFrame(columns = df['Provider ID'], index=df['Provider ID'])

I can do this in a loop, however it's terribly slow. I'm looking for some help in converting this to a vectorized method:

我可以循环执行此操作,但是速度非常慢。我正在寻找将其转换为矢量化方法的一些帮助:

for first_hospital_coordinates in result_df.columns:
    for second_hospital_coordinates in result_df['Provider ID']:
        if first_hospital_coordinates == 'Provider ID':
            pass
        else:
            L1 = df[df['Provider ID'] == first_hospital_coordinates]['latitude'].astype('float64').values
            O1 = df[df['Provider ID'] == first_hospital_coordinates]['longitude'].astype('float64').values
            L2 = df[df['Provider ID'] == second_hospital_coordinates]['latitude'].astype('float64').values
            O2 = df[df['Provider ID'] == second_hospital_coordinates]['longitude'].astype('float64').values

            distance = haversine(O1, L1, O2, L2)

            crit = result_df['Provider ID'] == second_hospital_coordinates
            result_df.loc[crit, first_hospital_coordinates] = distance

采纳答案by nitin

To vectorize this code, you will need to operate on complete dataframe and not on the individual lats and longs. I have made an attempt at this. I need the result df and a new function h2,

要矢量化此代码,您需要对完整的数据帧进行操作,而不是对单个经纬度进行操作。我对此进行了尝试。我需要结果 df 和一个新函数 h2,

import numpy as np
def h2(df, p):
    inrad = df.applymap(radians)
    dlon = inrad.longitude-inrad.longitude[p]
    dlat = inrad.latitude-inrad.latitude[p]
    lat1 = pd.Series(index = df.index, data = [df.latitude[p] for i in range(len(df.index))])
    a = np.sin(dlat/2)*np.sin(dlat/2) + np.cos(df.latitude) * np.cos(lat1) * np.sin(dlon/2)**2
    c = 2 * 1/np.sin(np.sqrt(a))
    km = 6367 * c
    return km

df = df.set_index('Provider ID')
df = df.astype(float)
df2 = pd.DataFrame(index = df.index, columns = df.index)
for c in df2.columns:
    df2[c] = h2(df, c)

print (df2)

This should yield, (I can't be sure if I have the correct answer... my goal was to vectorize the code)

这应该会产生,(我不确定我是否有正确的答案......我的目标是对代码进行矢量化)

Provider ID         10001         10005         10006         10007  \
Provider ID                                                           
10001                 inf  5.021936e+05  5.270062e+05  1.649088e+06   
10005        5.021936e+05           inf  9.294868e+05  4.985233e+05   
10006        5.270062e+05  9.294868e+05           inf  4.548412e+05   
10007        1.649088e+06  4.985233e+05  4.548412e+05           inf   
10008        1.460299e+06  5.777248e+05  5.246954e+05  3.638231e+06   
10011        6.723581e+05  2.004199e+06  1.027439e+06  6.394402e+05   
10012        4.559090e+05  3.265536e+06  7.573411e+05  4.694125e+05   
10016        7.680036e+05  1.429573e+06  9.105474e+05  7.517467e+05   
10018        7.096548e+05  1.733554e+06  1.020976e+06  6.701920e+05   
10019        5.436342e+05  9.278739e+05  2.891822e+07  4.638858e+05   

Provider ID         10008         10011         10012         10016  \
Provider ID                                                           
10001        1.460299e+06  6.723581e+05  4.559090e+05  7.680036e+05   
10005        5.777248e+05  2.004199e+06  3.265536e+06  1.429573e+06   
10006        5.246954e+05  1.027439e+06  7.573411e+05  9.105474e+05   
10007        3.638231e+06  6.394402e+05  4.694125e+05  7.517467e+05   
10008                 inf  7.766998e+05  5.401081e+05  9.496953e+05   
10011        7.766998e+05           inf  1.341775e+06  4.220911e+06   
10012        5.401081e+05  1.341775e+06           inf  1.119063e+06   
10016        9.496953e+05  4.220911e+06  1.119063e+06           inf   
10018        8.236437e+05  1.242451e+07  1.226941e+06  5.866259e+06   
10019        5.372119e+05  1.051748e+06  7.514774e+05  9.362341e+05   

Provider ID         10018         10019  
Provider ID                              
10001        7.096548e+05  5.436342e+05  
10005        1.733554e+06  9.278739e+05  
10006        1.020976e+06  2.891822e+07  
10007        6.701920e+05  4.638858e+05  
10008        8.236437e+05  5.372119e+05  
10011        1.242451e+07  1.051748e+06  
10012        1.226941e+06  7.514774e+05  
10016        5.866259e+06  9.362341e+05  
10018                 inf  1.048895e+06  
10019        1.048895e+06           inf  

[10 rows x 10 columns]

回答by Paul H

You don't need anything fancy, just a few mods to your function.

您不需要任何花哨的东西,只需对您的功能进行一些修改即可。

First, don't use the mathlibrary. If you're doing real math or science, you're likely better off with numpy.

首先,不要使用math图书馆。如果您正在做真正的数学或科学,那么使用 numpy.

Second, we're going to use the dataframe method apply. What applydoes is it takes a function and runs every row (axis=1) or column (axis=0) through it, and builds a new pandas object with all of the returned values. So we need to set up haversinetotake row of a dataframe and unpack the values. It becomes:

其次,我们将使用 dataframe 方法apply。什么apply所做的是它需要一个函数,并通过它运行中的每一行(轴= 1)或柱(轴= 0),并建立一个新的Pandas与所有返回的值的对象。所以我们需要设置haversinetotake row of a dataframe并解包值。它成为了:

def haversine(row):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    import numpy as np

    # convert all of the row to radians
    row = np.radians(row)

    # unpack the values for convenience
    lat1 = row['lat1']
    lat2 = row['lat2']
    lon1 = row['lon1']
    lon2 = row['lon2']

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 

    # 6367 km is the radius of the Earth
    km = 6367 * c
    return km

Ok, now we need to get your dataframe in shape. In your question, everything is a string and that's not good for doing math. So using your variable d, I said:

好的,现在我们需要让您的数据框成形。在你的问题中,一切都是一个字符串,这不利于数学。所以使用你的变量d,我说:

df = pandas.DataFrame(d).set_index('Provider ID').astype(float)

So that created the dataframe of strings, set the provider as the index, and then converted all of the columns to floats, since we're doing math.

这样就创建了字符串数据框,将提供者设置为索引,然后将所有列转换为浮点数,因为我们正在做数学运算。

Now we need to make rows with two sets of coords. For that we'll use the shiftmethod and join the result to the original dataframe. Doing that all at once looks like this:

现在我们需要用两组坐标制作行。为此,我们将使用该shift方法并将结果连接到原始数据帧。一次完成这一切看起来像这样:

df = df.join(df.shift(), lsuffix='1', rsuffix='2')
print(df.head())

                  lat1       lon1       lat2       lon2
Provider ID                                            
10001        31.215379 -85.361466        NaN        NaN
10005        34.221335 -86.159375  31.215379 -85.361466
10006        34.795040 -87.685075  34.221335 -86.159375
10007        31.292160 -86.255399  34.795040 -87.685075
10008        31.693116 -86.265495  31.292160 -86.255399

rsuffixand lsuffixare what append "1" and "2" to the column names during the join operation.

rsuffixlsuffix是在连接操作期间将“1”和“2”附加到列名的内容。

The "2" columns are from df.shift()and you'll notice that they are equal to the "1" columns of the previous row. Also you'll see that the first row of the "2" columns are NaNsince there's nothing previousto the first row.

“2”列来自df.shift(),您会注意到它们等于前一行的“1”列。您还会看到“2”列的第一行是NaN因为第一行之前没有任何内容 。

So now we can applythe Haversine function:

所以现在我们可以apply使用Haversine函数:

distance = df.apply(haversine, axis=1)
print(distance)
Provider ID
10001                 NaN
10005          342.261590
10006          153.567591
10007          411.393751
10008           44.566642
10011          214.661170
10012          125.775583
10016          163.973219
10018           27.659157
10019          160.901128
dtype: float64

回答by wwii

You should be able to operate on the wholething. Im not very familiar with Pandas so I'll just work with the underlying numpyarrays. Using your data d:

你应该能够对整个事情进行操作。我对 Pandas 不是很熟悉,所以我只会使用底层numpy数组。使用您的数据d

df = pd.DataFrame(d)
df1 = df.astype(float)
a = np.radians(df1.values[:,1:])
# a.shape is 10,2, it contains the Lat/Lon only

# transpose and subtract
# add a new axes so they can be broadcast
diff = a[...,np.newaxis] - a.T
# diff.shape is (10,2,10): dLat is diff[:,0,:], dLon is diff[:,1,:]

b = np.square(np.sin(diff / 2))
# b.shape is (10,2,10): sin^2(dLat/2) is b[:,0,:], sin^2(dLon/2) is b[:,1,:]

# make this term: cos(Lat1) * cos(Lat2)
cos_Lat = np.cos(a[:,0])
c = cos_Lat * cos_Lat[:, np.newaxis]    # shape 10x10

# sin^2(dLon/2) is b[:,1,:]
b[:,1,:] = b[:,1,:] * c
g = b.sum(axis = 1)
h = 6367000 * 2 * np.arcsin((np.sqrt(g)))   # meters

Back to a pandas.DataFrame

回到一个 pandas.DataFrame

df2 = pd.DataFrame(h, index = df['Provider ID'].values, columns = df['Provider ID'].values)

I didn't try any performance tests. There is a lot of intermediate array creation going on and it can be expensive - using the optional output argument of the ufuncsmight alleviate that.

我没有尝试任何性能测试。有很多中间数组的创建正在进行并且可能很昂贵 - 使用 的可选输出参数ufuncs可能会缓解这一点。

Same thing with in-placeoperations:

就地操作相同:

df = pd.DataFrame(d)
df_A = df.astype(float)
z = df_A.values[:,1:]

# cos(Lat1) * cos(Lat2)
w = np.cos(z[:,0])
w = w * w[:, np.newaxis]    # w.shape is (10,10)

# sin^2(dLat/2) and sin^2(dLon/2)
np.radians(z, z)
z = z[...,np.newaxis] - z.T
np.divide(z, 2, z)
np.sin(z, z)
np.square(z,z)
# z.shape is now (10,2,10): sin^2(dLat/2) is z[:,0,:], sin^2(dLon/2) is z[:,1,:]

# cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
np.multiply(z[:,1,:], w, z[:,1,:])

# sin^2(dLat/2) + cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
z = z.sum(axis = 1)

np.sqrt(z, z)
np.arcsin(z,z)
np.multiply(z, 6367000 * 2, z)   #meters

df_B = pd.DataFrame(z, index = df['Provider ID'].values, columns = df['Provider ID'].values)