在 Pandas 中向量化函数
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/27575854/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Vectorizing a function in pandas
提问by DataSwede
I have a dataframe that contains a list of lat/lon coordinates:
我有一个包含纬度/经度坐标列表的数据框:
d = {'Provider ID': {0: '10001',
1: '10005',
2: '10006',
3: '10007',
4: '10008',
5: '10011',
6: '10012',
7: '10016',
8: '10018',
9: '10019'},
'latitude': {0: '31.215379379000467',
1: '34.22133455500045',
2: '34.795039606000444',
3: '31.292159523000464',
4: '31.69311635000048',
5: '33.595265517000485',
6: '34.44060759100046',
7: '33.254429322000476',
8: '33.50314015000049',
9: '34.74643089500046'},
'longitude': {0: ' -85.36146587999968',
1: ' -86.15937514799964',
2: ' -87.68507485299966',
3: ' -86.25539902199966',
4: ' -86.26549483099967',
5: ' -86.66531866799966',
6: ' -85.75726760699968',
7: ' -86.81407933399964',
8: ' -86.80242858299965',
9: ' -87.69893502799965'}}
df = pd.DataFrame(d)
My goal is to use the haversine function to figure out the distances between every item in KM:
我的目标是使用 hasrsine 函数来计算 KM 中每个项目之间的距离:
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
# 6367 km is the radius of the Earth
km = 6367 * c
return km
My goal is to get a dataframe that looks like the result_df below where the values are the distance between each provider id:
我的目标是获得一个类似于下面的 result_df 的数据框,其中的值是每个提供者 ID 之间的距离:
result_df = pd.DataFrame(columns = df['Provider ID'], index=df['Provider ID'])
I can do this in a loop, however it's terribly slow. I'm looking for some help in converting this to a vectorized method:
我可以循环执行此操作,但是速度非常慢。我正在寻找将其转换为矢量化方法的一些帮助:
for first_hospital_coordinates in result_df.columns:
for second_hospital_coordinates in result_df['Provider ID']:
if first_hospital_coordinates == 'Provider ID':
pass
else:
L1 = df[df['Provider ID'] == first_hospital_coordinates]['latitude'].astype('float64').values
O1 = df[df['Provider ID'] == first_hospital_coordinates]['longitude'].astype('float64').values
L2 = df[df['Provider ID'] == second_hospital_coordinates]['latitude'].astype('float64').values
O2 = df[df['Provider ID'] == second_hospital_coordinates]['longitude'].astype('float64').values
distance = haversine(O1, L1, O2, L2)
crit = result_df['Provider ID'] == second_hospital_coordinates
result_df.loc[crit, first_hospital_coordinates] = distance
采纳答案by nitin
To vectorize this code, you will need to operate on complete dataframe and not on the individual lats and longs. I have made an attempt at this. I need the result df and a new function h2,
要矢量化此代码,您需要对完整的数据帧进行操作,而不是对单个经纬度进行操作。我对此进行了尝试。我需要结果 df 和一个新函数 h2,
import numpy as np
def h2(df, p):
inrad = df.applymap(radians)
dlon = inrad.longitude-inrad.longitude[p]
dlat = inrad.latitude-inrad.latitude[p]
lat1 = pd.Series(index = df.index, data = [df.latitude[p] for i in range(len(df.index))])
a = np.sin(dlat/2)*np.sin(dlat/2) + np.cos(df.latitude) * np.cos(lat1) * np.sin(dlon/2)**2
c = 2 * 1/np.sin(np.sqrt(a))
km = 6367 * c
return km
df = df.set_index('Provider ID')
df = df.astype(float)
df2 = pd.DataFrame(index = df.index, columns = df.index)
for c in df2.columns:
df2[c] = h2(df, c)
print (df2)
This should yield, (I can't be sure if I have the correct answer... my goal was to vectorize the code)
这应该会产生,(我不确定我是否有正确的答案......我的目标是对代码进行矢量化)
Provider ID 10001 10005 10006 10007 \
Provider ID
10001 inf 5.021936e+05 5.270062e+05 1.649088e+06
10005 5.021936e+05 inf 9.294868e+05 4.985233e+05
10006 5.270062e+05 9.294868e+05 inf 4.548412e+05
10007 1.649088e+06 4.985233e+05 4.548412e+05 inf
10008 1.460299e+06 5.777248e+05 5.246954e+05 3.638231e+06
10011 6.723581e+05 2.004199e+06 1.027439e+06 6.394402e+05
10012 4.559090e+05 3.265536e+06 7.573411e+05 4.694125e+05
10016 7.680036e+05 1.429573e+06 9.105474e+05 7.517467e+05
10018 7.096548e+05 1.733554e+06 1.020976e+06 6.701920e+05
10019 5.436342e+05 9.278739e+05 2.891822e+07 4.638858e+05
Provider ID 10008 10011 10012 10016 \
Provider ID
10001 1.460299e+06 6.723581e+05 4.559090e+05 7.680036e+05
10005 5.777248e+05 2.004199e+06 3.265536e+06 1.429573e+06
10006 5.246954e+05 1.027439e+06 7.573411e+05 9.105474e+05
10007 3.638231e+06 6.394402e+05 4.694125e+05 7.517467e+05
10008 inf 7.766998e+05 5.401081e+05 9.496953e+05
10011 7.766998e+05 inf 1.341775e+06 4.220911e+06
10012 5.401081e+05 1.341775e+06 inf 1.119063e+06
10016 9.496953e+05 4.220911e+06 1.119063e+06 inf
10018 8.236437e+05 1.242451e+07 1.226941e+06 5.866259e+06
10019 5.372119e+05 1.051748e+06 7.514774e+05 9.362341e+05
Provider ID 10018 10019
Provider ID
10001 7.096548e+05 5.436342e+05
10005 1.733554e+06 9.278739e+05
10006 1.020976e+06 2.891822e+07
10007 6.701920e+05 4.638858e+05
10008 8.236437e+05 5.372119e+05
10011 1.242451e+07 1.051748e+06
10012 1.226941e+06 7.514774e+05
10016 5.866259e+06 9.362341e+05
10018 inf 1.048895e+06
10019 1.048895e+06 inf
[10 rows x 10 columns]
回答by Paul H
You don't need anything fancy, just a few mods to your function.
您不需要任何花哨的东西,只需对您的功能进行一些修改即可。
First, don't use the mathlibrary. If you're doing real math or science, you're likely better off with numpy.
首先,不要使用math图书馆。如果您正在做真正的数学或科学,那么使用 numpy.
Second, we're going to use the dataframe method apply. What applydoes is it takes a function and runs every row (axis=1) or column (axis=0) through it, and builds a new pandas object with all of the returned values. So we need to set up haversinetotake row of a dataframe and unpack the values. It becomes:
其次,我们将使用 dataframe 方法apply。什么apply所做的是它需要一个函数,并通过它运行中的每一行(轴= 1)或柱(轴= 0),并建立一个新的Pandas与所有返回的值的对象。所以我们需要设置haversinetotake row of a dataframe并解包值。它成为了:
def haversine(row):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
import numpy as np
# convert all of the row to radians
row = np.radians(row)
# unpack the values for convenience
lat1 = row['lat1']
lat2 = row['lat2']
lon1 = row['lon1']
lon2 = row['lon2']
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
# 6367 km is the radius of the Earth
km = 6367 * c
return km
Ok, now we need to get your dataframe in shape. In your question, everything is a string and that's not good for doing math. So using your variable d, I said:
好的,现在我们需要让您的数据框成形。在你的问题中,一切都是一个字符串,这不利于数学。所以使用你的变量d,我说:
df = pandas.DataFrame(d).set_index('Provider ID').astype(float)
So that created the dataframe of strings, set the provider as the index, and then converted all of the columns to floats, since we're doing math.
这样就创建了字符串数据框,将提供者设置为索引,然后将所有列转换为浮点数,因为我们正在做数学运算。
Now we need to make rows with two sets of coords. For that we'll use the shiftmethod and join the result to the original dataframe. Doing that all at once looks like this:
现在我们需要用两组坐标制作行。为此,我们将使用该shift方法并将结果连接到原始数据帧。一次完成这一切看起来像这样:
df = df.join(df.shift(), lsuffix='1', rsuffix='2')
print(df.head())
lat1 lon1 lat2 lon2
Provider ID
10001 31.215379 -85.361466 NaN NaN
10005 34.221335 -86.159375 31.215379 -85.361466
10006 34.795040 -87.685075 34.221335 -86.159375
10007 31.292160 -86.255399 34.795040 -87.685075
10008 31.693116 -86.265495 31.292160 -86.255399
rsuffixand lsuffixare what append "1" and "2" to the column names during the join operation.
rsuffix和lsuffix是在连接操作期间将“1”和“2”附加到列名的内容。
The "2" columns are from df.shift()and you'll notice that they are equal to the "1" columns of the previous row. Also you'll see that the first row of the "2" columns are NaNsince there's nothing previousto the first row.
“2”列来自df.shift(),您会注意到它们等于前一行的“1”列。您还会看到“2”列的第一行是NaN因为第一行之前没有任何内容 。
So now we can applythe Haversine function:
所以现在我们可以apply使用Haversine函数:
distance = df.apply(haversine, axis=1)
print(distance)
Provider ID
10001 NaN
10005 342.261590
10006 153.567591
10007 411.393751
10008 44.566642
10011 214.661170
10012 125.775583
10016 163.973219
10018 27.659157
10019 160.901128
dtype: float64
回答by wwii
You should be able to operate on the wholething. Im not very familiar with Pandas so I'll just work with the underlying numpyarrays. Using your data d:
你应该能够对整个事情进行操作。我对 Pandas 不是很熟悉,所以我只会使用底层numpy数组。使用您的数据d:
df = pd.DataFrame(d)
df1 = df.astype(float)
a = np.radians(df1.values[:,1:])
# a.shape is 10,2, it contains the Lat/Lon only
# transpose and subtract
# add a new axes so they can be broadcast
diff = a[...,np.newaxis] - a.T
# diff.shape is (10,2,10): dLat is diff[:,0,:], dLon is diff[:,1,:]
b = np.square(np.sin(diff / 2))
# b.shape is (10,2,10): sin^2(dLat/2) is b[:,0,:], sin^2(dLon/2) is b[:,1,:]
# make this term: cos(Lat1) * cos(Lat2)
cos_Lat = np.cos(a[:,0])
c = cos_Lat * cos_Lat[:, np.newaxis] # shape 10x10
# sin^2(dLon/2) is b[:,1,:]
b[:,1,:] = b[:,1,:] * c
g = b.sum(axis = 1)
h = 6367000 * 2 * np.arcsin((np.sqrt(g))) # meters
Back to a pandas.DataFrame
回到一个 pandas.DataFrame
df2 = pd.DataFrame(h, index = df['Provider ID'].values, columns = df['Provider ID'].values)
I didn't try any performance tests. There is a lot of intermediate array creation going on and it can be expensive - using the optional output argument of the ufuncsmight alleviate that.
我没有尝试任何性能测试。有很多中间数组的创建正在进行并且可能很昂贵 - 使用 的可选输出参数ufuncs可能会缓解这一点。
Same thing with in-placeoperations:
与就地操作相同:
df = pd.DataFrame(d)
df_A = df.astype(float)
z = df_A.values[:,1:]
# cos(Lat1) * cos(Lat2)
w = np.cos(z[:,0])
w = w * w[:, np.newaxis] # w.shape is (10,10)
# sin^2(dLat/2) and sin^2(dLon/2)
np.radians(z, z)
z = z[...,np.newaxis] - z.T
np.divide(z, 2, z)
np.sin(z, z)
np.square(z,z)
# z.shape is now (10,2,10): sin^2(dLat/2) is z[:,0,:], sin^2(dLon/2) is z[:,1,:]
# cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
np.multiply(z[:,1,:], w, z[:,1,:])
# sin^2(dLat/2) + cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
z = z.sum(axis = 1)
np.sqrt(z, z)
np.arcsin(z,z)
np.multiply(z, 6367000 * 2, z) #meters
df_B = pd.DataFrame(z, index = df['Provider ID'].values, columns = df['Provider ID'].values)

