Applying a function to sets of columns in Pandas, "looping" over the entire data frame column-wise

Question

Asked by Astrid

Here is a test example to show what I am trying to achieve. Here's a toy data frame:

headers = 'Time A_x A_y A_z B_x B_y B_z'.split()
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

Which gives

    Time       A_x       A_y       A_z       B_x       B_y       B_z
1  -0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
2  -0.206369 -0.112098 -1.122609  0.218538 -0.878985  0.566872 -1.048862
3  -0.194552  0.818276 -1.563931  0.097377  1.641384 -0.766217 -1.482096
4   0.502731  0.766515 -0.650482 -0.087203 -0.089075  0.443969  0.354747
5   1.411380 -2.419204 -0.882383  0.005204 -0.204358 -0.999242 -0.395236
6   1.036695  1.115630  0.081825 -1.038442  0.515798 -0.060016  2.669702
7   0.392943  0.226386  0.039879  0.732611 -0.073447  1.164285  1.034357
8  -1.253264  0.389148  0.158289  0.440282 -1.195860  0.872064  0.906377
9  -0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
10  0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863

What I want to do is simply to calculate the length of the vector for each header (A and B in this case), for each index, and divide by the Time column. Hence, the function needs to be np.sqrt(A_x^2 + A_y^2 + A_z^2), and the same for B of course. That is, I am looking to calculate the velocity for each row, but three columns contribute to one velocity result.

I have tried using df.groupby and df.filter to loop over the columns, but I cannot really get it to work, because I am not at all sure how to apply effectively the same function to chunks of the data frame all in one go (as apparently one is supposed to avoid looping over rows). I have tried doing

df = df.apply(lambda x: np.sqrt(x.dot(x)), axis=1)

This works of course, but only if the input data frame has the right number of columns (3). If it has more, the dot product is calculated over the entire row, and not in chunks of three columns, which is what I want (because each chunk corresponds to the tag coordinates, which are three-dimensional).

So this is what I am eventually trying to get with the above example (the arrays below are just filled with random numbers, not the actual velocities I am trying to calculate; they only show the shape I am trying to achieve):

     Velocity_A  Velocity_B
1    -0.975633   -2.669544
2     0.766405   -0.264904
3     0.425481   -0.429894
4    -0.437316    0.954006
5     1.073352   -1.475964
6    -0.647534    0.937035
7     0.082517    0.438112
8    -0.387111   -1.417930
9    -0.111011    1.068530
10    0.451979   -0.053333

My actual data is 50,000 x 36 (so there are 12 tags with x,y,z coordinates), and I want to calculate the velocity all in one go to avoid iterating (if at all possible). There is also a time column of the same length (50,000x1).

How do you do this?

Thanks, Astrid

Answer 1

Answered by Nipun Batra

A possible start.

First, select the column names that belong to a particular vector. For example:

In [20]: list(filter(lambda x: x.startswith("A_"),df.columns))
Out[20]: ['A_x', 'A_y', 'A_z']

Sub-selecting these columns from the DataFrame:

In [22]: df[list(filter(lambda x: x.startswith("A_"),df.columns))]
Out[22]: 
         A_x       A_y       A_z
1  -0.123527 -0.547239 -0.453707
2  -0.112098 -1.122609  0.218538
3   0.818276 -1.563931  0.097377
4   0.766515 -0.650482 -0.087203
5  -2.419204 -0.882383  0.005204
6   1.115630  0.081825 -1.038442
7   0.226386  0.039879  0.732611
8   0.389148  0.158289  0.440282
9  -0.308314 -0.839347 -0.517989
10  0.473552  0.059428  0.726088

So, using this technique you can get chunks of 3 columns. For example:

column_initials = ["A","B"]
for column_initial in column_initials:
    # list(...) is needed on Python 3, where filter() returns an iterator
    cols = list(filter(lambda x: x.startswith(column_initial+"_"),df.columns))
    df["Velocity_"+column_initial]=df[cols].apply(lambda x: np.sqrt(x.dot(x)), axis=1)/df.Time


In [32]: df[['Velocity_A','Velocity_B']]
Out[32]: 
    Velocity_A  Velocity_B
1    -9.555311  -22.467965
2    -5.568487   -7.177625
3    -9.086257  -12.030091
4     2.007230    1.144208
5     1.824531    0.775006
6     1.472305    2.623467
7     1.954044    3.967796
8    -0.485576   -1.384815
9    -7.736036   -6.722931
10    1.392823    5.369757

I do not get the same answer as yours. But I borrowed your df.apply(lambda x: np.sqrt(x.dot(x)), axis=1) and assume it is correct.

Hope this helps.

Answer 2

Answered by unutbu

Your calculation is more NumPy-ish than Panda-ish, by which I mean the calculation can be expressed somewhat succinctly if you regard your DataFrame as merely a big array, whereas the solution (at least the one I came up with) is more complicated when you try to wrangle the DataFrame with melt, groupby, etc.

The entire calculation can be expressed in essentially one line:

np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
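
To see what each piece of that line does, it may help to trace it on a tiny input. The sketch below uses a made-up one-row array (two tags, columns ordered x, y, z per tag) and a made-up time value; the intermediate variable names are only for illustration.

```python
import numpy as np

# Hypothetical toy input: one row, two tags (A and B), three coordinates each.
arr = np.array([[3.0, 4.0, 0.0, 1.0, 2.0, 2.0]])
times = np.array([2.0])

squared = arr ** 2                               # shape (1, 6)
per_tag = squared.reshape(arr.shape[0], -1, 3)   # shape (1, 2, 3): one (x, y, z) triple per tag
lengths = np.sqrt(per_tag.sum(axis=-1))          # shape (1, 2): vector length per tag
velocity = lengths / times[:, None]              # times[:, None] has shape (1, 1) and broadcasts per row
print(velocity)
```

The reshape only works this way if the coordinate columns of each tag sit next to each other and every tag has exactly three of them.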

So here is the NumPy way:

import numpy as np
import pandas as pd
import io
content = '''
Time       A_x       A_y       A_z       B_x       B_y       B_z
-0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
-0.206369 -0.112098 -1.122609  0.218538 -0.878985  0.566872 -1.048862
-0.194552  0.818276 -1.563931  0.097377  1.641384 -0.766217 -1.482096
 0.502731  0.766515 -0.650482 -0.087203 -0.089075  0.443969  0.354747
 1.411380 -2.419204 -0.882383  0.005204 -0.204358 -0.999242 -0.395236
 1.036695  1.115630  0.081825 -1.038442  0.515798 -0.060016  2.669702
 0.392943  0.226386  0.039879  0.732611 -0.073447  1.164285  1.034357
-1.253264  0.389148  0.158289  0.440282 -1.195860  0.872064  0.906377
-0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
 0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863'''

# content is a str, so wrap it in StringIO; the header row is inferred by default
df = pd.read_csv(io.StringIO(content), sep=r'\s+')

arr = df.values
times = arr[:,0]
arr = arr[:,1:]
result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in list('AB')])
print(result)

which yields

   Velocity_A  Velocity_B
0   -9.555311  -22.467965
1   -5.568487   -7.177625
2   -9.086257  -12.030091
3    2.007230    1.144208
4    1.824531    0.775006
5    1.472305    2.623467
6    1.954044    3.967796
7   -0.485576   -1.384815
8   -7.736036   -6.722931
9    1.392823    5.369757

Since your actual DataFrame has shape (50000, 36), choosing a quick method may be important. Here is a benchmark:

import numpy as np
import pandas as pd
import string

N = 12
col_ids = string.ascii_letters[:N]  # string.letters exists only on Python 2
df = pd.DataFrame(
    np.random.randn(50000, 3*N+1), 
    columns=['Time']+['{}_{}'.format(letter, coord) for letter in col_ids
                      for coord in list('xyz')])


def using_numpy(df):
    arr = df.values
    times = arr[:,0]
    arr = arr[:,1:]
    result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
    result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in col_ids])
    return result

def using_loop(df):
    results = pd.DataFrame(index=df.index) # the result container
    for id in col_ids:
        results['Velocity_'+id] = np.sqrt((df.filter(regex=id+'_')**2).sum(axis=1))/df.Time
    return results

Using IPython:

In [43]: %timeit using_numpy(df)
10 loops, best of 3: 34.7 ms per loop

In [44]: %timeit using_loop(df)
10 loops, best of 3: 82 ms per loop
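
Before trusting the timings, it may be worth confirming that the two approaches return the same numbers. Here is a minimal check, assuming Python 3 and a smaller, hypothetical 1000-row frame to keep it quick. The `^` anchor in the regex is an extra precaution of mine, not part of the benchmark code.

```python
import string
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
col_ids = string.ascii_letters[:12]
columns = ['Time'] + ['{}_{}'.format(letter, coord)
                      for letter in col_ids for coord in 'xyz']
df = pd.DataFrame(rng.standard_normal((1000, len(columns))), columns=columns)

def using_numpy(df):
    arr = df.values
    times = arr[:, 0]
    arr = arr[:, 1:]
    result = np.sqrt((arr**2).reshape(arr.shape[0], -1, 3).sum(axis=-1)) / times[:, None]
    return pd.DataFrame(result, columns=['Velocity_%s' % x for x in col_ids])

def using_loop(df):
    results = pd.DataFrame(index=df.index)
    for tag in col_ids:
        # '^' anchors the pattern so only columns starting with '<tag>_' match
        cols = df.filter(regex='^' + tag + '_')
        results['Velocity_' + tag] = np.sqrt((cols**2).sum(axis=1)) / df.Time
    return results

same = np.allclose(using_numpy(df).values, using_loop(df).values)
print(same)
```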

Answer 3

Answered by K.-Michael Aye

I would do at least a loop over the tag identifier, but don't worry, that's a very fast loop that just determines the filter pattern to get the right columns:

df = pd.DataFrame(np.random.randn(10,7), index=range(1,11), columns='Time A_x A_y A_z B_x B_y B_z'.split())

col_ids = ['A', 'B'] # I guess you can create that one easily

results = pd.DataFrame(index=df.index) # the result container

for id in col_ids:
    results['Velocity_'+id] = np.sqrt((df.filter(regex=id+'_')**2).sum(axis=1))/df.Time
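
If the tag list is not known up front, one way to build col_ids is to read it off the column names. This is only a sketch, assuming every coordinate column is named '<tag>_<axis>' and that 'Time' is the only other column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 7), index=range(1, 11),
                  columns='Time A_x A_y A_z B_x B_y B_z'.split())

# Take the part before the underscore of each coordinate column and deduplicate it.
col_ids = sorted({name.split('_')[0] for name in df.columns if name != 'Time'})
print(col_ids)  # ['A', 'B']
```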

Answer 4

Answered by 8one6

A one-liner, split over many lines for readability:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame(
        np.random.randn(10,7), 
        index=range(1,11), 
        columns='Time A_x A_y A_z B_x B_y B_z'.split()
        )

result = df\
    .loc[:, df.columns.values!='Time']\
    .T\
    .groupby(lambda x: x[0])\
    .apply(lambda x: np.sqrt((x ** 2).sum()))\
    .T\
    .apply(lambda x: x / df['Time'])

print(result)

            A          B
1    1.404626   1.310639
2   -2.954644 -10.874091
3    3.479836   6.105961
4    3.885530   2.244544
5    0.995012   1.434228
6   11.278208  11.454466
7   -1.209242  -1.281165
8   -5.175911  -5.905070
9   11.889318  16.758958
10  -0.978014  -0.590767

Note: I am a bit frustrated that I needed to throw in the two transposes. I just couldn't get groupby and apply to play nicely with axis=1. If someone could show me how to do that, I'd be very grateful. The trick here was knowing that when you call groupby(lambda x: f(x)), x is the value of the index for each row. So groupby(lambda x: x[0]) groups by the first letter of the row index. After the transposition, this is A or B.

Ok, no more transposes:

result = df\
    .loc[:, df.columns!='Time']\
    .groupby(lambda x: x[0], axis=1)\
    .apply(lambda x: np.sqrt((x**2).sum(1)))\
    .apply(lambda x: x / df['Time'])

print(result)

            A          B
1    1.404626   1.310639
2   -2.954644 -10.874091
3    3.479836   6.105961
4    3.885530   2.244544
5    0.995012   1.434228
6   11.278208  11.454466
7   -1.209242  -1.281165
8   -5.175911  -5.905070
9   11.889318  16.758958
10  -0.978014  -0.590767
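
A variant that avoids axis=1 altogether may be useful, since newer pandas releases (2.1 and later, as far as I am aware) emit a deprecation warning for groupby(..., axis=1). The sketch below is not part of the original answer. It squares first, transposes once for the grouping, and assumes the tag is the first character of each column name.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 7), index=range(1, 11),
                  columns='Time A_x A_y A_z B_x B_y B_z'.split())

coords = df.drop(columns='Time')
# After .T the column names form the row index, so groupby sees 'A_x', 'A_y', ...
sum_sq = (coords ** 2).T.groupby(lambda name: name[0]).sum().T
# Divide each row by its Time value (axis=0 aligns on the row index).
result = np.sqrt(sum_sq).div(df['Time'], axis=0)
print(result)
```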
