Disclaimer: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA terms: link to the original question and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/34953988/
More efficient way to mean center a sub-set of columns in a pandas dataframe and retain column names
Asked by R_Queery
I have a dataframe with about 370 columns. I'm testing a series of hypotheses that require me to use subsets of the data to fit a cubic regression model. I'm planning on using statsmodels to model this data.
Part of the process for polynomial regression involves mean-centering variables (subtracting the column mean from every case of a particular feature).
I can do this with 3 lines of code, but it seems inefficient, given that I need to replicate this process for half a dozen hypotheses. Keep in mind that I need data at the coefficient level from the statsmodels output, so I need to retain the column names.
Here's a peek at the data. It's the subset of columns I need for one of my hypothesis tests.
i we you shehe they ipron
0 0.51 0 0 0.26 0.00 1.02
1 1.24 0 0 0.00 0.00 1.66
2 0.00 0 0 0.00 0.72 1.45
3 0.00 0 0 0.00 0.00 0.53
Here is the code that mean-centers the variables and keeps the column names.
import pandas as pd
from sklearn import preprocessing
#create df of features for hypothesis, from full dataframe
h2 = df[['i', 'we', 'you', 'shehe', 'they', 'ipron']]
#center the variables -- note these must be booleans, not strings:
#the string 'False' is truthy, so with_std='False' would still divide by the std
x_centered = preprocessing.scale(h2, with_mean=True, with_std=False)
#convert back into a Pandas dataframe and add column names
x_centered_df = pd.DataFrame(x_centered, columns=h2.columns)
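Since the same steps are repeated for each hypothesis, the three lines above can be wrapped in a small reusable helper (a sketch; the function name `center_columns` is mine, not from the original post):

```python
import pandas as pd
from sklearn import preprocessing

def center_columns(df, cols):
    """Mean-center the given columns, returning a DataFrame
    with the original column names and index."""
    subset = df[cols]
    centered = preprocessing.scale(subset, with_mean=True, with_std=False)
    return pd.DataFrame(centered, columns=subset.columns, index=subset.index)
```

Each hypothesis then becomes one call, e.g. `h2_centered = center_columns(df, ['i', 'we', 'you', 'shehe', 'they', 'ipron'])`.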
Any recommendations on how to make this more efficient / faster would be awesome!
Answered by Stefan
df.apply(lambda x: x-x.mean())
%timeit df.apply(lambda x: x-x.mean())
1000 loops, best of 3: 2.09 ms per loop
df.subtract(df.mean())
%timeit df.subtract(df.mean())
1000 loops, best of 3: 902 μs per loop
both yielding:
i we you shehe they ipron
0 0.0725 0 0 0.195 -0.18 -0.145
1 0.8025 0 0 -0.065 -0.18 0.495
2 -0.4375 0 0 -0.065 0.54 0.285
3 -0.4375 0 0 -0.065 -0.18 -0.635
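Because pandas aligns arithmetic on column labels, the subtraction keeps the original column names and index with no extra step. A quick check using the sample data from the question (a sketch):

```python
import pandas as pd

# the subset of columns shown in the question
df = pd.DataFrame({
    'i':     [0.51, 1.24, 0.00, 0.00],
    'we':    [0.0, 0.0, 0.0, 0.0],
    'you':   [0.0, 0.0, 0.0, 0.0],
    'shehe': [0.26, 0.00, 0.00, 0.00],
    'they':  [0.00, 0.00, 0.72, 0.00],
    'ipron': [1.02, 1.66, 1.45, 0.53],
})

centered = df.subtract(df.mean())

print(list(centered.columns))  # column names are retained as-is
print(centered)
```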
Answered by Jasper Schwenzow
I know this question is a little old, but by now scikit-learn is the fastest solution. Plus, you can condense the code into one line:
pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False), columns=df.columns)
%timeit pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False), columns=df.columns)
684 μs ± 30.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.subtract(df.mean())
%timeit df.subtract(df.mean())
1.63 ms ± 107 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The df I used for testing:
df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20,5)),columns = list('abcde'))
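For the asker's actual use case (centering only a subset of ~370 columns), the same label-aligned subtraction can be applied to a column slice directly; a sketch using the test frame above (the column list `cols` is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20, 5)),
                  columns=list('abcde')).astype(float)

cols = ['a', 'c', 'e']                 # the subset to mean-center
df[cols] = df[cols] - df[cols].mean()  # pandas aligns on column labels

print(df[cols].mean().round(12))       # each centered column now has mean ~0
```

The `.astype(float)` cast keeps the assignment of float-centered values into originally-integer columns clean.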