Disclaimer: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA terms: link to the original question and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/34953988/
More efficient way to mean center a sub-set of columns in a pandas dataframe and retain column names
Asked by R_Queery
I have a dataframe with about 370 columns. I'm testing a series of hypotheses that require me to use subsets of the data to fit a cubic regression model. I'm planning on using statsmodels to model this data.
Part of the process for polynomial regression involves mean-centering variables (subtracting the column mean from every case of a particular feature).
I can do this with 3 lines of code, but it seems inefficient, given that I need to replicate this process for half a dozen hypotheses. Keep in mind that I need data at the coefficient level from the statsmodels output, so I need to retain the column names.
Here's a peek at the data. It's the subset of columns I need for one of my hypothesis tests.
i we you shehe they ipron
0 0.51 0 0 0.26 0.00 1.02
1 1.24 0 0 0.00 0.00 1.66
2 0.00 0 0 0.00 0.72 1.45
3 0.00 0 0 0.00 0.00 0.53
Here is the code that mean-centers the variables and keeps the column names.
import pandas as pd
from sklearn import preprocessing
#create df of features for hypothesis, from full dataframe
h2 = df[['i', 'we', 'you', 'shehe', 'they', 'ipron']]
#center the variables -- note these must be booleans, not strings:
#the string 'False' is truthy, so with_std='False' would still divide by the std
x_centered = preprocessing.scale(h2, with_mean=True, with_std=False)
#convert back into a Pandas dataframe and add column names
x_centered_df = pd.DataFrame(x_centered, columns=h2.columns)
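Since the same steps are repeated for each hypothesis, the three lines above can be wrapped in a small reusable helper (a sketch; the function name `center_columns` is mine, not from the original post):

```python
import pandas as pd
from sklearn import preprocessing

def center_columns(df, cols):
    """Mean-center the given columns, returning a DataFrame
    with the original column names and index."""
    subset = df[cols]
    centered = preprocessing.scale(subset, with_mean=True, with_std=False)
    return pd.DataFrame(centered, columns=subset.columns, index=subset.index)
```

Each hypothesis then becomes one call, e.g. `h2_centered = center_columns(df, ['i', 'we', 'you', 'shehe', 'they', 'ipron'])`.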
Any recommendations on how to make this more efficient / faster would be awesome!
Answered by Stefan
df.apply(lambda x: x-x.mean())
%timeit df.apply(lambda x: x-x.mean())
1000 loops, best of 3: 2.09 ms per loop
df.subtract(df.mean())
%timeit df.subtract(df.mean())
1000 loops, best of 3: 902 μs per loop
both yielding:
i we you shehe they ipron
0 0.0725 0 0 0.195 -0.18 -0.145
1 0.8025 0 0 -0.065 -0.18 0.495
2 -0.4375 0 0 -0.065 0.54 0.285
3 -0.4375 0 0 -0.065 -0.18 -0.635
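Because pandas aligns arithmetic on column labels, the subtraction keeps the original column names and index with no extra step. A quick check using the sample data from the question (a sketch):

```python
import pandas as pd

# the subset of columns shown in the question
df = pd.DataFrame({
    'i':     [0.51, 1.24, 0.00, 0.00],
    'we':    [0.0, 0.0, 0.0, 0.0],
    'you':   [0.0, 0.0, 0.0, 0.0],
    'shehe': [0.26, 0.00, 0.00, 0.00],
    'they':  [0.00, 0.00, 0.72, 0.00],
    'ipron': [1.02, 1.66, 1.45, 0.53],
})

centered = df.subtract(df.mean())

print(list(centered.columns))  # column names are retained as-is
print(centered)
```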
Answered by Jasper Schwenzow
I know this question is a little old, but by now scikit-learn is the fastest solution. Plus, you can condense the code into one line:
pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False), columns=df.columns)
%timeit pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False), columns=df.columns)
684 μs ± 30.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.subtract(df.mean())
%timeit df.subtract(df.mean())
1.63 ms ± 107 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The df I used for testing:
df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20,5)),columns = list('abcde'))
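For the asker's actual use case (centering only a subset of ~370 columns), the same label-aligned subtraction can be applied to a column slice directly; a sketch using the test frame above (the column list `cols` is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20, 5)),
                  columns=list('abcde')).astype(float)

cols = ['a', 'c', 'e']                 # the subset to mean-center
df[cols] = df[cols] - df[cols].mean()  # pandas aligns on column labels

print(df[cols].mean().round(12))       # each centered column now has mean ~0
```

The `.astype(float)` cast keeps the assignment of float-centered values into originally-integer columns clean.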