Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/14847551/
Pandas: DataFrame.sum() or DataFrame().as_matrix.sum()
Asked by sanguineturtle
I am writing a function that computes the conditional probability for all columns in a pd.DataFrame that has ~800 columns. I wrote a few versions of the function and found a very big difference in compute time between two primary options:
col_sums = data.sum() #Simple Column Sum over 800 x 800 DataFrame
Option #1:{'col_sums' and 'data' are a Series and DataFrame respectively}
[This is contained within a loop over index1 and index2 to get all combinations]
joint_occurance = data[index1] * data[index2]  # element-wise product of the two columns
sum_joint_occurance = joint_occurance.sum()
max_single_occurance = max(col_sums[index1], col_sums[index2])  # col_sums as defined above
cond_prob = sum_joint_occurance / max_single_occurance  # Symmetric Conditional Prob
results[index1][index2] = cond_prob
Vs.
Option #2: [While looping over index1 and index2 to get all combinations] The only difference is that, instead of using the DataFrame, I exported data_matrix to a np.array prior to looping:
new_data = data.T.as_matrix()  # Type: np.array
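As a concrete illustration of Option #2, here is a minimal, self-contained sketch using a hypothetical toy DataFrame in place of the 800 x 800 data (note: as_matrix() has since been removed from pandas, so .to_numpy() is used here instead):

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 occurrence data standing in for the real 800-column DataFrame.
data = pd.DataFrame({"a": [1, 1, 0, 1],
                     "b": [1, 0, 0, 1],
                     "c": [0, 1, 1, 0]})

# Pull the values out as a plain NumPy array once, up front.
new_data = data.T.to_numpy()       # shape: (n_cols, n_rows)
col_sums = new_data.sum(axis=1)    # plain ndarray column sums

n = new_data.shape[0]
results = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        joint = (new_data[i] * new_data[j]).sum()
        denom = max(col_sums[i], col_sums[j])
        results[i, j] = joint / denom if denom else 0.0
```

All element access inside the loop now hits raw ndarrays rather than going through pandas' label-aware indexing.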
Option #1 runtime: ~1700 sec
Option #2 runtime: ~122 sec
Questions:
- Is converting the contents of DataFrames to np.arrays best for computational tasks?
- Is the .sum() routine in pandas significantly different from the .sum() routine in NumPy, or is the difference in speed due to label-based access to the data?
- Why are these runtimes so different?
Accepted answer by sanguineturtle
While reading the documentation I came across:
Section 7.1.1 Fast scalar value getting and setting
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the get_value method, which is implemented on all of the data structures:
In [656]: s.get_value(dates[5])
Out[656]: -0.67368970808837059
In [657]: df.get_value(dates[5], 'A')
Out[657]: -0.67368970808837059
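(Editor's note: get_value() was later deprecated and removed in modern pandas; the fast scalar accessors today are .at for label-based access and .iat for position-based access. A small sketch with hypothetical data:)

```python
import pandas as pd

df = pd.DataFrame({"A": [0.1, -0.5, 2.3]}, index=["x", "y", "z"])

v1 = df.at["y", "A"]   # label-based scalar access (successor to get_value)
v2 = df.iat[1, 0]      # position-based scalar access
```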
Best guess: Because I am accessing individual data elements many times from the DataFrame (on the order of ~640,000 lookups per matrix), I think the slowdown came from how I referenced the data (i.e. "indexing with [] handles a lot of cases"), and therefore I should use the get_value() method for accessing scalars, similar to a matrix lookup.
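Going a step further than the accepted answer: for 0/1 occurrence data, the double loop can be eliminated entirely, because all pairwise joint-occurrence counts are given by a single matrix product. A sketch of this fully vectorized approach, assuming a hypothetical toy occurrence matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 occurrence data (rows = observations, columns = items).
data = pd.DataFrame({"a": [1, 1, 0, 1],
                     "b": [1, 0, 0, 1],
                     "c": [0, 1, 1, 0]})

m = data.to_numpy()
joint = m.T @ m                     # joint[i, j] == sum(data[i] * data[j]) for all pairs at once
col_sums = m.sum(axis=0)
denom = np.maximum.outer(col_sums, col_sums)   # max(col_sums[i], col_sums[j]) for all pairs
cond_prob = np.divide(joint, denom,
                      out=np.zeros_like(joint, dtype=float),
                      where=denom > 0)         # guard against all-zero columns
```

This replaces ~640,000 scalar lookups with a handful of array operations, which is typically much faster than either looped option.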