Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/14847551/
Pandas: DataFrame.sum() or DataFrame().as_matrix.sum()
Asked by sanguineturtle
I am writing a function that computes the conditional probability for all columns in a pd.DataFrame that has ~800 columns. I wrote a few versions of the function and found a very big difference in compute time between two primary options:
col_sums = data.sum() #Simple Column Sum over 800 x 800 DataFrame
Option #1:{'col_sums' and 'data' are a Series and DataFrame respectively}
[This is contained within a loop over index1 and index2 to get all combinations]
joint_occurance = data[index1] * data[index2]  # element-wise product of the two columns
sum_joint_occurance = joint_occurance.sum()
max_single_occurance = max(col_sums[index1], col_sums[index2])  # col_sums as defined above
cond_prob = sum_joint_occurance / max_single_occurance  # Symmetric Conditional Prob
results[index1][index2] = cond_prob
Vs.
Option #2: [While looping over index1 and index2 to get all combinations] The only difference is that, instead of using the DataFrame, I exported data_matrix to a np.array prior to looping:
new_data = data.T.as_matrix()  # Type: np.array
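As a concrete illustration of Option #2, here is a minimal, self-contained sketch using a hypothetical toy DataFrame in place of the 800 x 800 data (note: as_matrix() has since been removed from pandas, so .to_numpy() is used here instead):

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 occurrence data standing in for the real 800-column DataFrame.
data = pd.DataFrame({"a": [1, 1, 0, 1],
                     "b": [1, 0, 0, 1],
                     "c": [0, 1, 1, 0]})

# Pull the values out as a plain NumPy array once, up front.
new_data = data.T.to_numpy()       # shape: (n_cols, n_rows)
col_sums = new_data.sum(axis=1)    # plain ndarray column sums

n = new_data.shape[0]
results = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        joint = (new_data[i] * new_data[j]).sum()
        denom = max(col_sums[i], col_sums[j])
        results[i, j] = joint / denom if denom else 0.0
```

All element access inside the loop now hits raw ndarrays rather than going through pandas' label-aware indexing.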
Option #1 runtime: ~1700 sec
Option #2 runtime: ~122 sec
Questions:
- Is converting the contents of DataFrames to np.arrays best for computational tasks?
- Is the .sum() routine in pandas significantly different from the .sum() routine in NumPy, or is the difference in speed due to label-based access to the data?
- Why are these runtimes so different?
Accepted answer by sanguineturtle
While reading the documentation I came across:
Section 7.1.1 Fast scalar value getting and setting
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the get_value method, which is implemented on all of the data structures:
In [656]: s.get_value(dates[5])
Out[656]: -0.67368970808837059
In [657]: df.get_value(dates[5], 'A')
Out[657]: -0.67368970808837059
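(Editor's note: get_value() was later deprecated and removed in modern pandas; the fast scalar accessors today are .at for label-based access and .iat for position-based access. A small sketch with hypothetical data:)

```python
import pandas as pd

df = pd.DataFrame({"A": [0.1, -0.5, 2.3]}, index=["x", "y", "z"])

v1 = df.at["y", "A"]   # label-based scalar access (successor to get_value)
v2 = df.iat[1, 0]      # position-based scalar access
```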
Best guess: Because I am accessing individual data elements many times from the DataFrame (on the order of ~640,000 lookups per matrix), I think the slowdown came from how I referenced the data (i.e. "indexing with [] handles a lot of cases"), and therefore I should use the get_value() method for accessing scalars, similar to a matrix lookup.
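Going a step further than the accepted answer: for 0/1 occurrence data, the double loop can be eliminated entirely, because all pairwise joint-occurrence counts are given by a single matrix product. A sketch of this fully vectorized approach, assuming a hypothetical toy occurrence matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 occurrence data (rows = observations, columns = items).
data = pd.DataFrame({"a": [1, 1, 0, 1],
                     "b": [1, 0, 0, 1],
                     "c": [0, 1, 1, 0]})

m = data.to_numpy()
joint = m.T @ m                     # joint[i, j] == sum(data[i] * data[j]) for all pairs at once
col_sums = m.sum(axis=0)
denom = np.maximum.outer(col_sums, col_sums)   # max(col_sums[i], col_sums[j]) for all pairs
cond_prob = np.divide(joint, denom,
                      out=np.zeros_like(joint, dtype=float),
                      where=denom > 0)         # guard against all-zero columns
```

This replaces ~640,000 scalar lookups with a handful of array operations, which is typically much faster than either looped option.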