
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/21567842/

Date: 2020-08-18 23:09:27  Source: igfitidea

Is there a difference in computation for Numpy vs Pandas?

Tags: python, numpy, pandas

Asked by Terence Chow

I've written a bunch of code on the assumption that I was going to use Numpy arrays. Turns out the data I am getting is loaded through Pandas. I remember now that I loaded it in Pandas because I was having some problems loading it in Numpy. I believe the data was just too large.


Therefore I was wondering, is there a difference in computational ability when using Numpy vs Pandas?


If Pandas is more efficient, I would rather rewrite all my code for Pandas; but if there is no gain in efficiency, I'll just use a numpy array...


Accepted answer by Mark

There can be a significant performance difference: on the order of a single magnitude for multiplications, and multiple orders of magnitude for indexing a few random values.

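As a rough illustration of the kind of gap described above, one could time an elementwise multiply and a scalar lookup on both containers. This is a hypothetical micro-benchmark (the array size and iteration counts are arbitrary), not the benchmark from the linked comparison:

```python
import timeit

import numpy as np
import pandas as pd

arr = np.random.rand(100_000)
ser = pd.Series(arr)

# Elementwise multiplication: both dispatch to the same C loops,
# but the Series adds index-alignment and wrapping overhead per call.
t_np = timeit.timeit(lambda: arr * arr, number=1000)
t_pd = timeit.timeit(lambda: ser * ser, number=1000)

# Scalar indexing: the Series goes through its index machinery,
# which makes a single lookup far slower than arr[i].
t_np_idx = timeit.timeit(lambda: arr[500], number=10_000)
t_pd_idx = timeit.timeit(lambda: ser[500], number=10_000)

print(f"multiply: numpy {t_np:.4f}s vs pandas {t_pd:.4f}s")
print(f"indexing: numpy {t_np_idx:.4f}s vs pandas {t_pd_idx:.4f}s")
```

Exact ratios depend on the machine and versions, but the indexing gap is usually the dramatic one.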

I was actually wondering about the same thing and came across this interesting comparison: http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/


Answered by Gaurav

I think it's more about using the two strategically and shifting data around (from numpy to pandas or vice versa) based on the performance you see. As a recent example, I was trying to concatenate 4 small pickle files, each with 10k rows (data.shape -> (10,000, 4)), using numpy.


Code was something like:


import glob
import joblib
import numpy as np

n_concat = np.empty((0, 4))
for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    n_concat = np.vstack((n_concat, n_data))  # copies the whole accumulated array each pass
joblib.dump(n_concat, 'data/save_file.pkl', compress=True)
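For what it's worth, repeated np.vstack in a loop re-allocates and copies the entire accumulated array on every iteration, which may explain part of the pain; collecting the arrays in a list and stacking once avoids that. A minimal sketch, with random arrays standing in for the pickle contents:

```python
import numpy as np

# Hypothetical stand-ins for the four pickled arrays.
blocks = [np.random.rand(10_000, 4) for _ in range(4)]

# Accumulate in a list and stack once at the end:
# one big allocation instead of a full copy per file.
stacked = np.vstack(blocks)
print(stacked.shape)  # (40000, 4)
```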

This crashed my laptop (8 GB RAM, i5), which was surprising since the volume wasn't really that huge. The 4 compressed pickle files were roughly 5 MB each.


The same thing worked great in pandas:


import glob
import joblib
import pandas as pd

for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    try:
        df = pd.concat([df, pd.DataFrame(n_data, columns = [...])])
    except NameError:  # first file: df doesn't exist yet
        df = pd.DataFrame(n_data, columns = [...])
joblib.dump(df, 'data/save_file.pkl', compress = True)
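A side note: growing a DataFrame with pd.concat inside the loop also re-copies the accumulated data each iteration; the pattern the pandas docs recommend is to build a list of frames and concatenate once. A hedged sketch (the column names and random arrays here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the loaded pickle contents.
arrays = [np.random.rand(10_000, 4) for _ in range(4)]
frames = [pd.DataFrame(a, columns=["c0", "c1", "c2", "c3"]) for a in arrays]

# Concatenate once instead of growing df inside the loop.
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # (40000, 4)
```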

On the other hand, when I was implementing gradient descent by iterating over a pandas data frame, it was horribly slow, while using numpy for the job was much quicker.

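The row-iteration slowdown is easy to reproduce. Below is a hypothetical sketch (synthetic data, made-up shapes and column names, a single least-squares gradient step rather than a full descent loop) comparing df.iterrows against the equivalent vectorized numpy expression:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; shapes and column names are illustrative only.
rng = np.random.default_rng(0)
X = rng.random((2_000, 3))
y = rng.random(2_000)
df = pd.DataFrame(X, columns=["x0", "x1", "x2"])
w = np.zeros(3)

def grad_iterrows():
    # Row by row: every iteration materializes a fresh pandas Series.
    g = np.zeros(3)
    for i, row in df.iterrows():
        g += (row.to_numpy() @ w - y[i]) * row.to_numpy()
    return g / len(df)

def grad_vectorized():
    # One pass over the whole array: no per-row Python overhead.
    return X.T @ (X @ w - y) / len(X)

t_rows = timeit.timeit(grad_iterrows, number=1)
t_vec = timeit.timeit(grad_vectorized, number=100) / 100
print(f"iterrows: {t_rows:.4f}s per call, vectorized: {t_vec:.6f}s per call")
```

Both functions compute the same gradient; only the per-row Python and Series-construction overhead differs, and it typically dominates by orders of magnitude.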

In general, I've seen that pandas usually works better for moving around/munging moderately large chunks of data and doing common column operations, while numpy works best for vectorized and recursive (maybe more math-intensive) work over smaller sets of data.


Moving data between the two is hassle-free, so I guess using both strategically is the way to go.

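To illustrate how cheap the round-trip is, here is a minimal sketch (column names are made up) converting an ndarray to a DataFrame and back:

```python
import numpy as np
import pandas as pd

arr = np.arange(12, dtype=float).reshape(4, 3)

# ndarray -> DataFrame -> ndarray; for a homogeneous dtype
# pandas can often reuse the underlying buffer rather than copy.
df = pd.DataFrame(arr, columns=["a", "b", "c"])
back = df.to_numpy()

print(np.array_equal(back, arr))  # True
```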