When to use pandas series, numpy ndarrays or simply python dictionaries?
Original source: http://stackoverflow.com/questions/45285743/
Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Asked by Rodolfo Orozco
I am new to learning Python, and some of its libraries (numpy, pandas).
I have found a lot of documentation on how numpy ndarrays, pandas series and python dictionaries work.
But owing to my inexperience with Python, I have had a really hard time determining when to use each one of them. And I haven't found any best practices that help me understand and decide when it is better to use each type of data structure.
As a general matter, are there any best practices for deciding which, if any, of these three data structures a specific data set should be loaded into?
Thanks!
Answer by Xukrao
The rule of thumb that I usually apply: use the simplest data structure that still satisfies your needs. If we rank the data structures from most simple to least simple, it usually ends up like this:
- Dictionaries / lists
- Numpy arrays
- Pandas series / dataframes
So first consider dictionaries / lists. If these allow you to do all data operations that you need, then all is fine. If not, start considering numpy arrays. Some typical reasons for moving to numpy arrays are:
- Your data is 2-dimensional (or higher). Although nested dictionaries/lists can be used to represent multi-dimensional data, in most situations numpy arrays will be more efficient.
- You have to perform a lot of numerical calculations. As already pointed out by zhqiat, numpy will give a significant speed-up in this case. Furthermore, numpy arrays come bundled with a large number of mathematical functions.
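As a rough illustration of the two reasons above, here is a minimal sketch that compares nested lists with a 2-D numpy array. The data and the variable names are made up for the example.

```python
import numpy as np

# Hypothetical 2-D data: 3 rows, 2 columns.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Plain lists: summing each column needs an explicit loop over the rows.
col_sums_list = [sum(row[j] for row in rows) for j in range(2)]

# numpy: the same data as a 2-D array, reduced along axis 0 (down the rows).
arr = np.array(rows)
col_sums_np = arr.sum(axis=0)

# Element-wise math applies to the whole array at once, without a Python loop.
scaled = np.sqrt(arr) * 2
```

Both approaches give the same column sums; the numpy version is usually shorter and, for large arrays, typically faster.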
Then there are also some typical reasons for going beyond numpy arrays and to the more-complex but also more-powerful pandas series/dataframes:
- You have to merge multiple data sets with each other, or reshape/reorder your data. This diagram gives a nice overview of all the 'data wrangling' operations that pandas allows you to do.
- You have to import data from or export data to a specific file format like Excel, HDF5 or SQL. Pandas comes with convenient import/export functions for this.
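The two pandas reasons above might look like this in practice. This is only a sketch: the tables, column names and values are invented, and an in-memory SQLite database stands in for whatever storage you actually use.

```python
import sqlite3
import pandas as pd

# Two hypothetical data sets that share a "sensor" key.
readings = pd.DataFrame({"sensor": ["a", "b", "c"], "value": [1.5, 2.5, 3.5]})
locations = pd.DataFrame({"sensor": ["b", "a"], "room": ["lab", "hall"]})

# Merge on the shared key; how="left" keeps every row of `readings`.
merged = readings.merge(locations, on="sensor", how="left")

# Export to SQL and read it back, using an in-memory SQLite database.
con = sqlite3.connect(":memory:")
merged.to_sql("readings", con, index=False)
round_trip = pd.read_sql("SELECT sensor, room FROM readings ORDER BY sensor", con)
con.close()
```

Sensor "c" has no matching location, so its `room` ends up as a missing value after the merge.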
Answer by Joao Paulo Nogueira
If you want an answer that tells you to stick with just one type of data structure, here is one: use pandas series/dataframe structures.
The pandas series object can be seen as an enhanced numpy 1D array, and the pandas dataframe can be seen as an enhanced numpy 2D array. The main difference is that pandas series and dataframes have an explicit index, while numpy arrays have implicit (positional) indexing. So, in any python code where you would think of using something like
import numpy as np
a = np.array([1,2,3])
you can just use
import pandas as pd
a = pd.Series([1,2,3])
Most functions and methods that work on numpy arrays will also work on pandas series. By analogy, the same can be done with dataframes and numpy 2D arrays.
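A small sketch of what that looks like, and of one place where the explicit index changes behaviour. The labels and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
s = pd.Series([1, 2, 3], index=["x", "y", "z"])  # explicit labels

# numpy functions accept a Series; element-wise ones return a Series
# that keeps the original index.
roots = np.sqrt(s)
total = np.sum(s)

# The array is addressed by position, the Series by label.
second_from_array = a[1]
second_from_series = s["y"]

# Arithmetic between two Series aligns on labels, not on positions.
t = pd.Series([10, 20, 30], index=["z", "y", "x"])
aligned = s + t                  # x: 1 + 30, y: 2 + 20, z: 3 + 10
by_position = a + t.to_numpy()   # 1 + 10, 2 + 20, 3 + 30
```

The label alignment in the last step is the kind of difference to keep in mind when swapping arrays for series in existing code.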
A further question you might have is about the performance differences between a numpy array and a pandas series. Here is a post that shows the differences in performance between these two tools: performance of pandas series vs numpy arrays.
Please note that even though a pandas series performs slightly worse than a numpy array, you can work around this by accessing the values attribute of the pandas series:
a.values
The result of accessing the values attribute of a pandas series is a numpy array!
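For completeness, a minimal sketch of getting the underlying array out of a series. Note that `values` is an attribute (no parentheses); `to_numpy()` is the method form that the pandas documentation recommends.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])

raw = a.values            # attribute access, no parentheses
also_raw = a.to_numpy()   # method form with the same result here

# From here on, plain numpy operations apply with no pandas overhead.
doubled = raw * 2
```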
Answer by zhqiat
Pandas in general is used for financial time series data/economics data (it has a lot of built-in helpers to handle financial data).
Numpy is a fast way to handle large multidimensional arrays for scientific computing (scipy also helps). Together with scipy, it also makes it easy to handle what are called sparse arrays (large arrays with very little data in them).
One of the key advantages of numpy is its C bindings, which allow for massive speed-ups in large array computations, along with built-in functions for things like linear algebra and signal processing.
Both packages address some of the deficiencies that were identified in Python's existing built-in data types. As a general rule of thumb, with incomplete real-world data (NaNs, outliers, etc.), you will end up needing to write all kinds of functions that address these issues; with the above packages you can build on the work of others. If your program generates its data internally, you can probably use the simpler native data structures (not just python dictionaries).
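To make the point about incomplete data concrete, here is a small sketch that computes a mean over made-up readings with one missing value, first by hand and then with each package.

```python
import math
import numpy as np
import pandas as pd

readings = [1.0, float("nan"), 3.0]  # made-up data with one missing value

# Pure Python: you write the missing-value filter yourself.
valid = [x for x in readings if not math.isnan(x)]
mean_py = sum(valid) / len(valid)

# numpy: np.mean propagates NaN, while np.nanmean ignores it.
arr = np.array(readings)
mean_with_nan = np.mean(arr)
mean_np = np.nanmean(arr)

# pandas: Series.mean skips NaN by default; fillna replaces missing values.
s = pd.Series(readings)
mean_pd = s.mean()
filled = s.fillna(0.0)
```

All three approaches agree on the mean of the valid values; the difference is how much of the handling you have to write yourself.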
See the post by the author of Pandas for some comparison.
Answer by CrazyElf
Numpy is very fast with arrays, matrices and math. Pandas series have indexes, which is sometimes very useful for sorting or joining data. Dictionaries are slow beasts, but sometimes they are very handy too. So, as was already mentioned, which data types and tools to use depends on the use case.
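As a sketch of the sorting and joining that an index makes convenient, assuming two small made-up data sets keyed by name:

```python
import pandas as pd

# A plain dictionary is handy for simple key lookups.
prices_dict = {"pear": 3, "apple": 2, "fig": 5}
fig_price = prices_dict["fig"]

# Building a Series from a dict turns the keys into the index.
prices = pd.Series(prices_dict)
stock = pd.Series({"apple": 10, "fig": 4})

# Sorting by label or by value is a single call.
by_label = prices.sort_index()
by_value = prices.sort_values(ascending=False)

# Joining on the index: keep only the labels present in both series.
joined = pd.concat([prices, stock], axis=1, keys=["price", "stock"], join="inner")
```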