为什么 Pandas 默认遍历 DataFrame 列？

Question

提问by trvrm

Trying to understand the design rationale behind some of Pandas' features.

试图了解 Pandas 某些功能背后的设计原理。

If I have a DataFrame with 3560 rows and 18 columns, then

如果我有一个 3560 行 18 列的 DataFrame，那么

len(frame)

is 3560, but

是 3560，但是

len([a for a in frame])

is 18.

是 18。

Maybe this feels natural to someone coming from R; to me it doesn't feel very 'Pythonic'. Is there an introduction to the underlying design rationales for Pandas somewhere?

也许这对于来自 R 的人来说感觉很自然；对我来说，它感觉不是很“Pythonic”。是否在某处介绍了 Pandas 的基本设计原理？

Answer 1

回答by unutbu

A DataFrame is primarily a column-based data structure. Under the hood, the data inside the DataFrame is stored in blocks. Roughly speaking there is one block for each dtype. Each column has one dtype. So accessing a column can be done by selecting the appropriate column from a single block. In contrast, selecting a single row requires selecting the appropriate row from each block and then forming a new Series and copying the data from each block's row into the Series. Thus, iterating through rows of a DataFrame is (under the hood) not as natural a process as iterating through columns.

DataFrame 主要是基于列的数据结构。在幕后，DataFrame 内的数据存储在块中。粗略地说，每个 dtype 有一个块。 每列都有一个 dtype。因此，可以通过从单个块中选择适当的列来访问列。相比之下，选择单行需要从每个块中选择合适的行，然后形成一个新的系列并将每个块行中的数据复制到系列中。因此，遍历 DataFrame 的行（在幕后）不像遍历列那样自然。

If you need to iterate through the rows, you still can, however, by calling df.iterrows(). You should avoid using df.iterrowsif possible for the same reason why it's unnatural -- it requires copying which makes the process slower than iterating through columns.

但是，如果您需要遍历行，您仍然可以通过调用df.iterrows(). 您应该尽可能避免使用df.iterrows，原因与它不自然的原因相同——它需要复制，这使得该过程比遍历列更慢。

Answer 2

回答by chrisb

There's a decent explanation in the docs- iteration for Pandas DataFrames is meant to be "dict-like," so the iteration is over the keys (the columns).

文档中有一个不错的解释- Pandas DataFrames 的迭代意味着“类似 dict”，所以迭代是在键（列）上进行的。

Arguably it's a little confusing that iteration for Series is over the values, but as the docs note, that's because they are are more "array-like".

可以说，系列的迭代超过值有点令人困惑，但正如文档所指出的那样，这是因为它们更“类似数组”。

为什么 Pandas 默认遍历 DataFrame 列？

提问by trvrm

回答by unutbu

回答by chrisb

相关推荐

最近更新

标签

为什么 Pandas 默认遍历 DataFrame 列？

提问by trvrm

回答by unutbu

回答by chrisb

相关推荐

Pandas：为什么 pandas.Series.std() 与 numpy.std() 不同

pandas 使用 Seaborn FacetGrid 绘制时间序列

pandas 使用 np.where 但如果条件为 False 则保持现有值

pandas 从对象创建数据框

相关推荐

最近更新

标签