Python: How do I get the row count of a pandas DataFrame?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/15943769/

Time: 2020-08-18 21:25:57  Source: igfitidea

How do I get the row count of a pandas DataFrame?

python, pandas, dataframe

Asked by yemu

I'm trying to get the number of rows of dataframe df with Pandas, and here is my code.

Method 1:

total_rows = df.count
print total_rows +1

Method 2:

total_rows = df['First_columnn_label'].count
print total_rows +1

Both the code snippets give me this error:

TypeError: unsupported operand type(s) for +: 'instancemethod' and 'int'

What am I doing wrong?

Accepted answer by root

You can use the .shape property or just len(DataFrame.index). However, there are notable performance differences (len(DataFrame.index) is fastest):

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.arange(12).reshape(4,3))

In [4]: df
Out[4]: 
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

In [5]: df.shape
Out[5]: (4, 3)

In [6]: timeit df.shape
2.77 μs ± 644 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: timeit df[0].count()
348 μs ± 1.31 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: len(df.index)
Out[8]: 4

In [9]: timeit len(df.index)
990 ns ± 4.97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

EDIT: As @Dan Allen noted in the comments, len(df.index) and df[0].count() are not interchangeable, as count excludes NaNs.
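A minimal sketch of two points, assuming a small made-up frame (the column name "x" and its values are illustrative, not taken from the question): count is a method, so it has to be called with parentheses before you can add to its result, and it skips NaN while len does not.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in column "x".
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0]})

# Without parentheses this is only the bound method, not a number,
# which is why adding 1 to it raises a TypeError.
method_only = df["x"].count

rows_total = len(df.index)       # counts every row, NaN included
rows_non_null = df["x"].count()  # counts only non-NaN values

print(rows_total, rows_non_null)
```

If the column has no missing values the two numbers agree, which is probably why the two calls are easy to confuse.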

Answer by Dr. Jan-Philip Gehrcke

Use len(df). This works as of pandas 0.11 or maybe even earlier.

__len__() is currently (0.12) documented with "Returns length of index". Timing info, set up the same way as in root's answer:

In [7]: timeit len(df.index)
1000000 loops, best of 3: 248 ns per loop

In [8]: timeit len(df)
1000000 loops, best of 3: 573 ns per loop

Due to one additional function call, it is a bit slower than calling len(df.index) directly, but this should not matter in most use cases.

Answer by Nik

Apart from the above answers, you can use df.axes to get a tuple with the row and column indexes, and then use the len() function:

total_rows=len(df.axes[0])
total_cols=len(df.axes[1])

Answer by Nasir Shah

Suppose df is your dataframe; then:

count_row = df.shape[0]  # number of rows
count_col = df.shape[1]  # number of columns

Or, more succinctly,

r, c = df.shape

Answer by Memin

TL;DR

use len(df)

len() is your friend; it gives the row count as len(df).

Alternatively, you can access all rows through df.index and all columns through df.columns. Just as len(anyList) returns the length of a list, len(df.index) returns the number of rows and len(df.columns) returns the number of columns.

Or, you can use df.shape, which returns the number of rows and columns together. If you only want the number of rows, use df.shape[0]; for only the number of columns, use df.shape[1].
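The options above can be compared side by side. This is a small sketch on a made-up 3x2 frame (the column names "a" and "b" are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical 3x2 frame used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

n_rows = len(df)                   # row count via len()
n_rows_idx = len(df.index)         # row count via the index
n_cols = len(df.columns)           # column count
shape_rows, shape_cols = df.shape  # both at once

print(n_rows, n_rows_idx, n_cols, shape_rows, shape_cols)
```

All of the row counts agree, so the choice is mostly about readability.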

Answer by Catbuilts

I came to pandas from an R background, and I find that pandas is more complicated when it comes to selecting rows or columns. I had to wrestle with it for a while before I found some ways to deal with it:

getting the number of columns:

len(df.columns)
## Here:
# df is your DataFrame.
# df.columns returns an Index object holding the column labels of df.
# Then len() gets its length.

getting the number of rows:

len(df.index) #It's similar.

Answer by Vlad

For a dataframe df, this prints a comma-formatted row count, which is handy while exploring data:

def nrow(df):
    print("{:,}".format(df.shape[0]))

Example:

nrow(my_df)
12,456,789

Answer by debo

...building on Jan-Philip Gehrcke's answer.

len(df) or len(df.index) is faster than df.shape[0], and the code shows why: df.shape is a @property that runs a DataFrame method calling len twice.

df.shape??
Type:        property
String form: <property object at 0x1127b33c0>
Source:     
# df.shape.fget
@property
def shape(self):
    """
    Return a tuple representing the dimensionality of the DataFrame.
    """
    return len(self.index), len(self.columns)

And beneath the hood of len(df):

df.__len__??
Signature: df.__len__()
Source:   
    def __len__(self):
        """Returns length of info axis, but here we use the index """
        return len(self.index)
File:      ~/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py
Type:      instancemethod

len(df.index) will be slightly faster than len(df) since it has one fewer function call, but len(df) is still always faster than df.shape[0].
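Outside IPython, the comparison can be approximated with the standard-library timeit module. This is a rough sketch, not a rigorous benchmark: the lambda wrappers add their own call overhead, and the absolute numbers (and possibly the ordering) depend on the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Same small frame as in root's answer.
df = pd.DataFrame(np.arange(12).reshape(4, 3))

candidates = {
    "len(df.index)": lambda: len(df.index),
    "len(df)": lambda: len(df),
    "df.shape[0]": lambda: df.shape[0],
}

for label, func in candidates.items():
    # Total seconds for 100,000 calls; lower is faster.
    seconds = timeit.timeit(func, number=100_000)
    print(f"{label:15s} {seconds:.4f} s")
```

Whatever the timings, all three expressions return the same row count.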

Answer by Allen

In case you want to get the row count in the middle of a chained operation, you can use:

df.pipe(len)

Example:

import numpy as np
import pandas as pd

row_count = (
    pd.DataFrame(np.random.rand(3, 4))
    .reset_index()
    .pipe(len)  # len() of the DataFrame, i.e. the number of rows
)

This can be useful if you don't want to put a long statement inside a len() function.

You could use __len__() instead, but __len__() looks a bit weird.
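A related pattern, sketched here under assumptions: if you want to see the row count at several points without ending the chain, you can pipe through a small pass-through helper. The helper name log_rows, the column names and the filter are all hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd


def log_rows(frame, label):
    """Hypothetical helper: print the row count and pass the frame through."""
    print(f"{label}: {len(frame)} rows")
    return frame


result = (
    pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
    .pipe(log_rows, "start")
    .query("a > 0")  # drops the first row, where a == 0
    .pipe(log_rows, "after filter")
)
```

Because log_rows returns the frame unchanged, the chain still ends with a DataFrame rather than an integer.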

Answer by cs95

How do I get the row count of a pandas DataFrame?

This table summarises the different situations in which you'd want to count something in a DataFrame (or Series, for completeness), along with the recommended method(s).

[Image not available: summary table of counting situations and recommended methods]

Footnotes

  1. DataFrame.count returns counts for each column as a Series, since the non-null count varies by column.
  2. DataFrameGroupBy.size returns a Series, since all columns in the same group share the same row count.
  3. DataFrameGroupBy.count returns a DataFrame, since the non-null count could differ across columns in the same group. To get the group-wise non-null count for a specific column, use df.groupby(...)['x'].count(), where "x" is the column to count.

Minimal Code Examples

Below, I show examples of each of the methods described in the table above. First, the setup -

df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()

df

   A    B
0  a    x
1  a    x
2  b  NaN
3  b    x
4  c  NaN

s

0      x
1      x
2    NaN
3      x
4    NaN
Name: B, dtype: object

Row Count of a DataFrame: len(df), df.shape[0], or len(df.index)

len(df)
# 5

df.shape[0]
# 5

len(df.index)
# 5

It seems silly to compare the performance of constant time operations, especially when the difference is on the level of "seriously, don't worry about it". But this seems to be a trend with other answers, so I'm doing the same for completeness.

Of the 3 methods above, len(df.index) (as mentioned in other answers) is the fastest.

Note

  • All the methods above are constant time operations as they are simple attribute lookups.
  • df.shape (similar to ndarray.shape) is an attribute that returns a tuple of (# Rows, # Cols). For example, df.shape returns (5, 2) for the example here.

Column Count of a DataFrame: df.shape[1], len(df.columns)

df.shape[1]
# 2

len(df.columns)
# 2

Analogous to len(df.index), len(df.columns) is the faster of the two methods (but takes more characters to type).

Row Count of a Series: len(s), s.size, len(s.index)

len(s)
# 5

s.size
# 5

len(s.index)
# 5

s.size and len(s.index) are about the same in terms of speed. But I recommend len(s).

Note
size is an attribute, and it returns the number of elements (= count of rows for any Series). DataFrames also define a size attribute, which returns the same result as df.shape[0] * df.shape[1].
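A quick check of the note on size, rebuilding the setup frame from this answer so the snippet runs on its own. Note that size counts NaN entries too, unlike count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()

series_size = s.size   # number of elements in the Series, NaN included
frame_size = df.size   # rows * columns for a DataFrame

print(series_size, frame_size)
```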

Non-Null Row Count: DataFrame.count and Series.count

The methods described here only count non-null values (meaning NaNs are ignored).

Calling DataFrame.count will return non-NaN counts for each column:

df.count()

A    5
B    3
dtype: int64

For Series, use Series.count to similar effect:

s.count()
# 3

Group-wise Row Count: GroupBy.size

For DataFrames, use DataFrameGroupBy.size to count the number of rows per group.

df.groupby('A').size()

A
a    2
b    2
c    1
dtype: int64

Similarly, for Series, you'll use SeriesGroupBy.size.

s.groupby(df.A).size()

A
a    2
b    2
c    1
Name: B, dtype: int64

In both cases, a Series is returned. This makes sense for DataFrames as well, since all columns within a group share the same row count.

Group-wise Non-Null Row Count: GroupBy.count

Similar to above, but use GroupBy.count, not GroupBy.size. Note that size always returns a Series, while count returns a Series if called on a specific column, or else a DataFrame.

The following methods return the same thing:

df.groupby('A')['B'].size()
df.groupby('A').size()

A
a    2
b    2
c    1
Name: B, dtype: int64

Meanwhile, for count, we have

df.groupby('A').count()

   B
A   
a  2
b  1
c  0

...called on the entire GroupBy object, versus

df.groupby('A')['B'].count()

A
a    2
b    1
c    0
Name: B, dtype: int64

Called on a specific column.
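To see the group-wise row count and the non-null count next to each other, one option (a sketch that goes beyond the original answer) is to pass both names to agg on the selected column. The setup frame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})

# 'size' counts all rows per group; 'count' counts only non-null values of B.
summary = df.groupby('A')['B'].agg(['size', 'count'])
print(summary)
```

The result is a DataFrame indexed by group, with one column per aggregation, so the difference between the two columns is the number of missing values of B in each group.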
