Python: How do I get the row count of a pandas DataFrame?
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/15943769/
How do I get the row count of a pandas DataFrame?
Asked by yemu
I'm trying to get the number of rows of dataframe df with Pandas, and here is my code.
Method 1:
total_rows = df.count
print total_rows +1
Method 2:
total_rows = df['First_columnn_label'].count
print total_rows +1
Both the code snippets give me this error:
TypeError: unsupported operand type(s) for +: 'instancemethod' and 'int'
What am I doing wrong?
Accepted answer by root
You can use the .shape property or just len(DataFrame.index). However, there are notable performance differences (len(DataFrame.index) is fastest):
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.arange(12).reshape(4,3))
In [4]: df
Out[4]:
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
In [5]: df.shape
Out[5]: (4, 3)
In [6]: timeit df.shape
2.77 μs ± 644 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: timeit df[0].count()
348 μs ± 1.31 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: len(df.index)
Out[8]: 4
In [9]: timeit len(df.index)
990 ns ± 4.97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
EDIT: As @Dan Allen noted in the comments, len(df.index) and df[0].count() are not interchangeable, as count excludes NaNs.
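To make the NaN caveat concrete, here is a minimal sketch. The frame below is a made-up example (not the one from the timings above) with one missing value in column 0, so the two calls disagree:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 0 contains one NaN.
df = pd.DataFrame({0: [1.0, np.nan, 3.0, 4.0], 1: [5, 6, 7, 8]})

n_rows = len(df.index)      # counts every row, NaN or not
n_non_null = df[0].count()  # counts only the non-NaN entries of column 0

print(n_rows)      # 4
print(n_non_null)  # 3
```

So count() answers "how many values are present in this column", which only matches the row count when the column has no missing values.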
Answer by Dr. Jan-Philip Gehrcke
Use len(df). This works as of pandas 0.11 or maybe even earlier.
__len__() is currently (0.12) documented with "Returns length of index". Timing info, set up the same way as in root's answer:
In [7]: timeit len(df.index)
1000000 loops, best of 3: 248 ns per loop
In [8]: timeit len(df)
1000000 loops, best of 3: 573 ns per loop
Due to one additional function call, it is a bit slower than calling len(df.index) directly, but this should not matter in most use cases.
Answer by Nik
Apart from the above answers, you can use df.axes to get the tuple with row and column indexes and then use the len() function:
total_rows=len(df.axes[0])
total_cols=len(df.axes[1])
Answer by Nasir Shah
Suppose df is your dataframe, then:
count_row = df.shape[0]  # number of rows
count_col = df.shape[1]  # number of columns
Or, more succinctly,
r, c = df.shape
Answer by Memin
TL;DR
Use len(df).
len() is your friend; it gives the row count as len(df).
Alternatively, you can access all rows with df.index and all columns with df.columns. Since len(anyList) gives the number of items in a list, use len(df.index) to get the number of rows and len(df.columns) to get the number of columns.
Or, you can use df.shape, which returns the number of rows and columns together. If you only want the number of rows, use df.shape[0]; for the number of columns only, use df.shape[1].
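A short sketch of the options described in this answer, side by side. The frame and its column names are invented for illustration:

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows = len(df)                # row count via len()
n_rows_idx = len(df.index)      # same thing, via the row index
n_cols = len(df.columns)        # column count via the column index
shape_rows, shape_cols = df.shape  # both at once

print(n_rows, n_rows_idx, n_cols)  # 3 3 2
print(shape_rows, shape_cols)      # 3 2
```

All of these agree with each other; which one to pick is mostly a matter of readability.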
Answer by Catbuilts
I came to pandas from an R background, and I find pandas more complicated when it comes to selecting rows or columns. I had to wrestle with it for a while, then I found some ways to deal with it:
Getting the number of columns:
len(df.columns)
## Here:
# df is your DataFrame.
# df.columns returns an Index holding the column labels of df.
# Then, len() gets its length.
Getting the number of rows:
len(df.index) #It's similar.
Answer by Vlad
For a DataFrame df, a helper that prints a comma-formatted row count, handy while exploring data:
def nrow(df):
    print("{:,}".format(df.shape[0]))
Example:
nrow(my_df)
12,456,789
Answer by debo
...building on Jan-Philip Gehrcke's answer.
The reason len(df) or len(df.index) is faster than df.shape[0] is visible in the code: df.shape is a @property that runs a DataFrame method calling len twice.
df.shape??
Type: property
String form: <property object at 0x1127b33c0>
Source:
# df.shape.fget
@property
def shape(self):
    """
    Return a tuple representing the dimensionality of the DataFrame.
    """
    return len(self.index), len(self.columns)
And beneath the hood of len(df):
df.__len__??
Signature: df.__len__()
Source:
def __len__(self):
    """Returns length of info axis, but here we use the index """
    return len(self.index)
File: ~/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py
Type: instancemethod
len(df.index) will be slightly faster than len(df) since it has one less function call, but both are faster than df.shape[0].
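If you want to check the ordering on your own machine, something along these lines should work. This is only a rough sketch using the standard timeit module; the lambda wrappers add their own call overhead, and absolute numbers will vary with the pandas version and hardware, so treat the output as indicative only:

```python
import timeit

import numpy as np
import pandas as pd

# Same small frame as in root's answer.
df = pd.DataFrame(np.arange(12).reshape(4, 3))

candidates = {
    "len(df.index)": lambda: len(df.index),
    "len(df)": lambda: len(df),
    "df.shape[0]": lambda: df.shape[0],
}

# All three should report the same row count.
results = {name: fn() for name, fn in candidates.items()}

# Total seconds for 100,000 calls of each candidate.
timings = {name: timeit.timeit(fn, number=100_000) for name, fn in candidates.items()}

for name, seconds in timings.items():
    # Convert total seconds into nanoseconds per call.
    print(f"{name:15s} {seconds / 100_000 * 1e9:8.1f} ns per call")
```

Whatever the exact figures turn out to be, the differences are on the order of nanoseconds to microseconds per call.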
Answer by Allen
In case you want to get the row count in the middle of a chained operation, you can use:
df.pipe(len)
Example:
row_count = (
    pd.DataFrame(np.random.rand(3, 4))
    .reset_index()
    .pipe(len)
)
This can be useful if you don't want to put a long statement inside a len() function.
You could use __len__() instead, but __len__() looks a bit weird.
Answer by cs95
How do I get the row count of a pandas DataFrame?
This table summarises the different situations in which you'd want to count something in a DataFrame (or Series, for completeness), along with the recommended method(s).
Footnotes
- DataFrame.count returns counts for each column as a Series, since the non-null count varies by column.
- DataFrameGroupBy.size returns a Series, since all columns in the same group share the same row count.
- DataFrameGroupBy.count returns a DataFrame, since the non-null count could differ across columns in the same group. To get the group-wise non-null count for a specific column, use df.groupby(...)['x'].count(), where "x" is the column to count.
Minimal Code Examples
Below, I show examples of each of the methods described in the table above. First, the setup:
df = pd.DataFrame({
'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()
df
   A    B
0  a    x
1  a    x
2  b  NaN
3  b    x
4  c  NaN
s
0      x
1      x
2    NaN
3      x
4    NaN
Name: B, dtype: object
Row Count of a DataFrame: len(df), df.shape[0], or len(df.index)
len(df)
# 5
df.shape[0]
# 5
len(df.index)
# 5
It seems silly to compare the performance of constant time operations, especially when the difference is on the level of "seriously, don't worry about it". But this seems to be a trend with other answers, so I'm doing the same for completeness.
Of the 3 methods above, len(df.index) (as mentioned in other answers) is the fastest.
Note
- All the methods above are constant-time operations, as they are simple attribute lookups.
- df.shape (similar to ndarray.shape) is an attribute that returns a tuple of (# Rows, # Cols). For example, df.shape returns (5, 2) for the example here.
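As a quick check of the ndarray.shape analogy, here is a small sketch that rebuilds the five-row setup frame from above and compares its shape with that of the underlying NumPy array (to_numpy() is used here only for the comparison):

```python
import numpy as np
import pandas as pd

# Same setup frame as in this answer: 5 rows, 2 columns.
df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})

frame_shape = df.shape             # (rows, columns) of the DataFrame
array_shape = df.to_numpy().shape  # shape of the equivalent ndarray

print(frame_shape)  # (5, 2)
print(array_shape)  # (5, 2)
```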
Column Count of a DataFrame: df.shape[1], len(df.columns)
df.shape[1]
# 2
len(df.columns)
# 2
Analogous to len(df.index), len(df.columns) is the faster of the two methods (but takes more characters to type).
Row Count of a Series: len(s), s.size, len(s.index)
len(s)
# 5
s.size
# 5
len(s.index)
# 5
s.size and len(s.index) are about the same in terms of speed. But I recommend len(df).
Note
size is an attribute, and it returns the number of elements (= count of rows for any Series). DataFrames also define a size attribute, which returns the same result as df.shape[0] * df.shape[1].
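The difference between size on a Series and on a DataFrame can be sketched like this, reusing the setup frame from above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()

series_size = s.size   # number of elements, i.e. rows (NaNs included)
frame_size = df.size   # rows * columns, NOT the row count

print(series_size)                    # 5
print(frame_size)                     # 10
print(df.shape[0] * df.shape[1])      # 10
```

So size is a safe row count for a Series, but for a DataFrame it counts cells rather than rows.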
Non-Null Row Count: DataFrame.count and Series.count
The methods described here only count non-null values (meaning NaNs are ignored).
Calling DataFrame.count will return non-NaN counts for each column:
df.count()
A 5
B 3
dtype: int64
For Series, use Series.count to similar effect:
s.count()
# 3
Group-wise Row Count: GroupBy.size
For DataFrames, use DataFrameGroupBy.size to count the number of rows per group.
df.groupby('A').size()
A
a 2
b 2
c 1
dtype: int64
Similarly, for Series, you'll use SeriesGroupBy.size.
s.groupby(df.A).size()
A
a 2
b 2
c 1
Name: B, dtype: int64
In both cases, a Series is returned. This makes sense for DataFrames as well, since all columns in a group share the same row count.
Group-wise Non-Null Row Count: GroupBy.count
Similar to above, but use GroupBy.count, not GroupBy.size. Note that size always returns a Series, while count returns a Series if called on a specific column, or else a DataFrame.
The following methods return the same thing:
df.groupby('A')['B'].size()
df.groupby('A').size()
A
a 2
b 2
c 1
Name: B, dtype: int64
Meanwhile, for count, we have
df.groupby('A').count()
   B
A
a  2
b  1
c  0
...called on the entire GroupBy object, versus
df.groupby('A')['B'].count()
A
a 2
b 1
c 0
Name: B, dtype: int64
...called on a specific column.
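To double-check the return types described in this section, a small sketch along these lines can help. It reuses the setup frame from above; the variable names are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})

grouped = df.groupby('A')

size_all = grouped.size()         # rows per group, NaNs included
size_col = grouped['B'].size()    # same counts, via a single column
count_all = grouped.count()       # non-null counts for every other column
count_col = grouped['B'].count()  # non-null counts for column B only

print(type(size_all).__name__)    # Series
print(type(size_col).__name__)    # Series
print(type(count_all).__name__)   # DataFrame
print(type(count_col).__name__)   # Series
```

Group c shows the difference most clearly: size reports 1 row, while count reports 0 because the only value of B in that group is NaN.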


