Python Pandas iterate over rows and access column names
Original question: http://stackoverflow.com/questions/43619896/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow and keep the same license.
Asked by edesz
I am trying to iterate over the rows of a Python Pandas dataframe. Within each row, I want to refer to each value by its column name.
Here is what I have:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 4), columns=list('ABCD'))
print(df)
          A         B         C         D
0  0.351741  0.186022  0.238705  0.081457
1  0.950817  0.665594  0.671151  0.730102
2  0.727996  0.442725  0.658816  0.003515
3  0.155604  0.567044  0.943466  0.666576
4  0.056922  0.751562  0.135624  0.597252
5  0.577770  0.995546  0.984923  0.123392
6  0.121061  0.490894  0.134702  0.358296
7  0.895856  0.617628  0.722529  0.794110
8  0.611006  0.328815  0.395859  0.507364
9  0.616169  0.527488  0.186614  0.278792
I used this approach to iterate, but it gives me only part of the solution: after selecting a row in each iteration, how do I access row elements by their column name?
Here is what I am trying to do:
for row in df.iterrows():
    print(row.loc[0, 'A'])
    print(row.A)
    print(row.index())
My understanding is that each row is a Pandas Series, but I cannot find a way to index into it.
Is it possible to use column names while simultaneously iterating over rows?
Answered by Steven G
I also like itertuples():
for row in df.itertuples():
    print(row.A)
    print(row.Index)
Since row is a named tuple, this should be MUCH faster if you only need to access the values in each row.
Speed test:
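import time
import pandas as pd

df = pd.DataFrame(list(range(1000 * 1000)), columns=['A'])

# Time iterrows(): each row is built as a Series
st = time.time()
for index, row in df.iterrows():
    row.A
print(time.time() - st)
# 45.05799984931946

# Time itertuples(): each row is a named tuple
st = time.time()
for row in df.itertuples():
    row.A
print(time.time() - st)
# 0.48400020599365234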
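As a small illustration of what "named tuple" means here, the sketch below inspects a single row from itertuples(). The two-column frame and its values are made up for the example; `_fields`, `_asdict()` and `getattr` are standard named tuple features rather than anything specific to this answer.

```python
import pandas as pd

# Illustrative data only; column names and values are assumptions.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

first = next(df.itertuples())

fields = first._fields          # field names: the index first, then the columns
by_attr = first.A               # attribute access by column name
by_name = getattr(first, "B")   # same, when the column name is held in a variable
as_dict = first._asdict()       # mapping of field name -> value

print(fields, by_attr, by_name, as_dict)
```

getattr is handy when the column name is only known at run time, which is close to what the question asks for.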
Answered by Psidom
The item from iterrows() is not a Series, but a tuple of (index, Series), so you can unpack the tuple in the for loop like so:
for idx, row in df.iterrows():
    print(row.loc['A'])
    print(row.A)
    print(row.index)
    # 0.890618586836
    # 0.890618586836
    # Index(['A', 'B', 'C', 'D'], dtype='object')
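To make the (index, Series) structure concrete, here is a minimal sketch that pulls one item from iterrows() and checks its parts. The frame and its values are invented for the example.

```python
import pandas as pd

# Illustrative data only; column names and values are assumptions.
df = pd.DataFrame({"A": [0.5, 1.5], "B": [2.0, 3.0]})

item = next(df.iterrows())       # first item produced by iterrows()
idx, row = item                  # unpack into index label and row Series

is_tuple = isinstance(item, tuple)
is_series = isinstance(row, pd.Series)
value_by_label = row["A"]        # access a value by column name
labels = list(row.index)         # the Series index holds the column names

# Walk the row as (column name, value) pairs
for col, value in row.items():
    print(idx, col, value)
```

Inside the loop, row["A"], row.A and row.loc["A"] all refer to the same value.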
Answered by Romain Capron
How to iterate efficiently?
If you really have to iterate over a pandas dataframe, you will probably want to avoid iterrows(). There are several methods, and the usual iterrows() is far from the best: itertuples() can be 100 times faster.
In short:
- As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 columns. See point (3).
- Otherwise, use df.itertuples(), except if your columns have special characters such as spaces or '-'. See point (2).
- It is possible to use itertuples() even if your dataframe has strange columns by using the last example. See point (4).
- Only use iterrows() if you cannot use the previous solutions. See point (1).
Different methods to iterate over rows in a pandas dataframe:
Generate a random dataframe with a million rows and 4 columns:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)
1) The usual iterrows() is convenient but damn slow:
import time  # time.clock() was removed in Python 3.8; time.perf_counter() is used instead

start_time = time.perf_counter()
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
2) The default itertuples() is already much faster, but it does not work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name):
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
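The limitation in point (2) can be seen directly on a tiny frame. In this sketch (illustrative data, with the strange column name taken from the text), a column name that is not a valid Python identifier is replaced by a positional field name in the named tuple, so attribute access by the original name is not available.

```python
import pandas as pd

# Illustrative data only; the values are assumptions.
df = pd.DataFrame({"My Col-Name is very Strange": [10, 20], "B": [1, 2]})

row = next(df.itertuples())

fields = row._fields     # the strange name is replaced by a positional name
strange_value = row._1   # field at position 1 (position 0 is the index)
normal_value = row.B     # valid identifiers keep their name

print(fields, strange_value, normal_value)
```

Relying on names such as _1 is fragile if the column order changes, which is one reason to prefer the positional approaches in points (3) and (4) for such frames.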
3) itertuples() with name=None is even faster, but not really convenient, as you have to define a variable per column:
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for (_, col1, col2, col3, col4) in df.itertuples(name=None):
    result += max(col2, col3)
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
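One way to soften the "one variable per column" inconvenience of point (3) is to select only the needed columns before iterating. This is a sketch of that variant, not something the answer benchmarked; the smaller frame, the seeded generator and the vectorized cross-check are assumptions added for the example.

```python
import numpy as np
import pandas as pd

# Smaller illustrative frame; size and seed are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(1000, 4)), columns=list("ABCD"))

result = 0
# Only B and C are iterated, so each plain tuple unpacks into two variables.
for b, c in df[["B", "C"]].itertuples(index=False, name=None):
    result += max(b, c)

# Cross-check against a column-wise computation of the same quantity.
expected = int(df[["B", "C"]].max(axis=1).sum())
print(result, expected)
```

With index=False the index is not included in the tuples, so no placeholder variable is needed for it.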
4) Finally, the named itertuples(), indexed by column position, is slower than the previous point, but you do not have to define a variable per column, and it works with column names such as My Col-Name is very Strange:
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
Output:
         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]
1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519
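Point (4) calls df.columns.get_loc() for every row. A possible variant, sketched here under the assumption that the column positions do not change during the loop, looks the positions up once beforehand. It was not part of the timings above, so no speed claim is made; the frame, seed and column names are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one column name that is not a valid identifier.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(1000, 2)),
    columns=["My Col-Name is very Strange", "C"],
)

# Look up the positions once, outside the loop.
pos_strange = df.columns.get_loc("My Col-Name is very Strange")
pos_c = df.columns.get_loc("C")

result = 0
for row in df.itertuples(index=False):
    # Positional indexing works regardless of how the fields were named.
    result += max(row[pos_strange], row[pos_c])

expected = int(df.max(axis=1).sum())  # cross-check of the same quantity
print(result, expected)
```

Because index=False is used, tuple positions match the column positions returned by get_loc().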
It's the same as my answer here
This article is a very interesting comparison between iterrows and itertuples
Answered by Avik Das
# na_rm is the answerer's dataframe; starting at 1 skips its first column
for i in range(1, len(na_rm.columns)):
    print("column name:", na_rm.columns[i])
Output:
column name: seretide_price
column name: symbicort_mkt_shr
column name: symbicort_price
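The same loop can be written without an integer counter by iterating over the column Index directly. In this sketch na_rm is a stand-in built for the example, since the answer does not show how it was created; the first column, month, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the answerer's na_rm dataframe.
na_rm = pd.DataFrame(
    {
        "month": [1, 2],  # assumed first column, skipped below
        "seretide_price": [10.0, 11.0],
        "symbicort_mkt_shr": [0.4, 0.5],
        "symbicort_price": [9.0, 9.5],
    }
)

# Slice off the first column, then iterate over the remaining names.
names = list(na_rm.columns[1:])
for name in names:
    print("column name:", name)
```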