Python Pandas: iterate over rows and access column names

Question

Asked by edesz

I am trying to iterate over the rows of a Python Pandas dataframe. Within each row of the dataframe, I am trying to refer to each value along the row by its column name.

Here is what I have:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
print(df)
          A         B         C         D
0  0.351741  0.186022  0.238705  0.081457
1  0.950817  0.665594  0.671151  0.730102
2  0.727996  0.442725  0.658816  0.003515
3  0.155604  0.567044  0.943466  0.666576
4  0.056922  0.751562  0.135624  0.597252
5  0.577770  0.995546  0.984923  0.123392
6  0.121061  0.490894  0.134702  0.358296
7  0.895856  0.617628  0.722529  0.794110
8  0.611006  0.328815  0.395859  0.507364
9  0.616169  0.527488  0.186614  0.278792

I used this approach to iterate, but it only gives me part of the solution: after selecting a row in each iteration, how do I access row elements by their column name?

Here is what I am trying to do:

for row in df.iterrows():
    print(row.loc[0,'A'])
    print(row.A)
    print(row.index())

My understanding is that the row is a Pandas Series, but I have no way to index into the Series.

Is it possible to use column names while simultaneously iterating over rows?

Answer 1

Answered by Steven G

I also like itertuples()

for row in df.itertuples():
    print(row.A)
    print(row.Index)

Since each row is a named tuple, this should be MUCH faster if you only need to access the values on each row.

Speed comparison:

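import time
import pandas as pd

df = pd.DataFrame([x for x in range(1000*1000)], columns=['A'])

# iterrows(): builds a Series for every row
st = time.time()
for index, row in df.iterrows():
    row.A
print(time.time() - st)
45.05799984931946

# itertuples(): yields one named tuple per row
st = time.time()
for row in df.itertuples():
    row.A
print(time.time() - st)
0.48400020599365234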
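
When the column name is only known at run time (held in a variable, say), the same named-tuple rows can be read with getattr. This is a minimal sketch rather than part of the answer: the small frame and the `pairs` list are made up for illustration, and it assumes every column name is a valid Python identifier.

```python
import pandas as pd

# Illustrative frame, not the one from the answer above.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

pairs = []
for row in df.itertuples():
    for col in df.columns:
        # getattr looks up the tuple field named by the string in `col`
        pairs.append((row.Index, col, getattr(row, col)))

print(pairs)
```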

Answer 2

Answered by Psidom

Each item yielded by iterrows() is not a Series but a tuple of (index, Series), so you can unpack the tuple in the for loop like so:

for (idx, row) in df.iterrows():
    print(row.loc['A'])
    print(row.A)
    print(row.index)

#0.890618586836
#0.890618586836
#Index(['A', 'B', 'C', 'D'], dtype='object')
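
If you want every column name together with its value, the Series from iterrows() can itself be iterated with items(). The sketch below uses a small made-up frame (not the one from the question) and also shows one side effect worth knowing about: because each row becomes a single Series, an integer column next to a float column comes back as float.

```python
import pandas as pd

# Illustrative frame with one integer and one float column.
df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5]})

seen = []
for idx, row in df.iterrows():
    # row is a Series whose index holds the column names
    for col_name, value in row.items():
        seen.append((idx, col_name, value))

print(seen)

# The last row Series has a single dtype, so 'A' was upcast to float
print(row.dtype)
```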

Answer 3

Answered by Romain Capron

How to iterate efficiently?

If you really have to iterate over a pandas dataframe, you will probably want to avoid iterrows(). There are several methods, and the usual iterrows() is far from the best: itertuples() can be 100 times faster.

In short:

As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 columns. See point (3).
Otherwise, use df.itertuples(), except if your columns have special characters such as spaces or '-'. See point (2).
It is possible to use itertuples() even if your dataframe has strange column names, by using the last example. See point (4).
Only use iterrows() if you cannot use any of the previous solutions. See point (1).

Different methods to iterate over rows in a pandas dataframe:

Generate a random dataframe with a million rows and 4 columns:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)

1) The usual iterrows() is convenient but very slow:

import time

# time.clock() was removed in Python 3.8; perf_counter() replaces it
start_time = time.perf_counter()
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))

2) The default itertuples() is already much faster, but it does not work with column names such as My Col-Name is very Strange (avoid this method if column names are repeated or if a column name cannot be simply converted to a Python variable name):

start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
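
To see what happens with an awkward column name, here is a small sketch on a made-up two-column frame. A name that is not a valid Python identifier is replaced by a positional field name, so attribute access by the original name is not possible.

```python
import pandas as pd

# Illustrative frame: the first column name is not a valid identifier.
df = pd.DataFrame({'My Col-Name': [1, 2], 'B': [3, 4]})

first = next(df.itertuples())

# 'Index' is field 0, so the renamed column becomes '_1'
print(first._fields)

# The value is still reachable through the positional field name
value = first._1
print(value, first.B)
```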

3) itertuples() with name=None is even faster, but less convenient because you have to define a variable per column:

start_time = time.perf_counter()
result = 0
for _, col1, col2, col3, col4 in df.itertuples(name=None):
    result += max(col2, col3)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

4) Finally, indexing the named itertuples() rows by column position is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange:

start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
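
One possible variation on point 4, offered as a sketch and not as part of the answer's benchmark: look the column positions up once before the loop instead of calling get_loc for every row. The `pos` dictionary and the tiny frame are hypothetical, and the approach assumes column names are unique.

```python
import pandas as pd

# Illustrative frame; 'My Col' is not a valid attribute name.
df = pd.DataFrame({'A': [1, 2], 'B': [5, 1], 'My Col': [3, 4]})

# Hypothetical helper: map each column name to its position, once.
pos = {name: i for i, name in enumerate(df.columns)}

result = 0
# index=False and name=None give plain tuples holding only the columns
for row in df.itertuples(index=False, name=None):
    result += max(row[pos['B']], row[pos['My Col']])

print(result)
```

No timing is claimed here; whether this helps on a given frame would need to be measured.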

Output:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519

It's the same as my answer here

This article is a very interesting comparison between iterrows and itertuples

Answer 4

Answered by Avik Das

# na_rm is the answerer's DataFrame; range starts at 1, so the first column is skipped
for i in range(1, len(na_rm.columns)):
    print("column name:", na_rm.columns[i])

Output:

column name: seretide_price
column name: symbicort_mkt_shr
column name: symbicort_price
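
The same listing of column names can be written without an index counter, since df.columns is itself iterable. A minimal sketch on a made-up frame (the column names below are placeholders, not the answerer's data):

```python
import pandas as pd

# Illustrative frame with placeholder column names.
df = pd.DataFrame({'id': [1], 'price': [9.5], 'share': [0.2]})

# Iterate over the column labels directly
names = [col for col in df.columns]

# Slice the Index to skip the first column, as the loop above does
skipping_first = list(df.columns[1:])

for col in skipping_first:
    print("column name:", col)
```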
