Python: Loop through a DataFrame one row at a time (pandas)

Question

Asked by Bondeaux

Let's say we have a dataframe with columns A, B and C:

df = pd.DataFrame(columns =('A','B','C'), index=range(1))

The columns hold three rows of numeric values:

0     A     B      C
1    2.1   1.8    1.6
2    2.01  1.81   1.58
3    1.9   1.84   1.52

How does one loop through every row from 1 to 3 and then execute an if statement that assigns some variables, like this:

if B1 > 1.5
    calc_temp   = A1*10
    calc_temp01 = C1*-10
if B2 > 1.5 
    calc_temp   = A2*10
    calc_temp01 = C2*-10
if B3 >1.5
    calc_temp   = A3*10
    calc_temp01 = C3*-10

Is the above even possible? It would have to know a range of some sort, i.e. the full length of the dataset with some kind of counter, yes? The if statement should refer to that specific row.

Answer 1

Answered by jezrael

I think you need iterrows:

for i, row in df.iterrows():
    if row['B'] > 1.5:
        calc_temp   = row['A'] *10
        calc_temp01 = row['C'] *-10
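
As a minimal, self-contained sketch of the loop above, the snippet below rebuilds the three rows from the question and keeps the per-row values instead of overwriting them on each pass. The `results` list and the 1..3 index are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# The three rows from the question; the 1..3 index mirrors its row labels.
df = pd.DataFrame(
    {"A": [2.1, 2.01, 1.9], "B": [1.8, 1.81, 1.84], "C": [1.6, 1.58, 1.52]},
    index=[1, 2, 3],
)

results = []  # hypothetical container for the per-row values
for i, row in df.iterrows():
    # `row` is a Series holding the values of the current row only.
    if row["B"] > 1.5:
        calc_temp = row["A"] * 10
        calc_temp01 = row["C"] * -10
        results.append((i, calc_temp, calc_temp01))

for i, calc_temp, calc_temp01 in results:
    print(i, calc_temp, calc_temp01)
```

No explicit counter is needed: iterrows() yields the index label together with each row, so the if statement always refers to the row currently being visited.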

Answer 2

Answered by Romain Capron

How to iterate efficiently?

If you really have to iterate over a pandas DataFrame, you will probably want to avoid iterrows(). There are several methods, and the usual iterrows() is far from the best: itertuples() can be 100 times faster.

In short:

As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 of them. See point (3).
Otherwise, use df.itertuples(), unless your column names contain special characters such as ' ' or '-'. See point (2).
It is possible to use itertuples() even if your DataFrame has strange column names, by using the last example. See point (4).
Only use iterrows() if you cannot use the previous solutions. See point (1).
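
Applied to the three-row frame from the question, the first rule above might look like the following sketch. The `results` dictionary and the 1..3 index are assumptions made for illustration.

```python
import pandas as pd

# The three rows from the question; the 1..3 index mirrors its row labels.
df = pd.DataFrame(
    {"A": [2.1, 2.01, 1.9], "B": [1.8, 1.81, 1.84], "C": [1.6, 1.58, 1.52]},
    index=[1, 2, 3],
)

results = {}  # hypothetical mapping: row label -> (calc_temp, calc_temp01)
# name=None yields plain tuples: the index label first, then one value per column.
for idx, a, b, c in df.itertuples(name=None):
    if b > 1.5:
        results[idx] = (a * 10, c * -10)

print(results)
```

Unpacking into one variable per column only works when the number of columns is known in advance, which is why the rule is limited to frames with a fixed set of columns.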

Different methods to iterate over a pandas dataframe:

0) Generate a random dataframe with a million rows and 4 columns:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)

1) The usual iterrows() is convenient but very slow:

import time

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))

2) The default itertuples() is already much faster, but attribute access doesn't work with column names such as "My Col-Name is very Strange" (avoid this method if your column names are repeated or cannot be simply converted to a Python variable name):

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

3) itertuples() with name=None is even faster, but less convenient, as you have to define a variable per column:

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for _, col1, col2, col3, col4 in df.itertuples(name=None):
    result += max(col2, col3)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

4) Finally, the named itertuples() indexed by position via df.columns.get_loc() is slower than the previous point, but you do not have to define a variable per column, and it works with column names such as "My Col-Name is very Strange":

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
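
To illustrate why point (2) breaks on unusual column names while point (4) still works, here is a small sketch on a hypothetical two-column frame (`df_odd` and its values are made up for this example). As far as I know, pandas builds the named tuples with renaming enabled, so a column name that is not a valid Python identifier is replaced by a positional name.

```python
import pandas as pd

# Hypothetical frame with one ordinary and one "strange" column name.
df_odd = pd.DataFrame({"A": [1, 2], "My Col-Name": [3, 4]})

first = next(df_odd.itertuples(index=False))

# The strange name cannot be an attribute, so it is replaced by a positional name.
fields = first._fields
print(fields)

# Point (4): look up the column position once, then index the tuple by position.
pos = df_odd.columns.get_loc("My Col-Name")
by_position = first[pos]
print(by_position)
```

Looking up the position once outside the loop, rather than on every row as in point (4), should also reduce the overhead, though that was not measured in the timings reported here.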

Output:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519

This article is a very interesting comparison between iterrows and itertuples
