Original question: http://stackoverflow.com/questions/45670242/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Loop through dataframe one by one (pandas)
Asked by Bondeaux
Let's say we have a dataframe with columns A, B and C:
df = pd.DataFrame(columns =('A','B','C'), index=range(1))
The columns hold three rows of numeric values:
0 A B C
1 2.1 1.8 1.6
2 2.01 1.81 1.58
3 1.9 1.84 1.52
How does one loop through every row from 1 to 3 and execute an if statement on each, assigning some variables along the way:
if B1 > 1.5
    calc_temp = A1*10
    calc_temp01 = C1*-10
if B2 > 1.5
    calc_temp = A2*10
    calc_temp01 = C2*-10
if B3 > 1.5
    calc_temp = A3*10
    calc_temp01 = C3*-10
Is the above even possible? Presumably the loop needs some kind of range, i.e. the full row count of the dataset with some sort of counter, yes? The if statement should refer to the specific row being processed.
Answered by jezrael
Answered by Romain Capron
How to iterate efficiently?
If you really have to iterate over a pandas dataframe, you will probably want to avoid using iterrows(). There are different methods, and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.
In short:
- As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 columns. See point (3)
- Otherwise, use df.itertuples(), except if your columns have special characters such as ' ' or '-'. See point (2)
- It is possible to use itertuples() even if your dataframe has strange columns by using the last example. See point (4)
- Only use iterrows() if you cannot use the previous solutions. See point (1)
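Applied to the small frame from the question, the general rule above might look like the following. This is a minimal sketch and not part of the original answer: the threshold 1.5 and the two formulas come from the question, while the `results` list and the choice to collect one entry per matching row are assumptions made for illustration.

```python
import pandas as pd

# Data from the question, indexed 1..3 as in the asker's table.
df = pd.DataFrame(
    {"A": [2.1, 2.01, 1.9], "B": [1.8, 1.81, 1.84], "C": [1.6, 1.58, 1.52]},
    index=[1, 2, 3],
)

# Hypothetical container: one (index, calc_temp, calc_temp01) entry per matching row.
results = []

# name=None yields plain tuples of the form (index, A, B, C).
for idx, a, b, c in df.itertuples(name=None):
    if b > 1.5:
        calc_temp = a * 10
        calc_temp01 = c * -10
        results.append((idx, calc_temp, calc_temp01))

print(results)
```

No explicit counter is needed: the loop variable `idx` carries the row label, so the if statement always refers to the row currently being visited.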
Different methods to iterate over a pandas dataframe:
0) Generate a random dataframe with a million rows and 4 columns:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)
1) The usual iterrows() is convenient but damn slow:
import time

# time.clock() was removed in Python 3.8; time.perf_counter() is its replacement
start_time = time.perf_counter()
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
2) The default itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name):
start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
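The caveat in point (2) about column names can be seen directly. The sketch below is not from the original answer; the two-row frame is a made-up example. It relies on pandas building the row type with collections.namedtuple and rename=True, so a column name that is not a valid identifier is replaced by a positional field name.

```python
import pandas as pd

# Hypothetical frame with one column name that is not a valid Python identifier.
df = pd.DataFrame({"My Col-Name is very Strange": [1, 2], "B": [3, 4]})

first = next(df.itertuples(index=False))

# The strange column is no longer reachable under its own name;
# only the valid name 'B' survives as an attribute.
print(first._fields)
print(first.B)
```

Positional access such as `first[0]` still works, which is what point (4) builds on.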
3) itertuples() with name=None is even faster, but not really convenient as you have to define a variable per column:
start_time = time.perf_counter()
result = 0
for _, col1, col2, col3, col4 in df.itertuples(name=None):
    result += max(col2, col3)
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
4) Finally, the named itertuples(), indexed by column position via df.columns.get_loc(), is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange:
start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
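One possible variation on point (4), offered as a sketch and not measured here: the column positions do not change during the loop, so get_loc() can be called once beforehand instead of twice per row. The frame and its column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose column names contain a space and a dash.
df = pd.DataFrame({"first col": [1, 5, 2], "second-col": [4, 3, 9]})

# Look the positions up once, outside the loop.
pos_first = df.columns.get_loc("first col")
pos_second = df.columns.get_loc("second-col")

result = 0
for row in df.itertuples(index=False):
    # Positional access works regardless of how the columns are named.
    result += max(row[pos_first], row[pos_second])

print(result)
```

Whether this narrows the gap to point (3) on a million rows is an assumption that would need its own timing run.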
Output:
A B C D
0 41 63 42 23
1 54 9 24 65
2 15 34 10 9
3 39 94 82 97
4 4 88 79 54
... .. .. .. ..
999995 48 27 4 25
999996 16 51 34 28
999997 1 39 61 14
999998 66 51 27 70
999999 51 53 47 99
[1000000 rows x 4 columns]
1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519
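The answer opens with "if you really have to iterate". For this particular benchmark the loop is not strictly required, because the same total can be computed with column-wise operations. The sketch below is not from the original answer; it uses a smaller frame (1,000 rows, an arbitrary choice) with a fixed seed, and checks the column-wise total against the loop from point (3).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
df = pd.DataFrame(rng.integers(0, 100, size=(1000, 4)), columns=list("ABCD"))

# Loop version, as in point (3).
loop_result = 0
for _, a, b, c, d in df.itertuples(name=None):
    loop_result += max(b, c)

# Column-wise version: row-wise maximum of B and C, then the sum.
vectorized_result = int(df[["B", "C"]].max(axis=1).sum())

print(loop_result == vectorized_result)
```

This only applies when the per-row logic can be expressed on whole columns; for logic that cannot, the itertuples() variants above remain the relevant options.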
This article is a very interesting comparison between iterrows and itertuples