Python: How to iterate over rows in a DataFrame in Pandas?
Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/16476924/
How to iterate over rows in a DataFrame in Pandas?
Asked by Roman
I have a DataFrame from pandas:
import pandas as pd
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)
Output:
   c1   c2
0  10  100
1  11  110
2  12  120
Now I want to iterate over the rows of this frame. For every row I want to be able to access its elements (values in cells) by the name of the columns. For example:
for row in df.rows:
    print(row['c1'], row['c2'])
Is it possible to do that in pandas?
I found this similar question. But it does not give me the answer I need. For example, it is suggested there to use:
for date, row in df.T.iteritems():
or
for row in df.iterrows():
But I do not understand what the row object is and how I can work with it.
Accepted answer by waitingkuo
DataFrame.iterrows is a generator which yields both the index and the row (as a Series):
import pandas as pd
import numpy as np
df = pd.DataFrame([{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}])
for index, row in df.iterrows():
    print(row['c1'], row['c2'])
Output:
10 100
11 110
12 120
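Since the question asks what the row object actually is, here is a small sketch that inspects the first pair produced by iterrows(). The variable names (row_type, row_labels, c1_value) are only for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}])

# Take just the first (index, row) pair from the generator.
index, row = next(df.iterrows())

row_type = type(row).__name__   # the row is a pandas Series
row_labels = list(row.index)    # the Series is labelled by the column names
c1_value = row['c1']            # so cells can be read by column name

print(index, row_type, row_labels, c1_value)
```

Because each row is a Series labelled by the column names, row['c1'] works the same way as indexing any other Series.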
Answer by Wes McKinney
You should use df.iterrows(), though iterating row by row is not especially efficient, since a Series object has to be created for each row.
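To illustrate the efficiency point, the sketch below computes the same per-row sum twice: once with iterrows(), which builds one Series per row, and once as a single column-wise operation. The sum itself is an arbitrary example chosen here, not something from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Row by row: a new Series is constructed for every row.
looped = [row['c1'] + row['c2'] for _, row in df.iterrows()]

# Column-wise: one vectorized operation, no per-row Series objects.
vectorized = (df['c1'] + df['c2']).tolist()

print(looped == vectorized)
```

When the computation can be expressed on whole columns like this, the loop is usually unnecessary.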
Answer by cheekybastard
You can also use df.apply() with axis=1 to iterate over rows and pass multiple columns to a function.
def valuation_formula(x, y):
    return x * y * 0.5
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
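The snippet above assumes a frame with columns x and y, which the question's frame does not have. A self-contained version might look like the following; the sample values and the price_vectorized column are made up for illustration.

```python
import pandas as pd

def valuation_formula(x, y):
    return x * y * 0.5

# Hypothetical data with the 'x' and 'y' columns the answer assumes.
df = pd.DataFrame({'x': [2, 4, 6], 'y': [10, 20, 30]})

# axis=1 passes each row (as a Series) to the lambda.
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)

# Because the formula is plain arithmetic, it also works on whole columns.
df['price_vectorized'] = valuation_formula(df['x'], df['y'])

print(df)
```

For simple arithmetic like this, the column-wise call gives the same values without calling the function once per row; apply with axis=1 is mainly useful when the function cannot operate on whole columns.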
Answer by e9t
While iterrows() is a good option, sometimes itertuples() can be much faster:
from numpy.random import randn, randint
df = pd.DataFrame({'a': randn(1000), 'b': randn(1000),'N': randint(100, 1000, (1000)), 'x': 'x'})
%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop
%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 μs per loop
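In the timing above, row[1] is the first column because position 0 of each tuple holds the index. As a rough sketch on the question's frame, the same values can be read by attribute name, which is usually easier to follow:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

pairs = []
for row in df.itertuples():
    # row.Index is the index label; each column is available as an attribute.
    pairs.append((row.Index, row.c1, row.c2))

# With index=False the index is omitted and position 0 is the first column.
first = next(df.itertuples(index=False))

print(pairs)
print(first._fields, first[0])
```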
Answer by PJay
You can use df.iloc as follows:
for i in range(0, len(df)):
    print(df.iloc[i]['c1'], df.iloc[i]['c2'])
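Note that df.iloc[i] builds a whole row Series before the column lookup. If only single cells are needed, one possible variation (not part of the original answer) is to index the cell directly, either by position with iloc or by label with at:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Integer position of column 'c1', so iloc can address a single cell.
c1_pos = df.columns.get_loc('c1')
by_position = [df.iloc[i, c1_pos] for i in range(len(df))]

# Label-based scalar access: index label plus column name.
by_label = [df.at[label, 'c1'] for label in df.index]

print(by_position, by_label)
```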
Answer by viddik13
First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.
If you still need to iterate over rows, you can use the methods below. Note some important caveats which are not mentioned in any of the other answers.
for index, row in df.iterrows():
    print(row["c1"], row["c2"])
for row in df.itertuples(index=True, name='Pandas'):
    print(row.c1, row.c2)
itertuples() is supposed to be faster than iterrows().
But be aware, according to the docs (pandas 0.24.2 at the moment):
iterrows: dtype might not match from row to row
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and which is generally much faster than iterrows().
iterrows: Do not modify rows
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Use DataFrame.apply() instead:
new_df = df.apply(lambda x: x * 2)
itertuples:
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
See the pandas docs on iteration for more details.
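The three caveats can be reproduced with a small sketch. The frames below (mixed, odd) and their column names are invented for the demonstration. A mixed int/float frame is used on purpose, since that is a case where iterrows() has to hand back a copy of each row.

```python
import pandas as pd

# Caveat 1: iterrows() does not preserve dtypes across a row.
mixed = pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5]})
_, row = next(mixed.iterrows())
row_dtype = str(row.dtype)  # the int column is upcast to match the float column

# itertuples() keeps each value in its column's type.
tuple_row = next(mixed.itertuples(index=False))
int_preserved = not isinstance(tuple_row.i, float)

# Caveat 2: writing to the row does not change the DataFrame here,
# because each row of this mixed-dtype frame is a copy.
for _, r in mixed.iterrows():
    r['i'] = 99
unchanged = mixed['i'].tolist()

# The quoted docs suggest apply() to build a new frame instead.
doubled = mixed.apply(lambda x: x * 2)

# Caveat 3: itertuples() renames columns that are not valid identifiers
# or that start with an underscore to positional names.
odd = pd.DataFrame({'my col': [1], '_hidden': [2]})
fields = next(odd.itertuples())._fields

print(row_dtype, int_preserved, unchanged, fields)
```

Whether a write inside the loop reaches the original frame depends on the dtypes and the pandas version, which is why the docs advise against relying on it at all.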
Answer by CONvid19
To loop over all rows in a dataframe you can use:
for x in range(len(date_example.index)):
    print(date_example['Date'].iloc[x])
Answer by Grag2015
for ind in df.index:
    print(df['c1'][ind], df['c2'][ind])
Answer by piRSquared
You can write your own iterator that yields namedtuples:
from collections import namedtuple
def myiter(d, cols=None):
    if cols is None:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()
    n = namedtuple('MyTuple', cols)
    for line in iter(v):
        yield n(*line)
This is directly comparable to pd.DataFrame.itertuples. I'm aiming at performing the same task with more efficiency.
For the given dataframe with my function:
list(myiter(df))
[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]
Or with pd.DataFrame.itertuples:
list(df.itertuples(index=False))
[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]
A comprehensive test
We test making all columns available and subsetting the columns.
from timeit import timeit
import numpy as np
import pandas as pd
def iterfullA(d):
    return list(myiter(d))
def iterfullB(d):
    return list(d.itertuples(index=False))
def itersubA(d):
    return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))
def itersubB(d):
    return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))
res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='iterfullA iterfullB itersubA itersubB'.split(),
    dtype=float
)
for i in res.index:
    d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);
Answer by James L.
You can also use numpy indexing for even greater speed-ups. It's not really iterating, but it works much better than iteration for certain applications.
subset = row['c1'][0:5]
all = row['c1'][:]
You may also want to cast it to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast:
np.asarray(all)
imgs[:] = cv2.resize(imgs[:], (224,224) ) #resize every image in an hdf5 file
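As a rough sketch of the same idea on the question's frame: pulling a column out as an explicit NumPy array and then slicing or doing arithmetic on the whole array at once. Using to_numpy() here is an assumption about what the cast is meant to achieve; np.asarray on the column gives an equivalent array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Convert the column to an explicit NumPy array.
c1 = df['c1'].to_numpy()

subset = c1[0:2]   # slicing works on the whole array, no Python-level loop
doubled = c1 * 2   # element-wise arithmetic, also without a loop

# np.asarray on the Series yields an array with the same values.
same_values = np.array_equal(np.asarray(df['c1']), c1)

print(subset, doubled, same_values)
```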


