Indexing Pandas data frames: integer rows, named columns
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/28754603/
Asked by Hillary Sanders
Say df is a pandas dataframe.
- df.loc[] only accepts names
- df.iloc[] only accepts integers (actual placements)
- df.ix[] accepts both names and integers:
When referencing rows, df.ix[row_idx, ] only wants to be given names, e.g.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a' : ['one', 'two', 'three','four', 'five', 'six'],
                   '1' : np.arange(6)})
df = df.ix[2:6]  # note: .ix was removed in pandas 1.0
print(df)
   1      a
2  2  three
3  3   four
4  4   five
5  5    six
df.ix[0, 'a']
throws an error; it doesn't return 'three'.
When referencing columns, df.ix prefers integers, not names, e.g.
df.ix[2, 1]
returns 'three', not 2. (Although df.ix[2, '1'] does return 2.)
Oddly, I'd like the exact opposite functionality. Usually my column names are very meaningful, so in my code I reference them directly. But due to a lot of observation cleaning, the row names in my pandas data frames don't usually correspond to range(len(df)).
I realize I can use:
df.iloc[0].loc['a'] # returns three
But it seems ugly! Does anyone know of a better way to do this, so that the code would look like this?
df.foo[0, 'a'] # returns three
In fact, is it possible to add my own new method to pandas.core.frame.DataFrame objects, so that e.g. df.idx(rows, cols) is in fact df.iloc[rows].loc[cols]?
Answered by brunston
It's a late answer, but @unutbu's comment is still valid and a great solution to this problem.
To index a DataFrame with integer rows and named columns (labeled columns):
df.loc[df.index[#], 'NAME']
where # is a valid integer index and NAME is the name of the column.
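As a minimal sketch of this pattern, the snippet below rebuilds a frame like the one in the question (rows labeled 2 to 5) and looks up the first row by position. The explicit index and the variable names `n`, `label`, and `value` are assumptions chosen for illustration, not part of the original answer.

```python
import pandas as pd

# Frame shaped like the question's example after slicing: row labels 2..5.
df = pd.DataFrame(
    {"a": ["three", "four", "five", "six"], "1": [2, 3, 4, 5]},
    index=[2, 3, 4, 5],
)

n = 0                        # integer position of the row we want
label = df.index[n]          # translate the position into a row label
value = df.loc[label, "a"]   # ordinary label-based lookup
print(label, value)
```

The lookup is done in two steps: `df.index[n]` converts the position to a label, and `.loc` then works purely with labels.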
Answered by Krishna
We can reset the index and then use 0-based indexing like this:
df.reset_index(drop=True).loc[0,'a']
Edit: removed [] from the column name index 'a' so it just outputs the value.
Answered by prashansa agrawal
Something like df["a"][0] is working fine for me. You may try it out!
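One caveat that may be worth checking before relying on this: on a column whose index holds integers, `df["a"][0]` is a label lookup, so it only works when a row labeled 0 exists. The sketch below assumes a frame like the one in the question (rows labeled 2 to 5); the variable names are illustrative.

```python
import pandas as pd

# Frame shaped like the question's example after slicing: row labels 2..5.
df = pd.DataFrame(
    {"a": ["three", "four", "five", "six"], "1": [2, 3, 4, 5]},
    index=[2, 3, 4, 5],
)

try:
    first = df["a"][0]        # looks for the row *labeled* 0
except KeyError:
    first = None              # no such label in this frame

first_by_position = df["a"].iloc[0]   # positional lookup on the column
print(first, first_by_position)
```

Selecting the column by name and then using `.iloc` on the resulting Series keeps the named-column, integer-row combination the question asks for.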
Answered by Darkonaut
For getting or setting a single value in a DataFrame by row/column labels, you are better off using DataFrame.at instead of DataFrame.loc, as it is ...
- faster
- more explicit about wanting to access only a single value.
As others have already shown, if you start out with an integer position for the row, you still have to find the row label first with DataFrame.index, as DataFrame.at only accepts labels:
df.at[df.index[0], 'a']
# Out: 'three'
Benchmark:
%timeit df.at[df.index[0], 'a']
# 7.57 μs ± 30.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.loc[df.index[0], 'a']
# 10.9 μs ± 53.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.iloc[0, df.columns.get_loc("a")]
# 13.3 μs ± 24 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For completeness: DataFrame.iat is for accessing a single value for a row/column pair by integer position.
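To make the difference between the two scalar accessors concrete, here is a small sketch assuming a frame like the one in the question (rows labeled 2 to 5). The variable names are hypothetical, and the uppercase replacement value is only there to show that `.at` can also set a value.

```python
import pandas as pd

# Frame shaped like the question's example after slicing: row labels 2..5.
df = pd.DataFrame(
    {"a": ["three", "four", "five", "six"], "1": [2, 3, 4, 5]},
    index=[2, 3, 4, 5],
)

# .at takes labels, so the row position is first mapped to its label.
value_at = df.at[df.index[0], "a"]

# .iat takes positions, so the column name is first mapped to its position.
col_pos = df.columns.get_loc("a")
value_iat = df.iat[0, col_pos]

# Both accessors can also assign a single cell.
df.at[df.index[0], "a"] = "THREE"
updated = df.iat[0, col_pos]
print(value_at, value_iat, updated)
```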
Answered by Ben
The existing answers seem short-sighted to me.
Problematic Solutions
df.loc[df.index[0], 'a']
The strategy here is to get the row label of the 0th row and then use .loc as normal. I see two issues.
- If df has repeated row labels, df.loc[df.index[0], 'a'] could return multiple rows.
- .loc is slower than .iloc so you're sacrificing speed here.
df.reset_index(drop=True).loc[0,'a']
The strategy here is to reset the index so the row labels become 0, 1, 2, ... thus .loc[0] gives the same result as .iloc[0]. Still, the problem here is runtime, as .loc is slower than .iloc and you'll incur a cost for resetting the index.
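The repeated-label issue can be reproduced with a tiny frame. This is only an illustrative sketch: the frame `dup` and its labels are made up for the demonstration and do not come from the original answer.

```python
import pandas as pd

# Hypothetical frame in which the row label 7 appears twice.
dup = pd.DataFrame({"a": ["x", "y", "z"]}, index=[7, 7, 8])

# Label-based lookup matches every row labeled 7, so a Series comes back.
by_label = dup.loc[dup.index[0], "a"]

# Purely positional lookup always yields exactly one cell.
by_position = dup.iloc[0, 0]
print(type(by_label).__name__, len(by_label), by_position)
```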
Better Solution
I suggest following @Landmaster's solution in the comments.
df.iloc[0, df.columns.get_loc("a")]
Essentially, this is the same as df.iloc[0, 0] except we get the column index dynamically using df.columns.get_loc("a"). The multi-column generalization of this would be something like
df.iloc[0, [df.columns.get_loc(c) for c in ['a', 'b', 'c']]]
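A runnable version of the multi-column form might look like the sketch below. The frame, its extra columns `b` and `c`, and the variable names are assumptions for illustration. `Index.get_indexer` is shown as one possible alternative to the list comprehension, not as something the original answer proposes.

```python
import pandas as pd

# Hypothetical frame with non-default row labels and three named columns.
df = pd.DataFrame(
    {"a": ["three", "four"], "b": [1.5, 2.5], "c": [10, 20]},
    index=[2, 3],
)

cols = ["a", "c"]

# Map each column name to its integer position, then index purely by position.
positions = [df.columns.get_loc(c) for c in cols]
row = df.iloc[0, positions]

# Index.get_indexer performs the same name-to-position mapping in one call.
same = df.iloc[0, df.columns.get_indexer(cols)]
print(row)
```

Note that `get_indexer` returns -1 for names it cannot find, whereas `get_loc` raises a KeyError, so the list comprehension fails earlier on a misspelled column.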