How do I refer to the index of a Pandas dataframe?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/23314564/

Time: 2020-09-13 21:58:10  Source: igfitidea

How do I refer to the index of my Pandas dataframe?

python pandas indexing dataframe

Asked by orome

I have a Pandas dataframe where I have designated some of the columns as indices:

planets_dataframe.set_index(['host','name'], inplace=True)

and would like to be able to refer to these indices in a variety of contexts. Using the name of an index works fine in queries

planets_dataframe.query('host == "PSR 1257 12"')

but results in an error if I try to use it to get a list of the values of an index, as I could when it was a column

planets_dataframe.name
#AttributeError: 'DataFrame' object has no attribute 'name'

or to use it to list results as I could when it was a "regular" column

planets_dataframe.query('30 > mass > 20 and discoveryyear > 2009')['name']
#KeyError: u'no item named name'

How do I refer to the "columns" of the dataframe that I'm using as indexes?

Before set_index:

planets_dataframe.columns
# Index([u'name', u'lastupdate', u'temperature', u'semimajoraxis', u'discoveryyear', u'calculated', u'period', u'age', u'mass', u'host', u'verification', u'transittime', u'eccentricity', u'radius', u'discoverymethod', u'inclination'], dtype='object')

After set_index:

planets_dataframe.columns
#Index([u'lastupdate', u'temperature', u'semimajoraxis', u'discoveryyear', u'calculated', u'period', u'age', u'mass', u'verification', u'transittime', u'eccentricity', u'radius', u'discoverymethod', u'inclination'], dtype='object')

Answered by BrenBarn

I think you have a slight misunderstanding of what indexes are. You don't just "designate" columns as indexes; that is, you don't just "tag" certain columns with info that says "this is an index". The index is a separate data structure that can hold data that aren't even present in the columns. If you do set_index, you move those columns into the index, so they no longer exist as regular columns. This is why you can no longer use them in the ways you mention: they aren't there anymore.
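
To make the "moved, not tagged" point concrete, here is a minimal sketch. The small planets table and its values are made up for illustration and only stand in for the asker's real data; the set_index call mirrors the one in the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the asker's planets table.
planets = pd.DataFrame({
    "host": ["PSR 1257 12", "PSR 1257 12", "51 Peg"],
    "name": ["b", "c", "b"],
    "mass": [0.02, 4.3, 150.0],
})

indexed = planets.set_index(["host", "name"])

# The moved columns are gone from .columns ...
print(list(indexed.columns))      # ['mass']
# ... and now live in the index as named levels.
print(list(indexed.index.names))  # ['host', 'name']
print("name" in indexed.columns)  # False
```

This is why indexed.name and indexed['name'] fail: attribute and bracket access look the name up among the columns, and it is no longer there.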

One thing you can do is pass drop=False to set_index, which tells it to keep the columns as columns in addition to putting them in the index (effectively copying them to the index rather than moving them), e.g., df.set_index('SomeColumn', drop=False). However, you should be aware that the index and column are still distinct, so, for instance, modifying the column values will not affect what's stored in the index.
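
A small sketch of that behaviour, using a made-up two-column frame (the column name SomeColumn is taken from the example above; the data is hypothetical). Here the column is replaced wholesale with new values, and the index keeps the old ones:

```python
import pandas as pd

# Hypothetical data; only the column name comes from the text above.
df = pd.DataFrame({"SomeColumn": ["a", "b", "c"], "value": [1, 2, 3]})

# drop=False keeps SomeColumn as a column and also copies it into the index.
kept = df.set_index("SomeColumn", drop=False)

# Replace the column with upper-cased values; the index is not touched.
kept["SomeColumn"] = kept["SomeColumn"].str.upper()

print(list(kept["SomeColumn"]))  # ['A', 'B', 'C']
print(list(kept.index))          # ['a', 'b', 'c']
```

After the assignment the two copies disagree, which is the kind of drift to watch for when the same data is kept in both places.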

The upshot is that indexes aren't really columns of the DataFrame, so if you want to be able to use some data as both an index and a column, you need to duplicate it in both places. There is some discussion of this issue here.

Answered by unutbu

The information is accessible using the index's get_level_values method:

import numpy as np
import pandas as pd
np.random.seed(1)

df = pd.DataFrame(np.random.randint(4, size=(10,4)), columns=list('ABCD'))    
idf = df.set_index(list('AB'))

idf.index.get_level_values('A') is roughly equivalent to df['A']. Note the change in type and dtype, however:

print(df['A'])
# 0    1
# 1    3
# 2    3
# 3    0
# 4    2
# 5    2
# 6    3
# 7    1
# 8    3
# 9    3
# Name: A, dtype: int32

def level(df, lvl):
    return df.index.get_level_values(lvl)

print(level(idf, 'A'))
# Int64Index([1, 3, 3, 0, 2, 2, 3, 1, 3, 3], dtype='int64')

And here again, instead of selecting the column with ['A'], you can get the equivalent information using .index.get_level_values('A'):

print(df.query('3>C>0 and D>0')['A'])
# 8    3
# Name: A, dtype: int32

print(level(idf.query('3>C>0 and D>0'), 'A'))
# Int64Index([3], dtype='int64')
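
Applied to the question's own query, the same idea might look like the sketch below. The planet rows are invented for illustration; only the column names and the query string come from the question. The last lines show reset_index() as an alternative, which moves the index levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical rows; column names and the query string follow the question.
planets = pd.DataFrame({
    "host": ["Star A", "Star A", "Star B"],
    "name": ["A b", "A c", "B b"],
    "mass": [25.0, 5.0, 22.0],
    "discoveryyear": [2011, 2012, 2005],
}).set_index(["host", "name"])

hits = planets.query("30 > mass > 20 and discoveryyear > 2009")

# Read the 'name' level of the index instead of a 'name' column.
names = hits.index.get_level_values("name")
print(list(names))  # ['A b']

# Alternative: turn the index levels back into columns, then select as usual.
print(list(hits.reset_index()["name"]))  # ['A b']
```

get_level_values returns an Index rather than a Series, so call .to_series() or list() on it if the downstream code expects one of those.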


PS. One of the golden rules of database design is "Never repeat the same data in two places," since sooner or later the data will become inconsistent and thus corrupted. So I would recommend against keeping the data as both a column and an index, primarily because it could lead to data corruption, but also because it could be an inefficient use of memory.
