Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/35047842/

Date: 2020-09-14 00:34:47  Source: igfitidea

How to store the name of rows and column index in pandas DataFrame?

Tags: python, pandas

Asked by Mathieu Dubois

I have a DataFrame with named row and column indexes:

import numpy as np
import pandas as pd

I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(data=np.random.rand(4, 3),
                  index=I,
                  columns=C)

I have tried to store it in several formats (Excel, CSV), but when re-reading the file the names are lost (maybe I have missed some options). Msgpack works, but it is marked as experimental, so I would prefer to avoid it for now. I would also prefer to avoid pickle. Is there any way (format and option) to store the names of the two indexes?

EDIT: I know how to write and read CSV with pandas. The problem is saving the name of the column index and of the row index.

Answered by Stop harming Monica

You can use HDF.

import numpy as np
import pandas as pd
I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="columns")
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)
print(df)

columns      col0      col1      col2
rows                                 
a        0.098497  0.918954  0.642800
b        0.168266  0.678434  0.455059
c        0.434939  0.244027  0.599400
d        0.877356  0.053085  0.182661

df.to_hdf('test.hdf', key='test')
print(pd.read_hdf('test.hdf'))

columns      col0      col1      col2
rows                                 
a        0.098497  0.918954  0.642800
b        0.168266  0.678434  0.455059
c        0.434939  0.244027  0.599400
d        0.877356  0.053085  0.182661

Answered by albert

You can export the DataFrame to a CSV file using .to_csv() and read it back in using pd.read_csv(). I extended the code you already had as follows:

#!/usr/bin/env python3
# coding: utf-8

import numpy as np
import pandas as pd

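# Caution: the second positional argument of pd.Index is dtype, not name,
# so "rows" and "cols" are not assigned as index names here (see Borja's answer).
I = pd.Index(["a", "b", "c", "d"], "rows")
C = pd.Index(["col0", "col1", "col2"], "cols")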
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)

# export DataFrame to csv
df.to_csv('out.csv')

# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)

So the DataFrame df looks like this:

       col0      col1      col2
a  0.590016  0.834033  0.535310
b  0.421589  0.897302  0.029500
c  0.373580  0.109005  0.239181
d  0.473872  0.075918  0.751628

The CSV file out.csv looks like this:

,col0,col1,col2
a,0.5900160748408918,0.8340332218911729,0.5353103406507513
b,0.42158899389955884,0.8973015040807538,0.029500416731096046
c,0.37357951184145965,0.10900495955642386,0.2391805787788026
d,0.47387186813644167,0.07591794371425187,0.7516279365972057

Reading the data back in leads to the DataFrame df_in as follows:

       col0      col1      col2
a  0.590016  0.834033  0.535310
b  0.421589  0.897302  0.029500
c  0.373580  0.109005  0.239181
d  0.473872  0.075918  0.751628

So df_in is exactly the same as df, which shows that the export and the subsequent import work as expected.

EDIT to export column and index names:

df.to_csv('out.csv', index_label=[df.index.name, df.columns.name])

However, this makes re-importing a bit difficult, since the name of the columns index is added as an additional column. Normally this is useful for multi-indexed data, but here it leads to an additional empty column.

So I would suggest exporting only the index name:

# export DataFrame to csv
df.to_csv('out.csv', index_label=df.index.name)

# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)

which leads to df_in as:

          col0      col1      col2
rows                              
a     0.442467  0.959260  0.626502
b     0.639044  0.989795  0.853002
c     0.576137  0.350260  0.532920
d     0.235698  0.095978  0.194151

I do not know why you need to export the names of both the index and the columns. If you simply want to access the row or column labels, you can get them like this:

# get_values() was removed in later pandas releases; to_numpy() returns the same labels
column_labels = df.columns.to_numpy()
# array(['col0', 'col1', 'col2'], dtype=object)

index_labels = df.index.to_numpy()
# array(['a', 'b', 'c', 'd'], dtype=object)

Answered by Borja

Does your DataFrame here actually have the index name? On my computer I need to write

I = pd.Index(["a", "b", "c", "d"], name="rows")

instead of

I = pd.Index(["a", "b", "c", "d"], "rows")

for the name to be actually assigned. Then using df.to_csv() and pd.read_csv() the index name does get preserved (in the csv the index name is just saved like another column).
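
A minimal sketch of that round trip, assuming the names are passed with name= and using an in-memory buffer (io.StringIO) instead of a file on disk; the variable names are only illustrative. It suggests that the index name survives when the first column is read back as the index, while the name of the columns index is not written to the CSV at all:

```python
import io

import numpy as np
import pandas as pd

rows = pd.Index(["a", "b", "c", "d"], name="rows")
cols = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), index=rows, columns=cols)

# Write to an in-memory buffer (stand-in for a real file) and read it back.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
df_in = pd.read_csv(buffer, index_col=0)

print(df_in.index.name)    # rows
print(df_in.columns.name)  # None
```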

If that doesn't work, do you need to be able to open it with programs other than Python? If not, you could use pickle which should leave the object untouched:

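import pickle

# context managers make sure the files are closed after writing and reading
with open("File.pickled", "wb") as f:
    pickle.dump(df, f)

with open("File.pickled", "rb") as f:
    df_read = pickle.load(f)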

EDIT: If you want the name of the columns index to be preserved with to_csv() as well, you could use the following hack:

Save it with the following (this converts the columns index to a row, so the name of the columns index becomes a row index value):

df.T.reset_index().T.to_csv("DataFrame.csv")

And then read it with either of these two, first one being (very) slightly faster:

df_read = pd.read_csv("DataFrame.csv", index_col=0)
df_read.columns = df_read.loc['cols']
df_read = df_read.drop('cols', axis=0)

or

df_read = pd.read_csv("DataFrame.csv", index_col=0)
df_read = df_read.T.set_index('cols').T
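
One thing to watch with this hack: every column in the file then starts with a label string, so the values may come back as strings rather than floats. A sketch of the full round trip, assuming an in-memory buffer in place of DataFrame.csv and an extra astype(float) step that is not part of the hack itself:

```python
import io

import numpy as np
import pandas as pd

rows = pd.Index(["a", "b", "c", "d"], name="rows")
cols = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), index=rows, columns=cols)

# Write: move the column labels into the first data row.
buffer = io.StringIO()
df.T.reset_index().T.to_csv(buffer)
buffer.seek(0)

# Read: rebuild the columns index from the 'cols' row.
df_read = pd.read_csv(buffer, index_col=0)
df_read = df_read.T.set_index("cols").T

# Assumed extra step: convert the values back to floats.
df_read = df_read.astype(float)

print(df_read.index.name, df_read.columns.name)
```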

Answered by ehhh

header=True (for the column names) and index=True (for the index values) in df.to_csv should do the trick. Both are True by default, according to the docs. However, you will have to set the index manually after loading, because it is loaded as a regular column. For this, I have added a name to the index:

import numpy as np
import pandas as pd

# your code:
I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(data=np.random.rand(4, 3),
                  index=I,
                  columns=C)

# name the index
df.index.name = 'index'

print("original:")
print(df)
print("")

df.to_csv("~/test.csv", index=True, header=True)

df2 = pd.read_csv("~/test.csv").set_index('index')

print("read from disk:")
print(df2)
print("")

Output:

original:
cols       col0      col1      col2
index                              
a      0.455378  0.830872  0.495953
b      0.707146  0.832009  0.112540
c      0.894997  0.156364  0.521047
d      0.775462  0.482554  0.578177

read from disk:
           col0      col1      col2
index                              
a      0.455378  0.830872  0.495953
b      0.707146  0.832009  0.112540
c      0.894997  0.156364  0.521047
d      0.775462  0.482554  0.578177
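
As a possible shortcut, read_csv can also be told which column to use as the index via index_col, which should give the same result as calling set_index afterwards. A small sketch, assuming an in-memory string in place of ~/test.csv and illustrative variable names:

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                  index=pd.Index(["a", "b", "c", "d"], name="index"),
                  columns=["col0", "col1", "col2"])

# to_csv returns the CSV text when no path is given.
csv_text = df.to_csv(index=True, header=True)

# Two ways of restoring the index from the 'index' column.
via_set_index = pd.read_csv(io.StringIO(csv_text)).set_index("index")
via_index_col = pd.read_csv(io.StringIO(csv_text), index_col="index")

print(via_index_col.index.name)  # index
```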