How to store the name of rows and column index in pandas DataFrame?
Original question: http://stackoverflow.com/questions/35047842/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow (not to this page).
Asked by Mathieu Dubois
I have a DataFrame with named row and column indexes:
import numpy as np
import pandas as pd
I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(data=np.random.rand(4, 3),
index=I,
columns=C)
I have tried to store it in several formats (Excel, CSV), but when re-reading the file the names are lost (maybe I have missed some options). Msgpack works, but it is marked as experimental so I would prefer to avoid it for now. I would also prefer to avoid pickle. Is there any way (format and option) to store the names of the 2 indexes?
EDIT: I know how to write and read CSV with pandas. The problem is to save the name of the column index and of the row index.
Answer by Stop harming Monica
You can use HDF.
import numpy as np
import pandas as pd
I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="columns")
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)
print(df)
columns      col0      col1      col2
rows
a        0.098497  0.918954  0.642800
b        0.168266  0.678434  0.455059
c        0.434939  0.244027  0.599400
d        0.877356  0.053085  0.182661
# pass the key by keyword; recent pandas versions deprecate passing it positionally
df.to_hdf('test.hdf', key='test')
print(pd.read_hdf('test.hdf'))
columns      col0      col1      col2
rows
a        0.098497  0.918954  0.642800
b        0.168266  0.678434  0.455059
c        0.434939  0.244027  0.599400
d        0.877356  0.053085  0.182661
Answer by albert
You can export the DataFrame to a CSV file using .to_csv() and read it back in using pd.read_csv(). I extended the code you already had as follows:
#!/usr/bin/env python3
# coding: utf-8
import numpy as np
import pandas as pd
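# NOTE: the second positional argument of pd.Index is dtype, not name, so these
# calls do not set a name (newer pandas versions may reject them outright);
# pass name="rows" / name="cols" to actually name the indexes.
I = pd.Index(["a", "b", "c", "d"], "rows")
C = pd.Index(["col0", "col1", "col2"], "cols")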
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)
# export DataFrame to csv
df.to_csv('out.csv')
# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)
So the DataFrame df looks like this:
       col0      col1      col2
a  0.590016  0.834033  0.535310
b  0.421589  0.897302  0.029500
c  0.373580  0.109005  0.239181
d  0.473872  0.075918  0.751628
The CSV file out.csv looks like this:
,col0,col1,col2
a,0.5900160748408918,0.8340332218911729,0.5353103406507513
b,0.42158899389955884,0.8973015040807538,0.029500416731096046
c,0.37357951184145965,0.10900495955642386,0.2391805787788026
d,0.47387186813644167,0.07591794371425187,0.7516279365972057
Reading the data back in leads to the DataFrame df_in as follows:
       col0      col1      col2
a  0.590016  0.834033  0.535310
b  0.421589  0.897302  0.029500
c  0.373580  0.109005  0.239181
d  0.473872  0.075918  0.751628
So df_in is exactly the same as df, which shows that the export and the subsequent import work as expected.
EDIT to export column and index names:
df.to_csv('out.csv', index_label=[df.index.name, df.columns.name])
However, this makes re-importing a bit difficult, since the columns name is added as an additional column. Normally this is useful for multi-indexed data, but here it leads to an additional empty column.
So I would suggest exporting the index name only:
# export DataFrame to csv
df.to_csv('out.csv', index_label=df.index.name)
# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)
which leads to df_in as:
          col0      col1      col2
rows
a     0.442467  0.959260  0.626502
b     0.639044  0.989795  0.853002
c     0.576137  0.350260  0.532920
d     0.235698  0.095978  0.194151
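To check what actually survives this round trip, here is a minimal sketch. It assumes the indexes are created with name= (as in the question), uses fixed values instead of random ones, and writes to an in-memory buffer rather than out.csv; the variable names rows, cols and buf are only illustrative.

```python
import io

import numpy as np
import pandas as pd

# Same shape as the question's frame, but with fixed values.
rows = pd.Index(["a", "b", "c", "d"], name="rows")
cols = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                  index=rows, columns=cols)

# Write the index name as the label of the first CSV column.
buf = io.StringIO()
df.to_csv(buf, index_label=df.index.name)
buf.seek(0)

# Read the first column back as the index.
df_in = pd.read_csv(buf, index_col=0)

index_name_kept = df_in.index.name == df.index.name
columns_name_kept = df_in.columns.name == df.columns.name
print(index_name_kept, columns_name_kept)  # True False
```

So the row index name comes back, while the name of the columns index is dropped, which is the part the question is really about.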
I do not know why you need to export the names of both index and columns. If you simply want to access the row or column labels, you can get them like this:
# get_values() was removed in pandas 1.0; to_numpy() is the replacement
column_labels = df.columns.to_numpy()
# array(['col0', 'col1', 'col2'], dtype=object)
index_labels = df.index.to_numpy()
# array(['a', 'b', 'c', 'd'], dtype=object)
Answer by Borja
Does your DataFrame here actually have the index name? On my computer I need to write
I = pd.Index(["a", "b", "c", "d"], name="rows")
instead of
I = pd.Index(["a", "b", "c", "d"], "rows")
for the name to be actually assigned. Then, using df.to_csv() and pd.read_csv(), the index name does get preserved (in the CSV the index name is just saved like another column).
If that doesn't work, do you need to be able to open it with programs other than Python? If not, you could use pickle, which should leave the object untouched:
import pickle
# use context managers so the file handles are closed after use
with open("File.pickled", "wb") as f:
    pickle.dump(df, f)
with open("File.pickled", "rb") as f:
    df_read = pickle.load(f)
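As a quick check of the pickle route, the sketch below uses pandas' own DataFrame.to_pickle and pd.read_pickle instead of the pickle module directly (a substitution on my side, not what the answer wrote). The temporary directory and the file name frame.pkl are only there to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rows = pd.Index(["a", "b", "c", "d"], name="rows")
cols = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                  index=rows, columns=cols)

# Round-trip through a pickle file in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")  # hypothetical file name
    df.to_pickle(path)
    df_read = pd.read_pickle(path)

# Both index names come back unchanged.
print(df_read.index.name, df_read.columns.name)  # rows cols
```

Keep in mind that a pickle file is Python-specific and should only be loaded from sources you trust.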
EDIT: If you want the name of the columns index to be preserved with to_csv() as well, you could use the following hack.
Save it with the following (this converts the columns index to a row, so the columns index name becomes a row index value):
df.T.reset_index().T.to_csv("DataFrame.csv")
And then read it with either of these two, the first one being (very) slightly faster:
df_read = pd.read_csv("DataFrame.csv", index_col=0)
df_read.columns = df_read.loc['cols']
df_read = df_read.drop('cols', axis=0)
or
df_read = pd.read_csv("DataFrame.csv", index_col=0)
df_read = df_read.T.set_index('cols').T
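One thing to watch with this hack: because the column labels are stored as an ordinary data row, every column in the CSV mixes text and numbers, so the values are read back as strings. The sketch below runs the round trip on an in-memory buffer and converts the values back to float at the end. The astype(float) step is my addition and assumes all data columns are numeric; the names rows, cols and buf are illustrative.

```python
import io

import numpy as np
import pandas as pd

rows = pd.Index(["a", "b", "c", "d"], name="rows")
cols = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                  index=rows, columns=cols)

# Save: move the columns index into a data row labelled with its name.
buf = io.StringIO()
df.T.reset_index().T.to_csv(buf)
buf.seek(0)

# Load: the 'cols' row holds the original column labels.
df_read = pd.read_csv(buf, index_col=0)
df_read = df_read.T.set_index("cols").T

# The values were parsed as text because of the label row, so convert them
# back (assumes every column is numeric).
df_read = df_read.astype(float)

print(df_read.index.name, df_read.columns.name)  # rows cols
```

If the frame holds columns of different types, a single astype(float) will not do, and each column would need its own conversion.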
Answer by ehhh
header=True (for the column names) and index=True (for the index values) in df.to_csv should do the trick. Both are True by default, according to the docs. However, you will have to set the index manually after loading, because it is loaded as a regular column. For this, I have added a name to the index:
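import numpy as np
import pandas as pd
# your code (note: the positional "rows"/"cols" are taken as dtype, not name,
# so no names are set here; newer pandas versions may reject these calls):
I = pd.Index(["a", "b", "c", "d"], "rows")
C = pd.Index(["col0", "col1", "col2"], "cols")
df = pd.DataFrame(data=np.random.rand(4, 3),
                  index=I,
                  columns=C)
# name the index (set on the frame itself, so it does not rely on df sharing I)
df.index.name = 'index'
print("original:")
print(df)
print("")
df.to_csv("~/test.csv", index=True, header=True)
# the index is read back as a regular column, so set it again after loading
df2 = pd.read_csv("~/test.csv").set_index('index')
print("read from disk:")
print(df2)
print("")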
Output:
original:
           col0      col1      col2
index
a      0.455378  0.830872  0.495953
b      0.707146  0.832009  0.112540
c      0.894997  0.156364  0.521047
d      0.775462  0.482554  0.578177

read from disk:
           col0      col1      col2
index
a      0.455378  0.830872  0.495953
b      0.707146  0.832009  0.112540
c      0.894997  0.156364  0.521047
d      0.775462  0.482554  0.578177