Adding meta-information/metadata to pandas DataFrame
Original question: http://stackoverflow.com/questions/14688306/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors on Stack Overflow.
Asked by P3trus
Is it possible to add some meta-information/metadata to a pandas DataFrame?
For example, the instrument's name used to measure the data, the instrument responsible, etc.
One workaround would be to create a column with that information, but it seems wasteful to store a single piece of information in every row!
Accepted answer by unutbu
Sure, like most Python objects, you can attach new attributes to a pandas.DataFrame:
import pandas as pd
df = pd.DataFrame([])
df.instrument_name = 'Binky'
Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join or loc, to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
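A minimal sketch of the caveat above, assuming a small made-up column named val: the attribute lives only on the original object, and a filtered result, which is a new DataFrame, does not carry it.

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3]})  # hypothetical sample data
df.instrument_name = "Binky"           # plain Python attribute on this object

# Boolean filtering with .loc builds a new DataFrame object.
filtered = df.loc[df["val"] > 1]

print(hasattr(df, "instrument_name"))        # the original still has it
print(hasattr(filtered, "instrument_name"))  # the new frame does not
```

If the metadata has to follow the data through such operations, it needs to be re-attached by hand or stored some other way.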
Preserving the metadata in a file is possible. The original answer links to an example of how to store metadata in an HDF5 file.
Answer by Matti John
Not really. Although you could add attributes containing metadata to the DataFrame class as @unutbu mentions, many DataFrame methods return a new DataFrame, so your metadata would be lost. If you need to manipulate your DataFrame, then the best option would be to wrap your metadata and DataFrame in another class. See this discussion on GitHub: https://github.com/pydata/pandas/issues/2485
There is currently an open pull request to add a MetaDataFrame object, which would support metadata better.
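The wrapping idea from this answer could look roughly like the sketch below. The class name AnnotatedFrame, its transform method, and the sample data are all hypothetical and not part of pandas; this is only one possible shape for such a wrapper.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class AnnotatedFrame:
    """Hypothetical wrapper pairing a DataFrame with its metadata."""

    data: pd.DataFrame
    meta: dict = field(default_factory=dict)

    def transform(self, func):
        """Apply func to the DataFrame and carry a copy of the metadata along."""
        return AnnotatedFrame(func(self.data), dict(self.meta))


measurements = AnnotatedFrame(
    pd.DataFrame({"val": [1, 2, 2, 3]}),
    {"instrument_name": "Binky"},
)

# The wrapper, not pandas, is responsible for keeping the metadata.
deduplicated = measurements.transform(lambda d: d.drop_duplicates())
print(deduplicated.meta["instrument_name"])
print(len(deduplicated.data))
```

Because every operation goes through the wrapper, the metadata no longer depends on how pandas builds its result objects.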
Answer by follyroof
Just ran into this issue myself. As of pandas 0.13, DataFrames have a _metadata attribute on them that does persist through functions that return new DataFrames. Also seems to survive serialization just fine (I've only tried json, but I imagine hdf is covered as well).
Answer by choldgraf
Coming pretty late to this, I thought this might be helpful if you need metadata to persist over I/O. There's a relatively new package called h5io that I've been using to accomplish this.
It should let you do a quick read/write from HDF5 for a few common formats, one of them being a dataframe. So you can, for example, put a dataframe in a dictionary and include metadata as fields in the dictionary. E.g.:
import h5io
# my_df is the DataFrame you want to save.
save_dict = dict(data=my_df, name='chris', record_date='1/1/2016')
h5io.write_hdf5('path/to/file.hdf5', save_dict)
in_data = h5io.read_hdf5('path/to/file.hdf5')
df = in_data['data']
name = in_data['name']
etc...
Another option would be to look into a project like xray, which is more complex in some ways, but I think it does let you use metadata and is pretty easy to convert to a DataFrame.
Answer by Dennis Golomazov
As mentioned in other answers and comments, _metadata is not a part of the public API, so it's definitely not a good idea to use it in a production environment. But you may still want to use it in research prototyping and replace it if it stops working. And right now it works with groupby/apply, which is helpful. This is an example (which I couldn't find in other answers):
import pandas as pd
df = pd.DataFrame([1, 2, 2, 3, 3], columns=['val'])
df.my_attribute = "my_value"
# Register the attribute name so pandas copies it to derived frames.
df._metadata.append('my_attribute')
df.groupby('val').apply(lambda group: group.my_attribute)
Output:
val
1 my_value
2 my_value
3 my_value
dtype: object
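For comparison, pandas' documentation on subclassing describes declaring _metadata on a DataFrame subclass rather than appending to it on an instance. The sketch below follows that pattern; the class name InstrumentFrame, the attribute name, and the sample columns are hypothetical.

```python
import pandas as pd


class InstrumentFrame(pd.DataFrame):
    # Names listed here are copied to results by pandas' __finalize__ hook.
    _metadata = ["instrument_name"]

    @property
    def _constructor(self):
        # Make operations return InstrumentFrame instead of a plain DataFrame.
        return InstrumentFrame


frame = InstrumentFrame({"val": [1, 2, 2, 3, 3], "other": [5, 6, 7, 8, 9]})
frame.instrument_name = "Binky"

subset = frame[["val"]]  # column selection returns a new object
print(type(subset).__name__)
print(subset.instrument_name)
```

This keeps the metadata declaration in one place, although which operations propagate it still depends on the pandas version in use.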
Answer by jtwilson
As mentioned by @choldgraf, I have found xarray to be an excellent tool for attaching metadata when comparing data and plotting results between several dataframes.
In my work, we are often comparing the results of several firmware revisions and different test scenarios. Adding this information is as simple as this:
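import pandas as pd
import xarray as xr
# meaningless_test, foo, bar and sc_01 are placeholders for your own file path and values.
df = pd.read_csv(meaningless_test)
metadata = {'fw': foo, 'test_name': bar, 'scenario': sc_01}
ds = xr.Dataset.from_dataframe(df)
ds.attrs = metadata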
Answer by bscan
The top answer of attaching arbitrary attributes to the DataFrame object is good, but if the value is a dictionary, list, or tuple, pandas emits the warning "Pandas doesn't allow columns to be created via a new attribute name". The following solution works for storing arbitrary attributes.
from types import SimpleNamespace
import pandas as pd
df = pd.DataFrame()
df.meta = SimpleNamespace()
df.meta.foo = [1, 2, 3]
Answer by SenAnan
I was having the same issue and used a workaround of creating a new, smaller DF from a dictionary with the metadata:
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
dfMeta = pd.DataFrame.from_dict(meta, orient='index')
This dfMeta can then be saved alongside your original DF, for example in a pickle file.
See "Saving and loading multiple objects in pickle file?" (Lutz's answer) for an excellent explanation of saving and retrieving multiple dataframes using pickle.
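One way to keep the two frames together is to pickle them as a single dictionary. The sketch below is an assumption about how that could be done, not the approach from the linked answer; the sample data and the dictionary keys are made up, and it serializes to bytes in memory so it stays self-contained (pickle.dump and pickle.load with a file object would work the same way).

```python
import pickle

import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3]})  # hypothetical data
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
df_meta = pd.DataFrame.from_dict(meta, orient="index")

# Serialize both frames together as one dictionary.
payload = pickle.dumps({"data": df, "meta": df_meta})

# Restoring gives back both frames under the same keys.
restored = pickle.loads(payload)
print(restored["meta"])
print(restored["data"].equals(df))
```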
Answer by ryanjdillon
As of pandas 1.0, possibly earlier, there is now a DataFrame.attrs property. It is experimental, but this is probably what you'll want in the future.
It is documented in the pandas API reference under DataFrame.attrs.
Trying this out with to_parquet and then read_parquet, it doesn't seem to persist, so be sure to check it against your use case.
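A small sketch of how attrs can be used, with a made-up column and key. It only shows two cases, passing the same object to a function and calling copy(); since the feature is described as experimental, propagation through other operations should be checked rather than assumed.

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3]})  # hypothetical data
df.attrs["instrument_name"] = "Binky"


def describe(frame):
    # The same object is passed in, so its attrs are available here.
    return frame.attrs.get("instrument_name")


print(describe(df))
print(df.copy().attrs)  # copy() carries attrs over to the new frame
```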
Answer by Ayrat Arifullin
I have been looking for a solution and found that a pandas DataFrame has the attrs property:
import pandas as pd
frame = pd.DataFrame()
frame.attrs.update({'your_attribute': 'value'})
frame.attrs['your_attribute']
This attribute stays attached to your frame wherever you pass it, although, as the previous answer notes, it may not survive every I/O round trip.

