Python: Adding meta-information/metadata to a pandas DataFrame

Question

Asked by P3trus

Is it possible to add some meta-information/metadata to a pandas DataFrame?

For example, the instrument's name used to measure the data, the instrument responsible, etc.

One workaround would be to create a column with that information, but it seems wasteful to store a single piece of information in every row!

Answer 1

Accepted answer by unutbu

Sure, like most Python objects, you can attach new attributes to a pandas.DataFrame:

import pandas as pd
df = pd.DataFrame([])
df.instrument_name = 'Binky'

Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join or loc, to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
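A minimal sketch of the caveat above, assuming a small made-up frame with a single `val` column (the column name and values are illustrative, not from the original answer). The attribute lives only on the original object, so a method that builds a new DataFrame returns one without it:

```python
import pandas as pd

# Hypothetical example data; only the attribute handling matters here.
df = pd.DataFrame({"val": [3, 1, 2]})
df.instrument_name = "Binky"  # plain Python attribute on this one object

# sort_values returns a new DataFrame, so the ad hoc attribute is not carried over.
sorted_df = df.sort_values("val")

print(hasattr(df, "instrument_name"))         # still present on the original
print(hasattr(sorted_df, "instrument_name"))  # missing on the result
```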

Preserving the metadata in a file is possible; the original answer links to an example of how to store metadata in an HDF5 file.

Answer 2

Answered by Matti John

Not really. Although you could add attributes containing metadata to a DataFrame as @unutbu mentions, many DataFrame methods return a new DataFrame, so your metadata would be lost. If you need to manipulate your DataFrame, then the best option would be to wrap your metadata and DataFrame in another class. See this discussion on GitHub: https://github.com/pydata/pandas/issues/2485
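One way the wrapping idea could look is sketched below. The class name `AnnotatedFrame`, its `apply` helper, and the sample data are all hypothetical and not taken from the linked discussion; the point is only that the wrapper, not pandas, is responsible for carrying the metadata forward.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass(eq=False)
class AnnotatedFrame:
    """Hypothetical wrapper: a DataFrame plus a metadata dictionary."""

    data: pd.DataFrame
    meta: dict = field(default_factory=dict)

    def apply(self, func):
        """Run func on the DataFrame and re-wrap the result with a copy of the metadata."""
        return AnnotatedFrame(func(self.data), dict(self.meta))


measurements = AnnotatedFrame(
    pd.DataFrame({"val": [1, 2, 2, 3]}),
    {"instrument_name": "Binky"},
)

# The filter produces a new DataFrame; the wrapper keeps the metadata next to it.
filtered = measurements.apply(lambda df: df[df["val"] > 1])
print(filtered.meta["instrument_name"])
print(len(filtered.data))
```

The cost of this design is that every operation has to go through the wrapper (or be re-wrapped by hand), which is the trade-off for not depending on pandas internals.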

There is currently an open pull request to add a MetaDataFrame object, which would support metadata better.

Answer 3

Answered by follyroof

Just ran into this issue myself. As of pandas 0.13, DataFrames have a _metadata attribute on them that does persist through functions that return new DataFrames. Also seems to survive serialization just fine (I've only tried json, but I imagine hdf is covered as well).
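For reference, the pandas guide on extending pandas describes `_metadata` as a list of attribute names declared on a DataFrame subclass, together with a `_constructor` property. The sketch below follows that pattern; the class name `InstrumentFrame` and the sample columns are made up for illustration, and which operations actually propagate the attribute may vary by pandas version.

```python
import pandas as pd


class InstrumentFrame(pd.DataFrame):
    """Hypothetical subclass whose instrument_name is copied to derived frames."""

    # Names listed here are passed on to results by pandas' internal __finalize__ step.
    _metadata = ["instrument_name"]

    @property
    def _constructor(self):
        # Make pandas build InstrumentFrame objects instead of plain DataFrames.
        return InstrumentFrame


df = InstrumentFrame({"val": [1, 2, 2, 3, 3], "other": [0, 0, 0, 0, 0]})
df.instrument_name = "Binky"

subset = df[["val"]]  # column selection returns a new InstrumentFrame
print(type(subset).__name__)
print(subset.instrument_name)
```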

Answer 4

Answered by choldgraf

Coming pretty late to this, I thought this might be helpful if you need metadata to persist over I/O. There's a relatively new package called h5io that I've been using to accomplish this.

It should let you do a quick read/write from HDF5 for a few common formats, one of them being a dataframe. So you can, for example, put a dataframe in a dictionary and include metadata as fields in the dictionary. E.g.:

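import h5io

# my_df is the DataFrame to save; the other keys are metadata fields
save_dict = dict(data=my_df, name='chris', record_date='1/1/2016')
h5io.write_hdf5('path/to/file.hdf5', save_dict)
in_data = h5io.read_hdf5('path/to/file.hdf5')
df = in_data['data']
name = in_data['name']
# ...and so on for the remaining fields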

Another option would be to look into a project like xray (since renamed xarray), which is more complex in some ways, but I think it does let you use metadata and is pretty easy to convert to a DataFrame.

Answer 5

Answered by Dennis Golomazov

As mentioned in other answers and comments, _metadata is not a part of the public API, so it's definitely not a good idea to use it in a production environment. But you may still want to use it in research prototyping and replace it if it stops working. And right now it works with groupby/apply, which is helpful. This is an example (which I couldn't find in other answers):

df = pd.DataFrame([1, 2, 2, 3, 3], columns=['val']) 
df.my_attribute = "my_value"
df._metadata.append('my_attribute')
df.groupby('val').apply(lambda group: group.my_attribute)

Output:

val
1    my_value
2    my_value
3    my_value
dtype: object

Answer 6

Answered by jtwilson

As mentioned by @choldgraf, I have found xarray to be an excellent tool for attaching metadata when comparing data and plotting results between several dataframes.

In my work, we are often comparing the results of several firmware revisions and different test scenarios. Adding this information is as simple as this:

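import pandas as pd
import xarray as xr

df = pd.read_csv(meaningless_test)  # meaningless_test: path to a CSV of test results
metadata = {'fw': foo, 'test_name': bar, 'scenario': sc_01}  # foo, bar, sc_01 are placeholders
ds = xr.Dataset.from_dataframe(df)
ds.attrs = metadata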

Answer 7

Answered by bscan

The top answer's approach of attaching arbitrary attributes to the DataFrame object is good, but if the value is a dictionary, list, or tuple, pandas emits the warning "Pandas doesn't allow columns to be created via a new attribute name". The following solution works for storing arbitrary attributes:

from types import SimpleNamespace
import pandas as pd

df = pd.DataFrame()
df.meta = SimpleNamespace()  # a single non-list-like container for all metadata
df.meta.foo = [1,2,3]

Answer 8

Answered by SenAnan

I was having the same issue and used a workaround of creating a new, smaller DF from a dictionary with the metadata:

    meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
    dfMeta = pd.DataFrame.from_dict(meta, orient='index')

This dfMeta can then be saved alongside your original DF, for example in a pickle file.

See "Saving and loading multiple objects in pickle file?" (Lutz's answer) for an excellent explanation of saving and retrieving multiple dataframes using pickle.
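As a rough illustration of saving metadata next to a DataFrame with pickle, the sketch below bundles both into one dictionary and writes it to a temporary file. This is one simple variant and not necessarily the approach in the linked answer; the file name, the dictionary keys `data` and `meta`, and the sample frame are assumptions made for the example.

```python
import os
import pickle
import tempfile

import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3]})  # hypothetical data
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}

# Write to a throwaway location so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "frame_with_meta.pkl")

# Store the DataFrame and its metadata together in a single pickled dict.
with open(path, "wb") as fh:
    pickle.dump({"data": df, "meta": meta}, fh)

# Load both back in one call.
with open(path, "rb") as fh:
    loaded = pickle.load(fh)

print(loaded["meta"]["name"])
print(loaded["data"].equals(df))
```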

Answer 9

Answered by ryanjdillon

As of pandas 1.0, possibly earlier, there is now a DataFrame.attrs property. It is experimental, but this is probably what you'll want in the future.

It is described in the pandas documentation for DataFrame.attrs.

Trying this out with to_parquet and then reading the file back with read_parquet, it doesn't seem to persist, so be sure you check that out with your use case.
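A small in-memory sketch of attrs, assuming a made-up frame and key name: the dictionary is set on one frame and is still present on a copy. Whether it survives a particular file format, such as the Parquet round trip mentioned above, is a separate question that this example does not exercise.

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3]})  # hypothetical data
df.attrs["instrument_name"] = "Binky"  # attrs is a plain dict attached to the frame

# copy() returns a new DataFrame that carries the attrs dictionary along.
copied = df.copy()
print(copied.attrs)
```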

Answer 10

Answered by Ayrat Arifullin

I have been looking for a solution and found that a pandas DataFrame has an attrs property:

import pandas as pd

frame = pd.DataFrame()
frame.attrs.update({'your_attribute': 'value'})
frame.attrs['your_attribute']  # returns 'value'

The attrs dictionary stays attached to the frame object wherever you pass it, although, since attrs is experimental, it may not survive every operation or file format.
