Saving in a file an array or DataFrame together with other information

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/49740190/
Asked by Pearly Spencer
The statistical software Stata allows short text snippets to be saved within a dataset. This is accomplished using notes and/or characteristics.
This is a feature of great value to me as it allows me to save a variety of information, ranging from reminders and to-do lists to information about how I generated the data, or even what the estimation method for a particular variable was.
I am now trying to replicate similar functionality in Python 3.6. So far, I have looked online and consulted a number of posts, which however do not exactly address what I want to do.
A few reference posts include:
What is the difference between save a pandas dataframe to pickle and to csv?
What is the fastest way to upload a big csv file in notebook to work with python pandas?
For a small NumPy array, I have concluded that a combination of the function numpy.savez() and a dictionary can adequately store all relevant information in a single file.
For example:
import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
d = {"first": 1, "second": "two", "third": 3}

# The filename must be a string; the dict is wrapped in a 0-d object array
np.savez('whatever_name.npz', a=a, d=d)

# allow_pickle=True is needed to load object arrays in recent NumPy versions
data = np.load('whatever_name.npz', allow_pickle=True)
arr = data['a']
dic = data['d'].tolist()  # unwrap the 0-d object array back into a dict
However, the question remains:
Are there better ways to potentially incorporate other pieces of information in a file containing a NumPy array or a (large) Pandas DataFrame?
I am particularly interested in hearing about the particular pros and cons of any suggestions you may have, with examples. The fewer dependencies, the better.
Accepted answer by jpp
There are many options. I will discuss only HDF5, because I have experience using this format.
Advantages: Portable (can be read outside of Python), native compression, out-of-memory capabilities, metadata support.
Disadvantages: Reliance on a single low-level C API; risk that corruption affects all data, since everything lives in one file; deleting data does not reduce the file size automatically.
In my experience, for performance and portability, avoid pyTables / HDFStore for storing numeric data. You can instead use the intuitive interface provided by h5py.
Store an array
import h5py, numpy as np

arr = np.random.randint(0, 10, (1000, 1000))
f = h5py.File('file.h5', 'w', libver='latest')  # use 'latest' for performance
dset = f.create_dataset('array', shape=(1000, 1000), data=arr, chunks=(100, 100),
                        compression='gzip', compression_opts=9)
Compression & chunking
There are many compression choices, e.g. blosc and lzf are good choices for compression and decompression performance respectively. Note gzip is native; other compression filters may not ship by default with your HDF5 installation.
Chunking is another option which, when aligned with how you read data out-of-memory, can significantly improve performance.
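A small sketch of what that looks like, reusing dset, f and arr from above (the slice bounds are illustrative):

# A slice only touches the chunks it overlaps, so reads aligned to the
# (100, 100) chunk grid decompress no more data than necessary
block = dset[0:100, 0:100]   # exactly one chunk read and decompressed
strip = dset[:, 300:400]     # one column of chunks

# lzf ships with h5py itself, so it is always available there
fast = f.create_dataset('array_lzf', data=arr, chunks=(100, 100),
                        compression='lzf')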
Add some attributes
dset.attrs['Description'] = 'Some text snippet'
dset.attrs['RowIndexArray'] = np.arange(1000)
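Reading them back is symmetric (a small sketch, reusing dset from above):

print(dset.attrs['Description'])          # 'Some text snippet'
row_index = dset.attrs['RowIndexArray']   # returned as a NumPy array
print(dict(dset.attrs))                   # all attributes at once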
Store a dictionary
for k, v in d.items():
    f.create_dataset('dictgroup/' + str(k), data=v)
Out-of-memory access
dictionary = f['dictgroup']
res = dictionary['my_key']
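Note that indexing the group returns lazy dataset objects, not values. A small sketch for pulling the whole dictionary back into memory ([()] reads a dataset's value; depending on your h5py version, string values may come back as bytes):

# Iterate over the group's keys and materialize each value
d_loaded = {k: f['dictgroup'][k][()] for k in f['dictgroup']}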
There is no substitute for reading the h5py documentation, which exposes most of the C API, but you should see from the above that there is a significant amount of flexibility.
Answered by Christian
A practical way could be to embed metadata directly inside the NumPy array. The advantage is that, as you'd like, there's no extra dependency and it's very simple to use in the code. However, this doesn't fully answer your question, because you still need a mechanism to save the data, and I'd recommend using jpp's solution based on HDF5.
To include metadata in an ndarray, there is an example in the documentation. You basically have to subclass ndarray and add a field info or metadata or whatever.
It would give (code from the link above):
import numpy as np

class ArrayWithInfo(np.ndarray):
    def __new__(cls, input_array, info=None):
        # Input array is an already formed ndarray instance
        # We first cast to be our class type
        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.info = info
        # Finally, we must return the newly created object:
        return obj

    def __array_finalize__(self, obj):
        # see InfoArray.__array_finalize__ for comments
        if obj is None: return
        self.info = getattr(obj, 'info', None)
To save the data through numpy, you'd need to overload the write function or use another solution.
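One simple alternative to overloading anything (a sketch, not part of the original answer): persist the raw array and its info attribute side by side with np.savez, and rewrap on load:

x = ArrayWithInfo(np.arange(10), info='estimated via OLS')

# Persist the plain array together with its metadata string
np.savez('with_info.npz', data=np.asarray(x), info=np.array(x.info))

loaded = np.load('with_info.npz')
y = ArrayWithInfo(loaded['data'], info=str(loaded['info']))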
Answered by tnknepp
I agree with jpp that HDF5 storage is a good option here. The difference between his solution and mine is that mine uses Pandas dataframes instead of NumPy arrays. I prefer the dataframe since it allows mixed types, multi-level indexing (even datetime indexing, which is VERY important for my work), and column labeling, which helps me remember how different datasets are organized. Also, Pandas provides a slew of built-in functionalities (much like NumPy). Another benefit of using Pandas is that it has an HDF creator built in (i.e. pandas.DataFrame.to_hdf), which I find convenient.
When storing the dataframe to h5 you have the option of storing a dictionary of metadata as well, which can be your notes to self, or actual metadata that does not need to be stored in the dataframe (I use this for setting flags as well, e.g. {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}, etc.). In this regard, there is no difference between using a numpy array and a dataframe. For the full solution see my original question and solution here.
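A sketch of that pattern (names are illustrative; get_storer exposes the underlying PyTables node, whose attrs can hold a pickled dict):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
metadata = {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}

with pd.HDFStore('data.h5') as store:
    store.put('mydata', df)
    # attach the dict to the stored object's attributes
    store.get_storer('mydata').attrs.metadata = metadata

with pd.HDFStore('data.h5') as store:
    df2 = store['mydata']
    meta2 = store.get_storer('mydata').attrs.metadata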
Answered by Darren Brien
jpp's answer is pretty comprehensive; I just wanted to mention that as of pandas v0.22, parquet is a very convenient and fast option with almost no drawbacks vs csv (except perhaps the coffee break).
At the time of writing you'll also need to:
pip install pyarrow
In terms of adding information, you have the metadata, which is attached to the data:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.normal(size=(1000, 10)))
tab = pa.Table.from_pandas(df)
tab = tab.replace_schema_metadata({'here' : 'it is'})
pq.write_table(tab, 'where_is_it.parq')
pq.read_table('where_is_it.parq')
Pyarrow table
0: double
1: double
2: double
3: double
4: double
5: double
6: double
7: double
8: double
9: double
__index_level_0__: int64
metadata
--------
{b'here': b'it is'}
To get this back to pandas:
tab.to_pandas()
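The custom metadata survives the round trip; a small sketch of reading it back (note that keys and values come back as bytes; pq.read_schema is available in recent pyarrow versions):

meta = pq.read_table('where_is_it.parq').schema.metadata
print(meta[b'here'])                 # b'it is'

# or inspect just the schema without reading the data pages
print(pq.read_schema('where_is_it.parq').metadata[b'here'])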
Answered by floydn
You stated the reasons for this question as:
... it allows me to save a variety of information, ranging from reminders and to-do lists, to information about how I generated the data, or even what the estimation method for a particular variable was.
May I suggest a different paradigm than that offered by Stata? The notes and characteristics seem to be very limited and confined to just text. Instead, you should use Jupyter Notebook for your research and data analysis projects. It provides such a rich environment to document your workflow and capture details, thoughts and ideas as you are doing your analysis and research. It can easily be shared, and it's presentation-ready.
Here is a gallery of interesting Jupyter Notebooks across many industries and disciplines to showcase the many features and use cases of notebooks. It may expand your horizons beyond trying to devise a way to tag simple snippets of text to your data.
Answered by WillMonge
It's an interesting question, although very open-ended I think.
Text Snippets
For text snippets that have literal notes (as in, not code and not data), I really don't know what your use case is, but I don't see why I would deviate from using the usual with open() as f: ...
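For completeness, a minimal sketch of that approach (the filename and contents are arbitrary):

# Keep free-form notes in a plain text file next to the data
with open('dataset_notes.txt', 'w') as f:
    f.write('TODO: re-check outliers in column 3\n')
    f.write('Variable y was estimated by OLS on the 2017 sample\n')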
Small collections of various data pieces
Sure, your npz works. Actually what you are doing is very similar to creating a dictionary with everything you want to save and pickling that dictionary.
See here for a discussion of the differences between pickle and npz (but mainly, npz is optimized for numpy arrays).
Personally, I'd say if you are not storing NumPy arrays I would use pickle, and even implement a quick MyNotes class that is basically a dictionary to save stuff in, with some additional functionality you may want.
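A possible shape for such a class (purely illustrative; the class and file names are made up):

import pickle

class MyNotes(dict):
    """A plain dict of notes with save/load convenience methods."""
    def save(self, path):
        with open(path, 'wb') as fh:
            pickle.dump(self, fh)

    @staticmethod
    def load(path):
        with open(path, 'rb') as fh:
            return pickle.load(fh)

notes = MyNotes(todo='rerun with 2018 data', method='OLS')
notes.save('notes.pkl')
restored = MyNotes.load('notes.pkl')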
Collection of large objects
For really big np.arrays or dataframes, I have used the HDF5 format before. The good thing is that it is already built into pandas and you can directly call df.to_hdf(). It does need pytables underneath (installation should be fairly painless with pip or conda) but using pytables directly can be a much bigger pain.
Again, this idea is very similar: you are creating an HDFStore, which is pretty much a big dictionary in which you can store (almost any) objects. The benefit is that the format utilizes space in a smarter way by leveraging repetition of similar values. When I was using it to store some ~2GB dataframes, it was able to reduce it by almost a full order of magnitude (~250MB).
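As a hedged illustration of that dictionary-like usage (names are arbitrary; complevel/complib turn on the transparent compression mentioned above):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 5))

# complevel/complib enable transparent compression of everything in the store
with pd.HDFStore('store.h5', complevel=9, complib='blosc') as store:
    store['frames/experiment1'] = df      # dictionary-style assignment
    back = store['frames/experiment1']    # and dictionary-style retrieval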
One last player: feather
Feather is a project created by Wes McKinney and Hadley Wickham on top of the Apache Arrow framework, to persist data in a binary format that is language agnostic (and therefore you can read it from both R and Python). However, it is still under development, and last time I checked they didn't encourage using it for long-term storage (since the specification may change in future versions), but rather just for communication between R and Python.
They both just launched Ursalabs, literally just weeks ago, which will continue growing this and similar initiatives.
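For reference, a minimal round trip at the pandas level (a sketch; it requires pyarrow, or the older feather-format package, to be installed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5), 'b': list('abcde')})
df.to_feather('frame.feather')     # write the language-agnostic binary file
df2 = pd.read_feather('frame.feather')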