pandas: How to specify metadata for dask.dataframe

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/39265396/

Date: 2020-09-14 01:55:52  Source: igfitidea

How to specify metadata for dask.dataframe

Tags: python, pandas, dask

Asked by Someone

The docs provide good examples of how metadata can be provided. However, I still feel unsure when it comes to picking the right dtypes for my dataframe.

  • Could I do something like meta={'x': int, 'y': float, 'z': float} instead of meta={'x': 'i8', 'y': 'f8', 'z': 'f8'}?
  • Could somebody point me to a list of possible values like 'i8'? What dtypes exist?
  • How can I specify a column that contains arbitrary objects? How can I specify a column that contains only instances of one class?

Accepted answer by sim

The available basic data types are the ones offered through numpy. Have a look at the documentation for a list.

Not included in this set are datetime formats (e.g. datetime64), for which additional information can be found in the pandas and numpy documentation.

The meta argument for dask dataframes usually expects an empty pandas DataFrame holding definitions for columns, indices and dtypes.

One way to construct such a DataFrame is:

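import pandas as pd
import numpy as np

# Empty frame: only the column names and dtypes matter for meta
meta = pd.DataFrame(columns=['a', 'b', 'c'])
meta['a'] = meta['a'].astype(np.int64)
# datetime64 needs an explicit unit in recent pandas versions
meta['b'] = meta['b'].astype('datetime64[ns]')
# column 'c' keeps the default object dtype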

There is also a way to pass a dtype to the constructor of the pandas DataFrame; however, I am not sure how to specify one for each individual column that way. As you can see, it is possible to provide not only the "name" of a datatype, but also the actual numpy dtype.
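
As a hedged sketch (not part of the original answer), per-column dtypes can be set in one step either by building the empty frame from empty Series or by calling astype with a mapping. The column names and dtypes below are purely illustrative.

```python
import pandas as pd

# Option 1: one empty Series per column, each with its own dtype
meta = pd.DataFrame({
    'a': pd.Series(dtype='int64'),
    'b': pd.Series(dtype='datetime64[ns]'),
    'c': pd.Series(dtype='float64'),
})

# Option 2: start from untyped columns and cast them with a mapping
meta2 = pd.DataFrame(columns=['a', 'b', 'c']).astype(
    {'a': 'int64', 'b': 'datetime64[ns]', 'c': 'float64'}
)

print(meta.dtypes)
```

Both frames are empty and carry the same column names and dtypes, which is the information meta is meant to convey.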

Regarding your last question, the datatype you are looking for is "object". For example:

import pandas as pd

class Foo:
    def __init__(self, foo):
        self.bar = foo

df = pd.DataFrame(data=[Foo(1), Foo(2)], columns=['a'], dtype='object')
df.a
# 0    <__main__.Foo object at 0x00000000058AC550>
# 1    <__main__.Foo object at 0x00000000058AC358>
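
NumPy has no dtype that restricts a column to instances of one class, so a column holding only Foo objects still reports plain object dtype. A minimal sketch, reusing a Foo class like the one above; the column names only_foo and mixed are made up for illustration.

```python
import pandas as pd

class Foo:
    def __init__(self, foo):
        self.bar = foo

# One column of Foo instances, one column of mixed Python objects
df = pd.DataFrame({
    'only_foo': [Foo(1), Foo(2)],
    'mixed': [Foo(3), 3.5],
})

# Both columns report the same dtype: object
print(df.dtypes)

# A matching empty meta frame therefore uses 'object' for both columns
meta = pd.DataFrame({
    'only_foo': pd.Series(dtype='object'),
    'mixed': pd.Series(dtype='object'),
})
```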

Answer by MRocklin

Both Dask.dataframe and Pandas use NumPy dtypes. In particular, you can use anything that you can pass to np.dtype. This includes the following:

  1. NumPy dtype objects, like np.float64
  2. Python type objects, like float
  3. NumPy dtype strings, like 'f8'

Here is a more extensive list taken from the NumPy docs: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#specifying-and-constructing-data-types
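
The three spellings listed above can be checked for equivalence with a short sketch like the following. The column names x, y and z are borrowed from the question and are otherwise arbitrary.

```python
import numpy as np
import pandas as pd

# Three spellings of the same 64-bit float dtype
as_object = np.dtype(np.float64)
as_python_type = np.dtype(float)
as_string = np.dtype('f8')
print(as_object == as_python_type == as_string)

# pandas accepts the same spellings when building an empty meta frame
meta = pd.DataFrame({
    'x': pd.Series(dtype='i8'),
    'y': pd.Series(dtype=float),
    'z': pd.Series(dtype=np.float64),
})
print(meta.dtypes)
```

One caveat: the Python type int maps to NumPy's default integer, which was 32-bit on Windows in older NumPy releases, so 'i8' or np.int64 is the safer choice when a 64-bit integer column is required.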
