Python: How to set dtypes by column in a pandas DataFrame

Question

Asked by Chris

I want to bring some data into a pandas DataFrame and I want to assign dtypes for each column on import. I want to be able to do this for larger datasets with many different columns, but, as an example:

myarray = np.random.randint(0,5,size=(2,2))
mydf = pd.DataFrame(myarray,columns=['a','b'], dtype=[float,int])
mydf.dtypes

results in:

TypeError: data type not understood

I tried a few other methods such as:

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': int})

TypeError: object of type 'type' has no len()

If I pass dtype=(float,int), it applies a float format to both columns.

In the end I would like to just be able to pass it a list of datatypes the same way I can pass it a list of column names.

Answer 1

Accepted answer by user545424

As of pandas version 0.24.2 (the current stable release at the time of writing), it is not possible to pass an explicit list of datatypes to the DataFrame constructor. As the docs state:

dtype : dtype, default None

    Data type to force. Only a single dtype is allowed. If None, infer

However, the DataFrame class does have a class method, from_records, that converts a NumPy structured array to a DataFrame, so you can do:

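>>> import numpy as np
>>> import pandas as pd
>>> myarray = np.random.randint(0,5,size=(2,2))
>>> # each row becomes a tuple so NumPy builds one record per row
>>> record = np.array(list(map(tuple,myarray)),dtype=[('a',float),('b',int)])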
>>> mydf = pd.DataFrame.from_records(record)
>>> mydf.dtypes
a    float64
b      int64
dtype: object

Answer 2

Answered by mattexx

I just ran into this, and the pandas issue is still open, so I'm posting my workaround. Assuming df is my DataFrame and dtype is a dict mapping column names to types:

for k, v in dtype.items():
    df[k] = df[k].astype(v)

(Note: use dtype.iteritems() in Python 2.)
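
The same loop can usually be collapsed into a single call, because DataFrame.astype also accepts a mapping of column names to dtypes. The sketch below pairs a list of column names with a list of dtypes, which is close to what the question asks for. The names columns and dtypes are made up for illustration, and the conversion still happens after construction rather than inside the constructor.

```python
import numpy as np
import pandas as pd

myarray = np.random.randint(0, 5, size=(2, 2))
columns = ['a', 'b']
dtypes = [float, int]  # hypothetical list, one dtype per column

mydf = pd.DataFrame(myarray, columns=columns)
# astype accepts a {column: dtype} mapping and returns a new DataFrame
mydf = mydf.astype(dict(zip(columns, dtypes)))
print(mydf.dtypes)
```

Columns left out of the mapping keep whatever dtype they already had.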

For reference:

The list of allowed data types (NumPy dtypes): https://docs.scipy.org/doc/numpy-1.12.0/reference/arrays.dtypes.html
Pandas also supports some other types. E.g., category: http://pandas.pydata.org/pandas-docs/stable/categorical.html
The relevant GitHub issue: https://github.com/pandas-dev/pandas/issues/9287

Answer 3

Answered by DBCerigo

You may want to try passing a dictionary of Series objects to the DataFrame constructor. It gives you much more specific control over the creation, and it should be clearer what's going on. A template version (data1 can be an array, etc.):

df = pd.DataFrame({'column1':pd.Series(data1, dtype='type1'),
                   'column2':pd.Series(data2, dtype='type2')})

An example with data:

df = pd.DataFrame({'A':pd.Series([1,2,3], dtype='int'),
                   'B':pd.Series([7,8,9], dtype='float')})

print (df)
   A  B
0  1  7.0
1  2  8.0
2  3  9.0

print (df.dtypes)
A     int32
B    float64
dtype: object
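
For data that starts out as a 2-D array, as in the question, the same idea can be written as a dict comprehension. This is only a sketch: spec is a hypothetical mapping of column names to dtype strings, and each array column is wrapped in its own Series so that every column gets its own dtype.

```python
import numpy as np
import pandas as pd

myarray = np.random.randint(0, 5, size=(2, 2))
spec = {'a': 'float64', 'b': 'int64'}  # hypothetical column -> dtype mapping

# Column i of the array becomes the Series for the i-th name in spec
df = pd.DataFrame({name: pd.Series(myarray[:, i], dtype=dtype)
                   for i, (name, dtype) in enumerate(spec.items())})
print(df.dtypes)
```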

Answer 4

Answered by user10983117

When working with data types, they should be passed as strings.

For example, the latter method you tried should be modified to

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': 'int'})

instead of

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': int}).

The dtype (int, float, etc.) should be given as a string.

Alternatively, if you don't want to pass strings, import numpy as np and use mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': np.int}). Note that the np.int alias was removed in NumPy 1.24; on newer NumPy versions, use int or np.int64 instead.
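
Going by the constructor documentation quoted in the accepted answer ("Only a single dtype is allowed"), the dict form may still raise an error in the DataFrame constructor even with string dtypes. If the data is being imported from a file, as the question mentions, pd.read_csv does accept a per-column dict through its dtype parameter. A minimal sketch, using made-up CSV text in place of a real file:

```python
import io
import pandas as pd

csv_text = "a,b\n1,2\n3,4\n"  # made-up sample data standing in for a file

# read_csv's dtype parameter accepts a {column: dtype} mapping
mydf = pd.read_csv(io.StringIO(csv_text), dtype={'a': 'float64', 'b': 'int64'})
print(mydf.dtypes)
```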
