Original question: http://stackoverflow.com/questions/25610592/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How to set dtypes by column in pandas DataFrame
Asked by Chris
I want to bring some data into a pandas DataFrame and I want to assign dtypes for each column on import. I want to be able to do this for larger datasets with many different columns, but, as an example:
import numpy as np
import pandas as pd
myarray = np.random.randint(0,5,size=(2,2))
mydf = pd.DataFrame(myarray,columns=['a','b'], dtype=[float,int])
mydf.dtypes
results in:
TypeError: data type not understood
I tried a few other methods such as:
mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': int})
TypeError: object of type 'type' has no len()
If I pass dtype=(float,int), it applies a float format to both columns.
In the end I would like to just be able to pass it a list of datatypes the same way I can pass it a list of column names.
Accepted answer by user545424
As of pandas version 0.24.2 (the current stable release at the time of writing), it is not possible to pass an explicit list of datatypes to the DataFrame constructor, as the docs state:
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer
However, the DataFrame class does have a class method, from_records, that converts a NumPy structured array to a DataFrame, so you can do:
>>> myarray = np.random.randint(0,5,size=(2,2))
>>> record = np.array(list(map(tuple,myarray)),dtype=[('a',float),('b',int)])
>>> mydf = pd.DataFrame.from_records(record)
>>> mydf.dtypes
a float64
b int64
dtype: object
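The same idea can be driven from a list of column names and a parallel list of types, which is what the question asks for. The following is a minimal sketch rather than part of the original answer: the helper name frame_from_records_spec is hypothetical, and it assumes a 2-D NumPy array with one column per name.

```python
import numpy as np
import pandas as pd


def frame_from_records_spec(array, names, types):
    """Hypothetical helper: build a DataFrame with one dtype per column."""
    # Pair each column name with its type to form a structured dtype.
    spec = np.dtype(list(zip(names, types)))
    # A structured array is built from a list of row tuples.
    rows = [tuple(row) for row in array]
    record = np.array(rows, dtype=spec)
    return pd.DataFrame.from_records(record)


myarray = np.random.randint(0, 5, size=(2, 2))
mydf = frame_from_records_spec(myarray, ["a", "b"], [np.float64, np.int64])
print(mydf.dtypes)
```

Explicit widths such as np.float64 and np.int64 are used here so the resulting dtypes do not depend on the platform's default integer size.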
Answer by mattexx
I just ran into this, and the pandas issue is still open, so I'm posting my workaround. Assuming df is my DataFrame and dtype is a dict mapping column names to types:
for k, v in dtype.items():
    df[k] = df[k].astype(v)
(Note: use dtype.iteritems() in Python 2.)
For reference:
- The list of allowed data types (NumPy dtypes): https://docs.scipy.org/doc/numpy-1.12.0/reference/arrays.dtypes.html
- Pandas also supports some other types, e.g. category: http://pandas.pydata.org/pandas-docs/stable/categorical.html
- The relevant GitHub issue: https://github.com/pandas-dev/pandas/issues/9287
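As an illustration of the loop in this answer, here is a small self-contained sketch that mixes a NumPy dtype with the pandas category type mentioned in the references. The helper name apply_dtypes and the sample data are assumptions made for the example; columns not listed in the mapping are left untouched.

```python
import pandas as pd


def apply_dtypes(df, dtype):
    """Hypothetical helper: cast each column named in `dtype`, in place."""
    for k, v in dtype.items():
        df[k] = df[k].astype(v)
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"], "c": [4, 5, 6]})
df = apply_dtypes(df, {"a": "float64", "b": "category"})
print(df.dtypes)
```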
Answer by DBCerigo
You may want to try passing a dictionary of Series objects to the DataFrame constructor. It gives you much more specific control over the creation, and it should be clearer what is going on. A template version (data1 can be an array, etc.):
df = pd.DataFrame({'column1':pd.Series(data1, dtype='type1'),
                   'column2':pd.Series(data2, dtype='type2')})
An example with data:
df = pd.DataFrame({'A':pd.Series([1,2,3], dtype='int'),
                   'B':pd.Series([7,8,9], dtype='float')})
print(df)
   A    B
0  1  7.0
1  2  8.0
2  3  9.0
print(df.dtypes)
A      int32
B    float64
dtype: object
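The Series-per-column approach can also take a list of dtypes alongside a list of column names, which is the interface the question asks for. This is only a sketch under stated assumptions: frame_from_columns is a hypothetical helper name, and the input is assumed to be a 2-D NumPy array with one column per name.

```python
import numpy as np
import pandas as pd


def frame_from_columns(array, names, types):
    """Hypothetical helper: one Series per column, each with its own dtype."""
    columns = {}
    for i, (name, t) in enumerate(zip(names, types)):
        # Slice out column i and give it its own dtype.
        columns[name] = pd.Series(array[:, i], dtype=t)
    return pd.DataFrame(columns)


myarray = np.array([[1, 7], [2, 8], [3, 9]])
df = frame_from_columns(myarray, ["A", "B"], ["int64", "float64"])
print(df.dtypes)
```

Spelling out int64 rather than int avoids the platform-dependent integer width seen in the output above.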
Answer by user10983117
When working with data types, pass them as strings.
For example, the latter method you tried should be modified to
mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': 'int'})
instead of
mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': int})
The dtype (int, float, etc.) should be given as a string.
Alternatively, if you don't want to pass strings, import numpy as np and use
mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': np.int64})
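The constructor documentation quoted in the accepted answer allows only a single dtype, so a dict passed to dtype may be rejected whether its values are strings or types. A minimal sketch of a different route, not taken from this answer: build the frame first and then call DataFrame.astype, which does accept a dict mapping column names to dtypes. The choice of float64 for column a is just for illustration.

```python
import numpy as np
import pandas as pd

myarray = np.random.randint(0, 5, size=(2, 2))

# Build the frame first, then convert only column 'a'.
from_string = pd.DataFrame(myarray, columns=["a", "b"]).astype({"a": "float64"})
from_type = pd.DataFrame(myarray, columns=["a", "b"]).astype({"a": np.float64})

print(from_string.dtypes)
# Compare the per-column dtypes produced by the string and the type spelling.
print((from_string.dtypes == from_type.dtypes).all())
```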

