Python Pandas inferring column datatypes
Source: http://stackoverflow.com/questions/35003138/
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Calamari
I am reading JSON files into dataframes. A dataframe might have some string (object) columns, some numeric (int64 and/or float64) columns, and some datetime columns. When the data is read in, the datatype is often incorrect (i.e. datetime, int and float values are often stored as "object" type). I want to report on this possibility (i.e. a column is in the dataframe as "object" (string), but it is actually a "datetime").
The problem I have is that when I use pd.to_numeric and pd.to_datetime, they will both evaluate and try to convert the column, and many times the result ends up depending on which of the two I call last... (I was going to use convert_objects(), which works but is deprecated, so I wanted a better option.)
The code I am using to evaluate the dataframe column is below (I realize a lot of it is redundant, but I have written it this way for readability):
try:
    inferred_type = pd.to_datetime(df[Field_Name]).dtype
    if inferred_type == "datetime64[ns]":
        inferred_type = "DateTime"
except:
    pass
try:
    inferred_type = pd.to_numeric(df[Field_Name]).dtype
    if inferred_type == int:
        inferred_type = "Integer"
    if inferred_type == float:
        inferred_type = "Float"
except:
    pass
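The order dependence described in the question can be reproduced with a small made-up column. The Series below is an assumption for illustration, not data from the question. Both converters accept an object column of integers, so whichever call runs last decides the label.

```python
import pandas as pd

# Hypothetical object-dtype column that actually holds integers.
col = pd.Series([1, 2, 3], dtype=object)

# Integers are read as nanoseconds since the epoch, so this succeeds.
as_datetime = pd.to_datetime(col)

# The same values also parse as plain integers, so this succeeds too.
as_numeric = pd.to_numeric(col)

print(as_datetime.dtype)
print(as_numeric.dtype)
```

Because neither call raises, a pair of independent try/except blocks cannot tell the two cases apart; some explicit priority between the checks is needed.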
Answered by PabTorre
I came across the same problem of having to figure out column types for incoming data where the type is not known beforehand... from a db read in my case. I couldn't find a good answer here on SO, or by reviewing the pandas source code. I solved it using this function:
import pandas as pd

def _get_col_dtype(col):
    """
    Infer datatype of a pandas column, process only if the column dtype is object.
    input: col: a pandas Series representing a df column.
    """
    if col.dtype == "object":
        # try datetime first
        try:
            col_new = pd.to_datetime(col.dropna().unique())
            return col_new.dtype
        except Exception:
            # then numeric
            try:
                col_new = pd.to_numeric(col.dropna().unique())
                return col_new.dtype
            except Exception:
                # finally timedelta
                try:
                    col_new = pd.to_timedelta(col.dropna().unique())
                    return col_new.dtype
                except Exception:
                    return "object"
    else:
        return col.dtype
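As a rough sketch of the same idea applied to the report the question asks for, the variant below checks numeric before datetime, so columns of digit strings are not claimed by the datetime parser. The helper name `infer_label`, the label strings, and the sample frame are all assumptions made for illustration; this is not the answer's function.

```python
import warnings

import pandas as pd


def infer_label(col):
    """Return a coarse label for a column (hypothetical helper, numeric checked first)."""
    if col.dtype != "object":
        return str(col.dtype)
    values = col.dropna().unique()
    if len(values) == 0:
        return "String"
    try:
        converted = pd.to_numeric(values)
        if pd.api.types.is_integer_dtype(converted):
            return "Integer"
        if pd.api.types.is_float_dtype(converted):
            return "Float"
    except (ValueError, TypeError):
        pass
    try:
        with warnings.catch_warnings():
            # Silence the format-guessing warning pandas may emit for odd strings.
            warnings.simplefilter("ignore")
            pd.to_datetime(values)
        return "DateTime"
    except Exception:
        return "String"


# Made-up sample data: every column is stored as object.
df = pd.DataFrame(
    {
        "n": ["1", "2", "3"],
        "x": ["1.5", "2.0", "3.25"],
        "d": ["2020-01-01", "2020-02-15", "2020-03-31"],
        "s": ["foo", "bar", "baz"],
    },
    dtype=object,
)
report = {name: infer_label(df[name]) for name in df.columns}
print(report)
```

Checking numeric first is a design choice, not a rule: it suits data where digit strings are counts or amounts, and would be wrong for data that stores timestamps as epoch integers.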
Answered by BeigeBruceWayne
Deep in the Pandas API there actually is a function that does a half decent job.
import pandas as pd
infer_type = lambda x: pd.api.types.infer_dtype(x, skipna=True)
df.apply(infer_type, axis=0)
# DataFrame with column names & new types
df_types = pd.DataFrame(df.apply(pd.api.types.infer_dtype, axis=0)).reset_index().rename(columns={'index': 'column', 0: 'type'})
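To make the labels concrete, here is a small sketch of what `pd.api.types.infer_dtype` reports for a few object columns. The frame and its column names are invented for illustration. Note that the function inspects the Python objects it finds and does not parse strings, so digits stored as text are still labelled as strings.

```python
import datetime as dt

import pandas as pd

# Made-up frame in which every column is stored as object.
df = pd.DataFrame(
    {
        "ints": pd.Series([1, 2, 3], dtype=object),
        "floats": pd.Series([1.5, 2.5, 3.0], dtype=object),
        "numeric_text": pd.Series(["1", "2", "3"], dtype=object),
        "stamps": pd.Series(
            [dt.datetime(2020, 1, 1), dt.datetime(2020, 1, 2), dt.datetime(2020, 1, 3)],
            dtype=object,
        ),
    }
)

# One label per column, indexed by column name.
labels = df.apply(lambda s: pd.api.types.infer_dtype(s, skipna=True), axis=0)
print(labels)
```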
Since "the inference rules are the same as during normal Series/DataFrame construction", consider to_numeric for ints/floats, e.g.:
df['amount'] = pd.to_numeric(df['amount'], errors='ignore')
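As far as I know, recent pandas releases (2.2 and later) deprecate `errors='ignore'` and recommend catching the exception yourself. A minimal sketch of that pattern follows; the helper name `to_numeric_or_keep` and the sample Series are assumptions for illustration.

```python
import pandas as pd


def to_numeric_or_keep(col):
    """Return a numeric version of col if every value parses, else col unchanged."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col


# Made-up sample columns stored as object.
amounts = pd.Series(["10", "20.5", "30"], dtype=object)
names = pd.Series(["alice", "bob"], dtype=object)

converted = to_numeric_or_keep(amounts)
untouched = to_numeric_or_keep(names)
print(converted.dtype, untouched.dtype)
```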
Answered by Daniel H
One solution to get pandas to infer dtypes is to write the data to a CSV using StringIO, and then read it back.
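A minimal sketch of that round trip, using an in-memory buffer and a made-up frame. By default `read_csv` infers numeric columns but leaves date-like text alone unless `parse_dates` is given, so the `when` column here is named explicitly.

```python
import io

import pandas as pd

# Made-up frame in which every column is stored as object.
df = pd.DataFrame(
    {
        "n": ["1", "2", "3"],
        "x": ["1.5", "2.5", "3.5"],
        "when": ["2020-01-01", "2020-01-02", "2020-01-03"],
    },
    dtype=object,
)

# Write to an in-memory CSV, rewind, and let read_csv infer the types.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer, parse_dates=["when"])
print(roundtrip.dtypes)
```

The trade-off is that the whole frame is serialized and parsed again, which can be slow for large data.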
Answered by Derrick Cheek
Alternatively: Pandas allows you to explicitly define datatypes when creating a dataframe. You pass in a dictionary with column names as the keys and the desired data types as the values.
See the documentation for the standard constructor.
Or you can cast the column's type after importing into the data frame, e.g.:
df['field_name'] = df['field_name'].astype('datetime64[ns]')
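A short sketch of the dictionary form. To my knowledge the `dtype` argument of the `pd.DataFrame` constructor takes a single dtype, while a column-to-dtype dictionary is accepted by `DataFrame.astype` (and by the `dtype` parameter of readers such as `pd.read_json`), so the example uses `astype`. The frame and column names are made up.

```python
import pandas as pd

# Made-up frame in which every column is stored as object.
df = pd.DataFrame(
    {
        "count": ["1", "2"],
        "price": ["1.5", "2.25"],
        "when": ["2020-01-01", "2020-06-30"],
    },
    dtype=object,
)

# Cast several columns at once with a column -> dtype mapping.
typed = df.astype({"count": "int64", "price": "float64"})

# Date strings are parsed with to_datetime, which gives control over the format.
typed["when"] = pd.to_datetime(typed["when"])
print(typed.dtypes)
```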
Answered by zebralove79
Try e.g.
df['field_name'] = df['field_name'].astype(np.float64)
(assuming that numpy has been imported with import numpy as np)
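One caveat worth sketching: a plain `astype(np.float64)` raises as soon as one value cannot be parsed, whereas `pd.to_numeric(..., errors="coerce")` turns such values into NaN. The sample Series is invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up column with one value that is not a number.
raw = pd.Series(["1.5", "2", "oops"], dtype=object)

try:
    raw.astype(np.float64)
    cast_ok = True
except ValueError:
    # The unparseable string makes the whole cast fail.
    cast_ok = False

# Coercion keeps the good values and marks the bad one as NaN.
coerced = pd.to_numeric(raw, errors="coerce")
print(cast_ok, coerced.tolist())
```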
Answered by Joe
Working off of BeigeBruceWayne's answer:
# df_final is the DataFrame whose columns should be converted
df_types = (
    pd.DataFrame(df_final.apply(pd.api.types.infer_dtype, axis=0))
    .reset_index()
    .rename(columns={'index': 'column', 0: 'type'})
)
loop_types = df_types.values.tolist()
for col in loop_types:
    if col[1] == 'mixed':
        pass
    else:
        if col[1] == 'decimal':
            data_type = 'float64'
        elif col[1] == 'string':
            data_type = 'str'
        elif col[1] == 'integer':
            data_type = 'int'
        elif col[1] == 'floating':
            data_type = 'float64'
        elif col[1] == 'date':
            # recent pandas versions require an explicit unit
            data_type = 'datetime64[ns]'
        else:
            data_type = col[1]
        df_final[col[0]] = df_final[col[0]].astype(data_type)
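The same mapping can be sketched with a lookup table instead of an if/elif chain. The names `LABEL_TO_DTYPE` and `cast_inferred` and the sample frame are assumptions for illustration. Labels without an entry (such as 'string' or 'mixed') are left untouched, and an integer column that contains missing values would still fail to cast to int64.

```python
import pandas as pd

# Hypothetical mapping from infer_dtype labels to target dtypes.
LABEL_TO_DTYPE = {
    "integer": "int64",
    "floating": "float64",
    "datetime": "datetime64[ns]",
}


def cast_inferred(frame):
    """Return a copy of frame with columns cast according to their inferred label."""
    out = frame.copy()
    for name in out.columns:
        label = pd.api.types.infer_dtype(out[name], skipna=True)
        target = LABEL_TO_DTYPE.get(label)
        if target is not None:
            out[name] = out[name].astype(target)
    return out


# Made-up frame in which every column is stored as object.
df_final = pd.DataFrame(
    {
        "qty": pd.Series([1, 2, 3], dtype=object),
        "ratio": pd.Series([0.5, 1.5, 2.5], dtype=object),
        "label": pd.Series(["a", "b", "c"], dtype=object),
    }
)
result = cast_inferred(df_final)
print(result.dtypes)
```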