Python Pandas inferring column datatypes

Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/35003138/

Date: 2020-09-14 00:34:00  Source: igfitidea

Tags: python, pandas, profiling

Asked by Calamari

I am reading JSON files into dataframes. The dataframe might have some string (object) columns, some numeric (int64 and/or float64) columns, and some datetime columns. When the data is read in, the datatype is often incorrect (i.e. datetime, int and float values will often be stored as "object" type). I want to report on this possibility (i.e. a column is in the dataframe as "object" (string), but it is actually a "datetime").

The problem I have is that when I use pd.to_numeric and pd.to_datetime, they will both evaluate and try to convert the column, and many times the result ends up depending on which of the two I call last. (I was going to use convert_objects(), which works but is deprecated, so I wanted a better option.)
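
To illustrate the conflict described above, here is a minimal sketch. The column `col` is hypothetical and not from the original question. Both converters accept an object column of plain integers, so whichever runs last decides the reported type:

```python
import pandas as pd

# Hypothetical object column holding plain integers (illustration only)
col = pd.Series([20200101, 20200102], dtype=object)

# to_numeric succeeds and yields an integer dtype
as_numeric = pd.to_numeric(col)

# to_datetime also succeeds: bare integers are read as nanoseconds since the epoch
as_datetime = pd.to_datetime(col)

print(as_numeric.dtype)
print(as_datetime.dtype)
```

Because neither call raises, a try/except around each one cannot tell the two cases apart on its own; the order of the checks has to encode which interpretation is preferred.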

The code I am using to evaluate the dataframe column is (I realize a lot of the below is redundant, but I have written it this way for readability):

try:
   inferred_type = pd.to_datetime(df[Field_Name]).dtype
   if inferred_type == "datetime64[ns]":
      inferred_type = "DateTime"
except:
   pass
try:
   inferred_type = pd.to_numeric(df[Field_Name]).dtype
   if inferred_type == int:
      inferred_type = "Integer"
   if inferred_type == float:
      inferred_type = "Float"
except:
   pass

Answered by PabTorre

I came across the same problem of having to figure out column types for incoming data where the type is not known beforehand (from a db read, in my case). I couldn't find a good answer here on SO, or by reviewing the pandas source code. I solved it using this function:

import pandas as pd


def _get_col_dtype(col):
    """
    Infer datatype of a pandas column, process only if the column dtype is object.
    input:   col: a pandas Series representing a df column.
    """
    if col.dtype == "object":
        # Only the distinct non-null values are needed to test a conversion
        values = col.dropna().unique()

        # Try datetime first, then numeric, then timedelta;
        # the first conversion that does not raise decides the dtype
        for convert in (pd.to_datetime, pd.to_numeric, pd.to_timedelta):
            try:
                return convert(values).dtype
            except Exception:
                continue

        # Nothing converted cleanly, so keep it as object
        return "object"

    return col.dtype
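
As a hedged variation on the function above (not the answer author's code), the sketch below tries the numeric conversion before the datetime one, so that a column of digit strings is reported as a number and not as a timestamp. The function name `infer_object_column` and the labels it returns are assumptions chosen to match the labels used in the question:

```python
import warnings

import pandas as pd


def infer_object_column(col):
    """Return 'Integer', 'Float', 'DateTime' or 'String' for an object column.

    Non-object columns are reported by their existing dtype name.
    """
    if col.dtype != "object":
        return str(col.dtype)

    values = col.dropna()
    if values.empty:
        return "String"

    # Numeric first: digit strings such as "1" should count as numbers
    try:
        numeric = pd.to_numeric(values)
        return "Integer" if pd.api.types.is_integer_dtype(numeric) else "Float"
    except (ValueError, TypeError):
        pass

    # Then datetime; parser warnings about format guessing are silenced
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            pd.to_datetime(values)
        return "DateTime"
    except (ValueError, TypeError, OverflowError):
        return "String"


# Usage on hypothetical columns
print(infer_object_column(pd.Series(["1", "2"], dtype=object)))
print(infer_object_column(pd.Series(["2020-01-01", "2020-02-01"], dtype=object)))
```

The ordering is a design choice: numeric-looking text is far more often a number than an epoch offset, so the stricter check runs first.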

Answered by BeigeBruceWayne

Deep in the Pandas API there actually is a function that does a half decent job.

import pandas as pd

infer_type = lambda x: pd.api.types.infer_dtype(x, skipna=True)
df.apply(infer_type, axis=0)


# DataFrame with column names & new types

df_types = pd.DataFrame(df.apply(pd.api.types.infer_dtype, axis=0)).reset_index().rename(columns={'index': 'column', 0: 'type'})

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.types.infer_dtype.html#pandas.api.types.infer_dtype
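
A small sketch of what infer_dtype reports may help set expectations. The sample columns are hypothetical. Note that it looks at the Python objects stored in the column and does not parse text, so numbers or dates held as strings are still labelled 'string':

```python
import datetime

import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical object columns (illustration only)
samples = {
    "ints": pd.Series([1, 2], dtype=object),
    "floats": pd.Series([1.5, 2.0], dtype=object),
    "numeric_text": pd.Series(["1", "2"], dtype=object),
    "date_text": pd.Series(["2020-01-01", "2020-02-01"], dtype=object),
    "dates": pd.Series(
        [datetime.date(2020, 1, 1), datetime.date(2020, 2, 1)], dtype=object
    ),
}

# Map each column name to the label infer_dtype assigns it
labels = {name: infer_dtype(col, skipna=True) for name, col in samples.items()}
print(labels)
```

So infer_dtype is useful for columns that hold real Python numbers or dates under an object dtype, while text columns still need a parsing step such as pd.to_numeric or pd.to_datetime.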

Since, according to the documentation, "The inference rules are the same as during normal Series/DataFrame construction", consider to_numeric for ints/floats,
eg: df['amount'] = pd.to_numeric(df['amount'], errors='ignore')
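
As far as I know, errors='ignore' is deprecated in recent pandas releases (2.2 and later), so an explicit try/except is a more durable way to get the same "convert if possible, otherwise leave alone" behaviour. This is a minimal sketch; the helper name `to_numeric_or_keep` is hypothetical:

```python
import pandas as pd


def to_numeric_or_keep(col):
    """Return the column converted to a numeric dtype, or unchanged if that fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col


# Usage on hypothetical columns
amount = to_numeric_or_keep(pd.Series(["1", "2"], dtype=object))
label = to_numeric_or_keep(pd.Series(["a", "1"], dtype=object))
print(amount.dtype)
print(label.dtype)
```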

Answered by Daniel H

One solution to get it to infer dtypes is to get it to write the data to a CSV using StringIO, and then read it back.
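
The round trip could look like the sketch below. The helper name `reinfer_dtypes_via_csv` is hypothetical, and the approach relies on read_csv's default type inference:

```python
import io

import pandas as pd


def reinfer_dtypes_via_csv(df):
    """Write df to an in-memory CSV and read it back so read_csv re-infers dtypes."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)  # rewind so read_csv starts from the beginning
    return pd.read_csv(buffer)


# Usage on a hypothetical frame whose numbers arrived as text
raw = pd.DataFrame({
    "a": pd.Series(["1", "2"], dtype=object),
    "b": pd.Series(["1.5", "2.5"], dtype=object),
})
fixed = reinfer_dtypes_via_csv(raw)
print(fixed.dtypes)
```

Two caveats: read_csv does not parse dates unless asked to (for example through its parse_dates argument), and text such as "007" will come back as the number 7, so this is best used for reporting and not as a lossless conversion.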

Answered by Derrick Cheek

Alternatively: Pandas allows you to explicitly define datatypes when creating a dataframe. You pass in a dictionary with column names as the key and the data type desired as the value.

Documentation here for the standard constructor

Or you can cast the column's type after importing into the data frame

eg: df['field_name'] = df['field_name'].astype('datetime64[ns]')
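
A hedged sketch of explicit casting follows; the frame and its column names are hypothetical. To my knowledge the DataFrame constructor's dtype argument takes a single dtype, whereas DataFrame.astype (and the dtype argument of readers such as pd.read_json) accepts a column-to-dtype dictionary, so the dictionary form is shown with astype here:

```python
import pandas as pd

# Hypothetical frame where every column arrived as text
df = pd.DataFrame({
    "amount": pd.Series(["1.5", "2"], dtype=object),
    "count": pd.Series(["3", "4"], dtype=object),
    "created": pd.Series(["2020-01-01", "2020-02-01"], dtype=object),
})

# astype accepts a mapping of column name -> target dtype
df = df.astype({"amount": "float64", "count": "int64"})

# Date strings are parsed with pd.to_datetime
df["created"] = pd.to_datetime(df["created"])

print(df.dtypes)
```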

Answered by zebralove79

Try e.g.

df['field_name'] = df['field_name'].astype(np.float64)

(assuming import numpy as np)

Answered by Joe

Working off of BeigeBruceWayne's answer:

df_types = pd.DataFrame(df_final.apply(pd.api.types.infer_dtype, axis=0)).reset_index().rename(columns={'index': 'column', 0: 'type'})

# Map infer_dtype labels to the dtypes to cast to
type_map = {
    'decimal': 'float64',
    'string': 'str',
    'integer': 'int',
    'floating': 'float64',
    'date': 'datetime64[ns]',  # a unit is required by newer pandas versions
}

for column, inferred in df_types.values.tolist():
    if inferred == 'mixed':
        continue  # leave mixed columns untouched
    # Labels without a mapping are passed to astype unchanged
    df_final[column] = df_final[column].astype(type_map.get(inferred, inferred))