Python Pandas: Inferring Column Data Types

Question

Asked by Calamari

I am reading JSON files into dataframes. The dataframe might have some string (object) type columns, some numeric (int64 and/or float64) columns, and some datetime type columns. When the data is read in, the datatype is often incorrect (i.e. datetime, int and float values will often be stored as "object" type). I want to report on this possibility (i.e. a column is in the dataframe as "object" (string), but it is actually a "datetime").

The problem I have is that when I use pd.to_numeric and pd.to_datetime, they will both evaluate and try to convert the column, and many times the result ends up depending on which of the two I call last. (I was going to use convert_objects(), which works, but it is deprecated, so I wanted a better option.)

The code I am using to evaluate the dataframe column is below (I realize a lot of it is redundant, but I have written it this way for readability):

try:
   inferred_type = pd.to_datetime(df[Field_Name]).dtype
   if inferred_type == "datetime64[ns]":
      inferred_type = "DateTime"
except:
   pass
try:
   inferred_type = pd.to_numeric(df[Field_Name]).dtype
   if inferred_type == int:
      inferred_type = "Integer"
   if inferred_type == float:
      inferred_type = "Float"
except:
   pass

Answer 1

Answered by PabTorre

I came across the same problem of having to figure out column types for incoming data where the type is not known beforehand, from a db read in my case. I couldn't find a good answer here on SO or by reviewing the pandas source code, so I solved it using this function:

import pandas as pd


def _get_col_dtype(col):
    """
    Infer datatype of a pandas column, process only if the column dtype is object.
    input:   col: a pandas Series representing a df column.
    """
    if col.dtype != "object":
        return col.dtype

    # Work on the distinct non-null values only
    values = col.dropna().unique()

    # Try datetime first, then numeric, then timedelta;
    # the first conversion that does not raise decides the dtype
    for converter in (pd.to_datetime, pd.to_numeric, pd.to_timedelta):
        try:
            return converter(values).dtype
        except Exception:
            continue

    # Nothing converted cleanly, so the column stays "object"
    return "object"
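
A related sketch, not PabTorre's function: for reporting purposes it may help to try pd.to_numeric before pd.to_datetime, so that digit-only text is never considered as a date, and to return labels like the ones used in the question. The helper name report_inferred_type and the sample frame below are hypothetical.

```python
import warnings

import pandas as pd


def report_inferred_type(col):
    """Label what a text/object column most plausibly holds (hypothetical helper)."""
    is_text = pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col)
    if not is_text:
        # Already typed: just report the existing dtype
        return str(col.dtype)

    values = col.dropna()
    if values.empty:
        return "String"

    # Numeric check first: to_numeric raises on anything that is not a number
    numeric_kind = None
    try:
        numeric_kind = pd.to_numeric(values).dtype.kind
    except (ValueError, TypeError):
        pass
    if numeric_kind in ("i", "u"):
        return "Integer"
    if numeric_kind == "f":
        return "Float"

    # Datetime check second; parser warnings are silenced for this probe
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            pd.to_datetime(values)
        return "DateTime"
    except (ValueError, TypeError, OverflowError):
        return "String"


# Made-up sample data in which every column arrives as text
df = pd.DataFrame({
    "count": ["1", "2", "3"],
    "price": ["1.5", "2.25", None],
    "when": ["2021-01-01", "2021-02-15", "2021-03-31"],
    "name": ["alpha", "beta", "gamma"],
})
report = {name: report_inferred_type(df[name]) for name in df.columns}
print(report)
```

Nothing is converted in place here; the function only reports, so the order of the checks cannot overwrite an earlier result.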

Answer 2

Answered by BeigeBruceWayne

Deep in the Pandas API there actually is a function that does a half decent job.

import pandas as pd

infer_type = lambda x: pd.api.types.infer_dtype(x, skipna=True)
df.apply(infer_type, axis=0)


# DataFrame with column names & new types

df_types = pd.DataFrame(df.apply(pd.api.types.infer_dtype, axis=0)).reset_index().rename(columns={'index': 'column', 0: 'type'})

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.types.infer_dtype.html#pandas.api.types.infer_dtype
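
As a small illustration of what infer_dtype reports, the sketch below runs it on a made-up frame whose columns are all stored as object. The column names and values are hypothetical. It inspects the Python objects in each column and does not parse text, so digits stored as text are reported as strings.

```python
import datetime

import pandas as pd

# Made-up frame: every column has dtype object, but the contents differ
df = pd.DataFrame({
    "digits_as_text": pd.Series(["1", "2", "3"], dtype=object),
    "real_ints": pd.Series([1, 2, 3], dtype=object),
    "dates": pd.Series(
        [datetime.date(2021, 1, 1), datetime.date(2021, 2, 1), datetime.date(2021, 3, 1)],
        dtype=object,
    ),
})

# One label per column, based on the objects the column actually contains
labels = {name: pd.api.types.infer_dtype(df[name], skipna=True) for name in df.columns}
print(labels)
```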

Since the documentation says that "the inference rules are the same as during normal Series/DataFrame construction", consider to_numeric for ints/floats (note that recent pandas releases deprecate errors='ignore' in favor of catching the exception explicitly),
e.g.: df['amount'] = pd.to_numeric(df['amount'], errors='ignore')

Answer 3

Answered by Daniel H

One solution to get pandas to infer dtypes is to write the data to a CSV using StringIO, and then read it back.
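
A minimal sketch of that round trip, using a made-up frame whose columns all start out as object. Note that read_csv infers numeric types on its own but leaves date-like text as text unless parse_dates is passed.

```python
import io

import pandas as pd

# Made-up frame in which every value is stored as a Python string
df = pd.DataFrame(
    {
        "count": ["1", "2", "3"],
        "price": ["1.5", "2.25", "3.0"],
        "name": ["alpha", "beta", "gamma"],
    },
    dtype=object,
)

# Write to an in-memory CSV, rewind, and let read_csv infer the dtypes
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reparsed = pd.read_csv(buffer)

print(reparsed.dtypes)
```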

Answer 4

Answered by Derrick Cheek

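Alternatively, pandas allows you to explicitly define datatypes when loading the data. You pass in a dictionary with column names as the keys and the desired data types as the values; the dtype argument of readers such as pd.read_json and pd.read_csv accepts such a dictionary, whereas the DataFrame constructor's dtype argument takes a single dtype for the whole frame.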

See the documentation for the standard DataFrame constructor.

Or you can cast the column's type after importing into the data frame,

e.g.: df['field_name'] = df['field_name'].astype('datetime64[ns]')
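
The cast-after-import route can also take a dictionary, since DataFrame.astype accepts a column-to-dtype mapping. The sketch below uses made-up column names and values; the date column goes through pd.to_datetime, which parses the text rather than casting it.

```python
import pandas as pd

# Made-up frame in which every value arrives as text
df = pd.DataFrame(
    {
        "field_name": ["2021-01-01", "2021-06-30"],
        "amount": ["10", "20"],
        "ratio": ["0.5", "0.75"],
    },
    dtype=object,
)

# Cast several columns at once with a column -> dtype mapping
typed = df.astype({"amount": "int64", "ratio": "float64"})

# Parse the date column separately
typed["field_name"] = pd.to_datetime(typed["field_name"])

print(typed.dtypes)
```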

Answer 5

Answered by zebralove79

Try, e.g.:

df['field_name'] = df['field_name'].astype(np.float64)

(assuming numpy has been imported with import numpy as np)
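
Note that astype raises an error when any value cannot be converted, which can be used to report whether a cast is possible without changing the column. A minimal sketch; the helper name can_cast and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def can_cast(col, dtype):
    """Return True if every value in col converts to dtype (hypothetical helper)."""
    try:
        col.astype(dtype)  # the converted copy is discarded; col is not modified
        return True
    except (ValueError, TypeError):
        return False


clean = pd.Series(["1.5", "2", "3.25"], dtype=object)
dirty = pd.Series(["1.5", "x", "3.25"], dtype=object)

print(can_cast(clean, np.float64))
print(can_cast(dirty, np.float64))
```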

Answer 6

Answered by Joe

Working off of BeigeBruceWayne's answer:

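import pandas as pd

# df_final is the DataFrame to convert (defined elsewhere).
# One row per column: its name and the type label reported by infer_dtype
df_types = pd.DataFrame(df_final.apply(pd.api.types.infer_dtype, axis=0)).reset_index().rename(columns={'index': 'column', 0: 'type'})

loop_types = df_types.values.tolist()

for col in loop_types:
    if col[1] == 'mixed':
        # Leave columns with mixed contents untouched
        pass
    else:
        # Translate the infer_dtype label into a dtype that astype understands
        if col[1] == 'decimal':
            data_type = 'float64'
        elif col[1] == 'string':
            data_type = 'str'
        elif col[1] == 'integer':
            data_type = 'int'
        elif col[1] == 'floating':
            data_type = 'float64'
        elif col[1] == 'date':
            # A unit is required: pandas 2.x rejects a unit-less 'datetime64'
            data_type = 'datetime64[ns]'
        else:
            data_type = col[1]
        df_final[col[0]] = df_final[col[0]].astype(data_type)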
