Python: Determining Pandas Column DataType
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, citing the original URL and attributing it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/41262370/
Determining Pandas Column DataType
Asked by code base 5000
Sometimes when data is imported into a Pandas DataFrame, it all comes in as type object. This is fine for most operations, but I am trying to create a custom export function, and my question is this:
- Is there a way to force Pandas to infer the data types of the input data?
- If not, is there a way to somehow infer the data types after the data is loaded?
I know I can tell Pandas that a column is of type int, str, etc., but I don't want to do that; I was hoping Pandas could be smart enough to know all the data types when a user imports or adds a column.
EDIT - example of import
import pandas as pd

a = ['a']
col = ['somename']
df = pd.DataFrame(a, columns=col)
print(df.dtypes)
somename    object
dtype: object
Shouldn't the type be string?
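Pandas (in the classic versions this question is about) stores Python strings under its catch-all object dtype, which is why the column above reports object even though every element is a str. A minimal sketch confirming this:

import pandas as pd

df = pd.DataFrame(['a'], columns=['somename'])
print(df['somename'].dtype)          # object -- the column-level dtype
print(type(df['somename'].iloc[0]))  # <class 'str'> -- the element itself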
Answered by lmo
This is only a partial answer, but you can get frequency counts of the data types of the elements in each column over the entire DataFrame as follows:
dtypeCount = [df.iloc[:, i].apply(type).value_counts() for i in range(df.shape[1])]
This returns
dtypeCount
[<class 'numpy.int32'> 4
Name: a, dtype: int64,
<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64,
<class 'numpy.int32'> 4
Name: c, dtype: int64]
It doesn't print this nicely, but you can pull out information for any variable by location:
dtypeCount[1]
<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64
which should get you started in finding what data types are causing the issue and how many of them there are.
You can then inspect the rows that have a str object in the second variable using
df[df.iloc[:,1].map(lambda x: type(x) == str)]
a b c
1 1 n 4
3 3 g 6
data
df = pd.DataFrame({'a': range(4),
                   'b': [6, 'n', 7, 'g'],
                   'c': range(3, 7)})
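Once the mixed column is identified, one possible follow-up (a sketch, not part of the original answer; it continues from the df defined above) is pd.to_numeric with errors='coerce', which turns the unparseable strings into NaN:

# A sketch: coerce the mixed column; values that cannot be parsed become NaN.
df['b_num'] = pd.to_numeric(df['b'], errors='coerce')
print(df['b_num'])
# 0    6.0
# 1    NaN
# 2    7.0
# 3    NaN
# Name: b_num, dtype: float64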
Answered by shahar_m
You can also infer the dtypes after dropping irrelevant items by using infer_objects(). Below is a general example.
import pandas as pd

df_orig = pd.DataFrame({"A": ["a", 1, 2, 3], "B": ["b", 1.2, 1.8, 1.8]})
df = df_orig.iloc[1:].infer_objects()
print(df_orig.dtypes, df.dtypes, sep='\n\n')
Output:
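With pandas 0.21 or later (the version that introduced infer_objects()), this should print:

A    object
B    object
dtype: object

A      int64
B    float64
dtype: object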
Answered by MisterMonk
Here is a (not perfect) attempt to write a better inferrer. When you already have data in your DataFrame, the inferrer will guess the smallest possible type. Datetime is currently missing, but I think it could be a starting point. With this inferrer, I can cut the memory in use by about 70%.
import numpy as np


def infer_df(df, hard_mode=False, float_to_int=False, mf=None):
    ret = {}

    # ToDo: How much does auto conversion cost?
    # Set the multiplication factor (a safety margin) unless the caller passed one.
    if mf is None:
        mf = 1 if hard_mode else 0.5

    # Set the supported data types.
    integers = ['int8', 'int16', 'int32', 'int64']
    floats = ['float16', 'float32', 'float64']
    # ToDo: Unsigned integers

    # Generate the value borders for each data type.
    b_integers = [(np.iinfo(i).min, np.iinfo(i).max, i) for i in integers]
    b_floats = [(np.finfo(f).min, np.finfo(f).max, f) for f in floats]

    for c in df.columns:
        _type = df[c].dtype

        # If a column is stored as float but only holds whole numbers,
        # optionally convert it to int first.
        if float_to_int and np.issubdtype(_type, np.floating):
            if np.sum(np.remainder(df[c], 1)) == 0:
                df[c] = df[c].astype('int64')
                _type = df[c].dtype

        # Convert the type of the column to the smallest possible.
        if np.issubdtype(_type, np.integer) or np.issubdtype(_type, np.floating):
            borders = b_integers if np.issubdtype(_type, np.integer) else b_floats
            _min = df[c].min()
            _max = df[c].max()
            for b in borders:
                if b[0] * mf < _min and _max < b[1] * mf:
                    ret[c] = b[2]
                    break

        # Low-cardinality object columns are good candidates for 'category'.
        if _type == 'object' and len(df[c].unique()) / len(df) < 0.1:
            ret[c] = 'category'

    return ret
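A minimal usage sketch (the sample frame and column names below are made up for illustration): the function returns a plain {column: dtype} mapping, which pandas can apply in one call via astype().

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(100), 'y': ['a', 'b'] * 50})
print(infer_df(df))           # should print {'x': 'int16', 'y': 'category'}
                              # (int8 is rejected because of the 0.5 safety margin)
df = df.astype(infer_df(df))  # shrink the frame
print(df.dtypes)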