Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/29739894/

Time: 2020-09-13 23:13:35  Source: igfitidea

pandas: read_csv how to force bool data to dtype bool instead of object

Tags: python, pandas

Asked by Prasanjit Prakash

I'm reading in a large flat file that holds timestamped data in multiple columns. One of the columns is boolean: each entry is either True/False or empty (an empty entry is read as NaN).

When I read the CSV, the bool column is typecast to object, which prevents saving the data in an HDFStore because of a serialization error.

Example data:

A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4

I use the following command to read it:

import pandas as pd
pd.read_csv('data.csv', parse_dates=True)

One solution is to specify the dtype while reading the CSV, but I was hoping for a more succinct solution, something like convert_objects, where I can specify parse_numeric or parse_dates.

Accepted answer by EdChum

Because your CSV has a missing value, the dtype of that column is reported as object: the column holds mixed types. The first three row values are booleans, while the last one is NaN, which is a float.
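The mixed column can be inspected directly. The sketch below uses a comma-separated variant of the sample (an assumption made here so that the empty last field is explicit) and looks at the inferred dtypes and at the Python type of each value in D.

```python
import io
import pandas as pd

# Comma-separated variant of the sample data (assumed for illustration);
# the trailing comma on the last row leaves D empty.
csv_text = """A,B,C,D
a,1,2,true
b,5,7,false
c,3,2,true
d,9,4,
"""

df = pd.read_csv(io.StringIO(csv_text))

# D mixes booleans with a float NaN, so pandas falls back to object dtype.
value_types = [type(v).__name__ for v in df["D"]]
print(df.dtypes)
print(value_types)
```

Columns without gaps (B and C here) still get a numeric dtype; only the column with the missing entry is affected.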

To convert the NaN value, use fillna. It accepts a dict that maps columns to the desired fill values and produces a homogeneous dtype:

In [9]:

import io
import pandas as pd

t="""A    B    C    D
a    1    NaN    true
b    5    7    false
c    3    2    true
d    9    4"""

df = pd.read_csv(io.StringIO(t), sep=r'\s+')

df
Out[9]:
   A  B   C      D
0  a  1 NaN   True
1  b  5   7  False
2  c  3   2   True
3  d  9   4    NaN
In [11]:

df.fillna({'C':0, 'D':False})
Out[11]:
   A  B  C      D
0  a  1  0   True
1  b  5  7  False
2  c  3  2   True
3  d  9  4  False
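Whether fillna alone leaves D as bool or as object can depend on the pandas version, so an explicit cast afterwards may be worth adding. A minimal sketch, assuming a comma-separated variant of the sample data (the cast is not part of the original answer):

```python
import io
import pandas as pd

# Comma-separated variant of the sample (assumed for illustration):
# C is empty in the first row, D is empty in the last row.
csv_text = """A,B,C,D
a,1,,true
b,5,7,false
c,3,2,true
d,9,4,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Fill the gaps per column, as in the answer above, then cast explicitly
# so the resulting dtypes do not rely on implicit downcasting.
df = df.fillna({"C": 0, "D": False}).astype({"C": "int64", "D": bool})

print(df.dtypes)
```

Older pandas releases downcast the filled object column on their own, and some newer releases emit a deprecation warning when doing so, which is why the cast is spelled out here.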

Answer by Anzel

You can use the dtype parameter of read_csv, which accepts a dictionary mapping column names to types:

dtype : Type name or dict of column -> type
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
import io
import pandas as pd

# using your sample
csv_file = io.StringIO('''
A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4''')

# force column D to bool; note that newer pandas versions may raise
# ValueError here, since plain bool cannot represent the missing value in D
df = pd.read_csv(csv_file, sep=r'\s+', dtype={'D': bool})
# then fillna to convert NaN to False
df = df.fillna(value=False)

df 
   A  B  C      D
0  a  1  2   True
1  b  5  7  False
2  c  3  2   True
3  d  9  4  False

df.D.dtypes
dtype('bool')
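A variation on this idea, not taken from the answer itself: recent pandas versions offer a nullable "boolean" extension dtype that can hold missing values, so the column can be typed at read time and converted to plain bool afterwards. A sketch, assuming pandas 1.0 or later and a comma-separated variant of the sample:

```python
import io
import pandas as pd

# Comma-separated variant of the sample data (assumed for illustration).
csv_text = """A,B,C,D
a,1,2,true
b,5,7,false
c,3,2,true
d,9,4,
"""

# "boolean" is the nullable extension dtype; the empty field becomes <NA>.
df = pd.read_csv(io.StringIO(csv_text), dtype={"D": "boolean"})
nullable_dtype = df["D"].dtype

# Replace the missing entry with False and convert to NumPy bool.
df["D"] = df["D"].fillna(False).astype(bool)

print(nullable_dtype, "->", df["D"].dtype)
```

If the missing entries should stay distinguishable from False, the final fillna/astype step can be skipped and the column kept as nullable "boolean"; whether HDFStore accepts that dtype is not covered here.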