pandas: how to make read_csv parse boolean data as dtype bool instead of object

Question

Asked by Prasanjit Prakash

I'm reading in a large flat file which has timestamped data with multiple columns. The data has a boolean column which can be True/False or can have no entry (which evaluates to NaN).

When reading the CSV, the bool column gets typecast as object, which prevents saving the data in HDFStore because of a serialization error.

Example data:

A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4

I use the following command to read it:

import pandas as pd
pd.read_csv('data.csv', parse_dates=True)

One solution is to specify the dtype while reading in the CSV, but I was hoping for a more succinct solution like convert_objects, where I can specify parse_numeric or parse_dates.

Answer 1

Accepted answer by EdChum

Because your CSV has a missing value, the dtype of the column is shown as object: the column holds mixed types, since the first three row values are boolean and the last one is a float (NaN).
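
The mixed types can be checked directly. The following is a minimal sketch, assuming the question's sample is whitespace-separated and read from an in-memory string; the names `data` and `kinds` are only for illustration:

```python
import io
import pandas as pd

# The question's sample; the last row has no entry for column D
data = """A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4
"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Column D falls back to object because it mixes booleans with a float NaN
print(df["D"].dtype)

# Name the Python type of each value held in the column
kinds = df["D"].map(lambda v: type(v).__name__).tolist()
print(kinds)
```

The NaN placeholder is a float, which is why a single missing entry is enough to stop the column from being stored as bool.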

To convert the NaN value, use fillna. It accepts a dict that maps columns to the desired fill values and produces a homogeneous dtype:

In [9]:

import io
import pandas as pd

t = """A    B    C    D
a    1    NaN    true
b    5    7    false
c    3    2    true
d    9    4"""

# whitespace-separated sample; the last row has no value for D
df = pd.read_csv(io.StringIO(t), sep=r'\s+')

df
Out[9]:
   A  B   C      D
0  a  1 NaN   True
1  b  5   7  False
2  c  3   2   True
3  d  9   4    NaN
In [11]:

df.fillna({'C':0, 'D':False})
Out[11]:
   A  B  C      D
0  a  1  0   True
1  b  5  7  False
2  c  3  2   True
3  d  9  4  False
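
Depending on the pandas version, the filled column may still be reported as object, so an explicit cast is one way to be sure of the result. A small sketch, assuming only column D needs fixing; the name `raw` is illustrative:

```python
import io
import pandas as pd

raw = """A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Treat the missing entry as False, then cast explicitly to bool
df["D"] = df["D"].fillna(False).astype(bool)

print(df["D"].tolist())
print(df["D"].dtype)
```

Filling with False is a modelling decision: after this step a row that had no entry can no longer be told apart from a row that said false.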

Answer 2

Answer by Anzel

You can use dtype; it accepts a dictionary that maps columns to types:

dtype : Type name or dict of column -> type
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}

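import io
import pandas as pd

# using your sample (io.StringIO, since a str cannot be passed to io.BytesIO on Python 3)
csv_file = io.StringIO('''
A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4''')

# 'boolean' is pandas' nullable boolean dtype (pandas 1.0+); it is used here because
# np.bool was removed from NumPy and recent pandas versions are expected to reject
# plain bool for a column that contains a missing value
df = pd.read_csv(csv_file, sep=r'\s+', dtype={'D': 'boolean'})
# then fillna to convert the missing value to False, and cast to plain bool
df['D'] = df['D'].fillna(False).astype(bool)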

df 
   A  B  C      D
0  a  1  2   True
1  b  5  7  False
2  c  3  2   True
3  d  9  4  False

df.D.dtypes
dtype('bool')
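
If filling the gap with False would hide the fact that a value was missing, another option is to keep pandas' nullable boolean dtype. This is a sketch assuming pandas 1.0 or later; whether HDFStore can serialize this dtype is not covered by the answers here and would need checking:

```python
import io
import pandas as pd

sample = """A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4
"""
# The lowercase string "boolean" selects the nullable BooleanDtype
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", dtype={"D": "boolean"})

print(df["D"].dtype)
# The missing entry is kept as a missing value instead of becoming False
print(df["D"].isna().tolist())
```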
