Python: What is the difference between dtype and converters in pandas.read_csv?

Question

Asked by Bryan

The pandas function read_csv() reads a .csv file; see its documentation for the full parameter list.

According to the documentation:

dtype : Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python')

and


converters : dict, default None. Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

When using this function, I can call either pandas.read_csv('file', dtype=object) or pandas.read_csv('file', converters=object). The name converters makes it obvious that values will be converted, but what does dtype do in comparison?

Answer 1

Accepted answer by EdChum

The semantic difference is that dtype allows you to specify how to treat the values, for example, as either a numeric or a string type.

converters allows you to parse your input data and convert it to a desired dtype using a conversion function, e.g. parsing a string value to datetime or to some other desired dtype.
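As a rough illustration of the two parameters side by side, here is a minimal sketch. The CSV text, the column names code and price, and the helper parse_price are all made up for this example; they are not part of the original question.

```python
import io

import pandas as pd

# Hypothetical sample data: an ID with leading zeros and a price with a "$" prefix.
csv_text = "code,price\n007,$1.50\n042,$20.00\n"


def parse_price(raw):
    # A converter is called once per cell and receives the raw text of that cell.
    return float(raw.lstrip("$"))


df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"code": object},             # dtype: keep the text as-is, no parsing logic
    converters={"price": parse_price},  # converters: run custom code on each value
)

print(df["code"].tolist())
print(df["price"].tolist())
```

The dtype entry only declares how the column should be stored, while the converter can run arbitrary Python on each cell, which is what makes stripping the currency symbol possible here.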

Here we see that pandas tries to sniff the types:

In [2]:
import io
import pandas as pd

# Sample CSV held in memory; note the leading zeros in the first and last columns
t="""int,float,date,str
001,3.31,2015/01/01,005"""
df = pd.read_csv(io.StringIO(t))
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null int64
float    1 non-null float64
date     1 non-null object
str      1 non-null int64
dtypes: float64(1), int64(2), object(1)
memory usage: 40.0+ bytes

You can see from the above that 001 and 005 are treated as int64, but the date string stays as str (shown as object).

If we say everything is object, then essentially everything is str:

In [3]:
pd.read_csv(io.StringIO(t), dtype=object).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null object
float    1 non-null object
date     1 non-null object
str      1 non-null object
dtypes: object(4)
memory usage: 40.0+ bytes
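To check what this means for the actual values, the following sketch reuses the same sample data and compares one cell read with type inference against the same cell read with dtype=object. The variable names inferred and as_object are chosen here for illustration only.

```python
import io

import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

inferred = pd.read_csv(io.StringIO(t))                 # pandas guesses the types
as_object = pd.read_csv(io.StringIO(t), dtype=object)  # every column kept as text

# With inference the leading zeros are lost because the value becomes an integer.
print(repr(inferred.loc[0, "int"]))

# With dtype=object the original text of each cell is preserved.
print(repr(as_object.loc[0, "int"]))
print(type(as_object.loc[0, "float"]))
```

This is the practical reason to pass dtype for ID-like columns such as zip codes or account numbers, where the leading zeros carry meaning.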

Here we force the int column to str and use parse_dates so that the default date parser parses the date column:

In [6]:
pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null object
float    1 non-null float64
date     1 non-null datetime64[ns]
str      1 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 40.0+ bytes

Similarly, we could have passed the to_datetime function as a converter to convert the dates:

In [5]:
pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null int64
float    1 non-null float64
date     1 non-null datetime64[ns]
str      1 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 40.0 bytes
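A converter does not have to be a pandas function. As a sketch under that assumption, the example below uses a small hand-written helper (parse_slash_date, a hypothetical name) with an explicit date format and compares the resulting value with what parse_dates produces. The dtype of the converted column can differ between pandas versions, so only the values are compared.

```python
import io
from datetime import datetime

import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""


def parse_slash_date(raw):
    # Hypothetical helper: parse dates written as YYYY/MM/DD.
    return datetime.strptime(raw, "%Y/%m/%d")


# Custom per-cell parsing through converters.
via_converter = pd.read_csv(io.StringIO(t), converters={"date": parse_slash_date})

# Built-in date handling through parse_dates.
via_parse_dates = pd.read_csv(io.StringIO(t), parse_dates=["date"])

print(via_converter.loc[0, "date"])
print(via_parse_dates.loc[0, "date"])
```

A converter is called once per cell in Python, so for large files the built-in parse_dates path is usually the faster choice; a converter is mainly useful when the cleanup logic cannot be expressed through the built-in options.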
