Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/34139102/

Time: 2020-08-19 14:30:06  Source: igfitidea

What's the difference between dtype and converters in pandas.read_csv?

python pandas types converter type-inference

Asked by Bryan

The pandas function read_csv() reads a .csv file. Its documentation is here

According to the documentation, we know:

dtype : Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python')

and

converters : dict, default None. Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

When using this function, I can call either pandas.read_csv('file', dtype=object) or pandas.read_csv('file', converters=object). The name converters obviously suggests that the data type will be converted, but what does dtype do in that case?

Accepted answer by EdChum

The semantic difference is that dtype allows you to specify how to treat the values, for example, either as a numeric or a string type.

converters allows you to parse your input data and convert it to a desired dtype using a conversion function, e.g., parsing a string value to datetime or to some other desired dtype.
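To make the distinction concrete, here is a minimal sketch using the same sample data as the rest of this answer. The helper name normalize_date is made up for illustration; it is not part of pandas. It assumes the default (C) parser engine.

```python
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

# dtype: only declares how the column's values are stored (here: as strings),
# so the leading zeros in "001" survive.
by_dtype = pd.read_csv(io.StringIO(t), dtype={"int": str})

# converters: runs an arbitrary function on each raw cell string of the column.
# normalize_date is a hypothetical helper, not a pandas function.
def normalize_date(cell):
    return cell.replace("/", "-")

by_converter = pd.read_csv(io.StringIO(t), converters={"date": normalize_date})

print(repr(by_dtype["int"].iloc[0]))
print(repr(by_converter["date"].iloc[0]))
```

In short, dtype picks a storage type for values as they are, while a converter can rewrite each value before it is stored.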

Here we see that pandas tries to sniff the types:

In [2]:
import io
import pandas as pd

# sample CSV text: one header row and one data row
t="""int,float,date,str
001,3.31,2015/01/01,005"""
df = pd.read_csv(io.StringIO(t))
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null int64
float    1 non-null float64
date     1 non-null object
str      1 non-null int64
dtypes: float64(1), int64(2), object(1)
memory usage: 40.0+ bytes

You can see from the above that 001 and 005 are treated as int64 but the date string stays as a str (object dtype).
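A practical consequence of this inference is that the leading zeros are lost. A small check, assuming the same sample text as above:

```python
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

df = pd.read_csv(io.StringIO(t))

# With default type inference, "001" and "005" are parsed as integers,
# so the leading zeros cannot be recovered from the DataFrame.
int_value = df["int"].iloc[0]
str_value = df["str"].iloc[0]

# The date is not parsed by default; it remains a plain Python string.
date_value = df["date"].iloc[0]

print(int_value, str_value, repr(date_value))
```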

If we say everything is object then essentially everything is str:

In [3]:
pd.read_csv(io.StringIO(t), dtype=object).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null object
float    1 non-null object
date     1 non-null object
str      1 non-null object
dtypes: object(4)
memory usage: 40.0+ bytes
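As a quick sanity check of what dtype=object gives you, the sketch below inspects the actual cell values. It also shows that any numeric conversion then has to be done explicitly afterwards, here with astype as one possible way.

```python
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

df = pd.read_csv(io.StringIO(t), dtype=object)

# Every cell is kept as the raw text from the file.
for col in df.columns:
    value = df[col].iloc[0]
    print(col, repr(value), type(value).__name__)

# Nothing was parsed, so numeric work needs an explicit conversion step.
as_float = df["float"].astype(float)
print(as_float.iloc[0])
```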

Here we force the int column to str and tell parse_dates to use the date_parser to parse the date column:

In [6]:
pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null object
float    1 non-null float64
date     1 non-null datetime64[ns]
str      1 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 40.0+ bytes

Similarly, we could have passed the to_datetime function as a converter to convert the dates:

In [5]:
pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null int64
float    1 non-null float64
date     1 non-null datetime64[ns]
str      1 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 40.0 bytes
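If the date format is known, a converter can also pin it down explicitly instead of relying on inference. This is only a sketch: the helper parse_ymd is a made-up name, and the format string is an assumption based on the sample data.

```python
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

# Hypothetical helper: parse one raw cell using a fixed, known format.
def parse_ymd(cell):
    return pd.to_datetime(cell, format="%Y/%m/%d")

via_converter = pd.read_csv(io.StringIO(t), converters={"date": parse_ymd})
via_parse_dates = pd.read_csv(io.StringIO(t), parse_dates=["date"])

# Both approaches should yield the same timestamp for this input.
print(via_converter["date"].iloc[0])
print(via_parse_dates["date"].iloc[0])
```

The converter is called once per cell, so on large files parse_dates is likely the cheaper option; the converter is mainly useful when each value needs custom handling.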