What's the difference between dtype and converters in pandas.read_csv?
Original question: http://stackoverflow.com/questions/34139102/
This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Bryan
The pandas function read_csv() reads a .csv file. Its documentation is here
According to the documentation, we know:
dtype : Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python')
and
converters : dict, default None. Dict of functions for converting values in certain columns. Keys can either be integers or column labels
When using this function, I can call either pandas.read_csv('file', dtype=object) or pandas.read_csv('file', converters=object). The name converters obviously suggests that the data type will be converted, but what does dtype do in that case?
Accepted answer by EdChum
The semantic difference is that dtype allows you to specify how to treat the values, for example as a numeric or a string type.
converters allows you to parse your input data and convert it to a desired dtype using a conversion function, e.g. parsing a string value to datetime or to some other desired dtype.
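As a minimal sketch of that distinction (the CSV text, the column names and the helper strip_zeros_to_int are made up for illustration and are not part of the original answer), the same column can be read once with dtype and once with converters:

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the original answer.
csv_text = "code,amount\n007,1.50\n042,2.25\n"

# dtype: declare how the column's values should be treated (here: as strings).
by_dtype = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})


def strip_zeros_to_int(cell):
    """Hypothetical converter: receives the raw cell text, returns an int."""
    return int(cell.lstrip("0") or "0")


# converters: run a function of your choice on every cell of the column.
by_converter = pd.read_csv(
    io.StringIO(csv_text), converters={"code": strip_zeros_to_int}
)

print(by_dtype["code"].tolist())      # ['007', '042']
print(by_converter["code"].tolist())  # [7, 42]
```

With dtype the leading zeros survive because the text is kept as-is; with converters the result is whatever the function returns.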
Here we see that pandas tries to sniff the types:
In [2]:
import io
import pandas as pd

t="""int,float,date,str
001,3.31,2015/01/01,005"""
df = pd.read_csv(io.StringIO(t))
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int 1 non-null int64
float 1 non-null float64
date 1 non-null object
str 1 non-null int64
dtypes: float64(1), int64(2), object(1)
memory usage: 40.0+ bytes
You can see from the above that 001 and 005 are treated as int64, but the date string stays as str (reported as object dtype).
If we say everything is object, then essentially everything is str:
In [3]:
pd.read_csv(io.StringIO(t), dtype=object).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int 1 non-null object
float 1 non-null object
date 1 non-null object
str 1 non-null object
dtypes: object(4)
memory usage: 40.0+ bytes
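To make the "essentially everything is str" point concrete, here is a small sketch that reuses the sample text t from above and inspects the individual values rather than the column summary. It assumes nothing beyond pandas itself.

```python
import io

import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

df = pd.read_csv(io.StringIO(t), dtype=object)

# Each cell keeps its original text, including the leading zeros.
for column in df.columns:
    value = df.loc[0, column]
    print(column, repr(value), type(value).__name__)
```

Each value printed is the untouched text from the file, so '001' and '005' keep their zeros and '3.31' is not turned into a float.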
Here we force the int column to str and tell parse_dates to use the default date parser to parse the date column:
In [6]:
pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int 1 non-null object
float 1 non-null float64
date 1 non-null datetime64[ns]
str 1 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 40.0+ bytes
Similarly, we could have passed the to_datetime function to convert the dates:
In [5]:
pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int 1 non-null int64
float 1 non-null float64
date 1 non-null datetime64[ns]
str 1 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 40.0 bytes
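The two options can also be combined on different columns. The following is only a sketch under assumptions: the helper parse_slash_date and its explicit format string are illustrative choices, not something the original answer uses.

```python
import io

import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""


def parse_slash_date(cell):
    """Hypothetical converter: parse 'YYYY/MM/DD' text into a Timestamp."""
    return pd.to_datetime(cell, format="%Y/%m/%d")


df = pd.read_csv(
    io.StringIO(t),
    dtype={"int": str, "str": str},        # keep the leading zeros as text
    converters={"date": parse_slash_date},  # custom parsing for one column
)

print(df.loc[0, "int"], df.loc[0, "str"])
print(df.loc[0, "date"])
```

Note that the pandas documentation states that if a converter is specified for a column, it is applied instead of the dtype conversion for that column, so it is simplest to give each column one or the other.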