Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/16929056/

Time: 2020-09-13 20:52:36  Source: igfitidea

Pandas read_csv dtype leading zeros

python string csv pandas

Asked by Radical Edward

So I'm reading in a station codes csv file from NOAA which looks like this:

"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"

The first two columns contain weather station codes, which sometimes have leading zeros. When pandas imports them without a specified dtype, they turn into integers. It's not really that big of a deal, because I can loop through the dataframe index and replace them with something like "%06d" % i, since they are always six digits, but you know... that's the lazy man's way.
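
The padding workaround described in this paragraph can be sketched without an explicit loop. This is only an illustration: the inline sample stands in for the downloaded file, and the six-character width is taken from the description of the USAF codes.

```python
import io
import pandas as pd

# Inline stand-in for the station file (two rows from the sample above).
sample = io.StringIO(
    '"USAF","WBAN","STATION NAME"\n'
    '"006852","99999","SENT"\n'
    '"007005","99999","CWOS 07005"\n'
)

# With no dtype given, pandas infers integers and the leading zeros are lost.
df = pd.read_csv(sample)

# Restore the zeros after the fact by padding back to six characters.
padded = df["USAF"].astype(str).str.zfill(6)
```

This only works because the width is known in advance; it cannot recover codes whose original length varied.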

The csv is obtained using this code:

file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv','wb')
output.write(file.read())
output.close()

which is all well and good but when I go and try and read it using this:

import numpy as np
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': np.str, 'WBAN': np.str})

or

import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': str, 'WBAN': str})

I get a nasty error message:

  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
  File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
  File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
  File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
  File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
  File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood

It's a pretty big csv (31k rows) so maybe that has something to do with it?

Answered by firelynx

This is an issue of pandas dtype guessing.

Pandas sees numbers and guesses you want them to be numbers.

To make pandas not doubt your intentions, you should set the dtype you want: object

pd.read_csv('filename.csv', dtype={'leading_zero_column_name': object})

will do the trick.
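
As a quick check of this suggestion, here is a minimal sketch that applies dtype=object to the two code columns from the question. The inline sample is an assumption standing in for the real file.

```python
import io
import pandas as pd

# Inline stand-in for the station file (two rows from the question's sample).
sample = io.StringIO(
    '"USAF","WBAN","STATION NAME"\n'
    '"006852","99999","SENT"\n'
    '"007005","99999","CWOS 07005"\n'
)

# Asking for object dtype keeps the raw text, so the leading zeros survive.
df = pd.read_csv(sample, dtype={"USAF": object, "WBAN": object})

codes = df["USAF"].tolist()
```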

Answered by Lev Landau

This problem caused me all sorts of headaches when parsing a file with serial numbers. For unknown reasons 00794 and 000794 are two distinct serial numbers. I eventually came up with

converters = {'serial_number': str}
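
A minimal sketch of how that converters mapping might be passed to read_csv. The two-row file and the qty column are made up for illustration; only the serial_number name comes from the answer.

```python
import io
import pandas as pd

# Hypothetical file in which 00794 and 000794 are different serial numbers.
data = io.StringIO("serial_number,qty\n00794,1\n000794,2\n")

# The converter is applied to each raw field, so the text is kept verbatim.
converters = {"serial_number": str}
df = pd.read_csv(data, converters=converters)

serials = df["serial_number"].tolist()
```

Columns without a converter, such as qty here, still go through the usual type inference.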

Answered by Andy Hayden

It looks like you have to specify the length of the string if you don't want it to be an object.
For example:

dtype={'USAF': '|S6'}

I can't find the reference for this, but I seem to recall Wes discussing this very issue (perhaps in a talk). He suggested that numpy doesn't allow "proper" variable length strings (see this question/answer), and using the maximum length to populate the array will more often than not be incredibly space inefficient (even if a string is short it'll use as much space as the longest string).
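
The space trade-off described here can be seen directly in NumPy. This is a small sketch using two station names from the question; the width of 10 is simply the length of the longer name.

```python
import numpy as np

names = ["SENT", "CWOS 07005"]

# Fixed-width byte strings: every element reserves 10 bytes,
# even though "SENT" only needs 4.
fixed = np.array(names, dtype="S10")
bytes_per_item = fixed.itemsize
total_bytes = fixed.nbytes

# An object array stores references to ordinary Python strings instead,
# so the strings can have any length.
boxed = np.array(names, dtype=object)
```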

As @Wes points out, this is also a case where:

dtype={'USAF': object}

works just as well.

Answered by Chris Conlan

You can pass a dictionary of functions to converters, where the keys are numeric column indices. So, if you don't know what your column names will be, you can do this (provided you have fewer than 100 columns).

pd.read_csv('some_file.csv', converters={i: str for i in range(100)})
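
If the goal is simply to keep every column as text without knowing the column names, one alternative (not part of this answer) is to pass a single dtype for the whole file. A minimal sketch, assuming an inline sample in place of some_file.csv:

```python
import io
import pandas as pd

# Inline stand-in for a file whose column names are not known in advance.
sample = io.StringIO(
    '"USAF","WBAN","STATION NAME"\n'
    '"006852","99999","SENT"\n'
    '"007005","99999","CWOS 07005"\n'
)

# A single dtype applies to every column, so nothing is parsed as a number.
df = pd.read_csv(sample, dtype=str)

usaf = df["USAF"].tolist()
wban = df["WBAN"].tolist()
```

Unlike a str converter, this approach still treats empty fields as missing values.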

Answered by Acumenus

With Pandas 1, how about:

df = pd.read_csv(..., dtype={"my_confusing_col": "string"})

Note that this will use the column dtype string, which uses pd.NA for any missing values. All leading zeros will of course be preserved.
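
A small sketch of that behaviour, assuming pandas 1.0 or later. The two-row file is made up for illustration, with an empty field in the second row to show the missing-value handling.

```python
import io
import pandas as pd

# Hypothetical file: the second row has no value in my_confusing_col.
data = io.StringIO("my_confusing_col,label\n00794,a\n,b\n")

df = pd.read_csv(data, dtype={"my_confusing_col": "string"})

first = df["my_confusing_col"][0]    # leading zeros kept as text
second = df["my_confusing_col"][1]   # the empty field becomes a missing value
```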