Original source: http://stackoverflow.com/questions/28682562/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Pandas read_csv mixed types columns as string
Asked by uday
Is there any option in pandas' read_csv function that can automatically convert every item of an object dtype to str?
For example, I get the following when trying to read a CSV file:
mydata = pandas.read_csv(myfile, sep="|", header=None)
C:\...\pandas\io\parsers.py:1159: DtypeWarning: Columns (6,635) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
Is there a way such that (i) the warning is suppressed from printing, but (ii) I can capture the warning message in a string from which I can extract the specific columns, e.g. 6 and 635 in this case (so that I can fix the dtype subsequently)? Or, alternatively, can I specify that whenever there are mixed types, the read_csv function should convert the values in that column to str?
I'm using Python 3.4.2 and Pandas 0.15.2.
Answered by mfitzp
The DtypeWarning is a Warning which can be caught and acted on. See here for more information. To catch the warning, we need to wrap the execution in a warnings.catch_warnings block. The warning message and the affected columns can be extracted using a regex, then used to set the correct column type using .astype(target_type):
import re
import pandas
import warnings

myfile = 'your_input_file_here.txt'
target_type = str  # The desired output type

with warnings.catch_warnings(record=True) as ws:
    warnings.simplefilter("always")

    mydata = pandas.read_csv(myfile, sep="|", header=None)
    print("Warnings raised:", ws)
    # We have an error on specific columns, try and load them as string
    for w in ws:
        s = str(w.message)
        print("Warning message:", s)
        match = re.search(r"Columns \(([0-9,]+)\) have mixed types\.", s)
        if match:
            columns = match.group(1).split(',')  # Get columns as a list
            columns = [int(c) for c in columns]
            print("Applying %s dtype to columns:" % target_type, columns)
            mydata.iloc[:,columns] = mydata.iloc[:,columns].astype(target_type)
The result should be the same DataFrame with the problematic columns set to a str type. It is worth noting that string columns in a Pandas DataFrame are reported as object.
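The warning text also points at a more direct route: the dtype option of read_csv. The following is only a minimal sketch of that option, not part of the original answer. The in-memory sample data and the names `sample`, `all_str` and `some_str` are made up for illustration; with a real file you would pass the path instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited sample standing in for the real file.
# Column 2 mixes numbers and text.
sample = "1|a|10\n2|b|x\n3|c|30\n"

# Option 1: read every column as str, so nothing is inferred as a number.
all_str = pd.read_csv(io.StringIO(sample), sep="|", header=None, dtype=str)

# Option 2: force only selected columns to str (here column 2);
# the remaining columns are still inferred as usual.
some_str = pd.read_csv(io.StringIO(sample), sep="|", header=None, dtype={2: str})

print(all_str.dtypes)
print(some_str.dtypes)
```

If the problematic columns are already known (6 and 635 in the question), a mapping such as `dtype={6: str, 635: str}` should avoid both the warning and the later conversion step.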
Answered by Acumenus
As noted in the warning message itself, the simplest way to keep pd.read_csv from returning mixed dtypes is to set low_memory=False:
df = pd.read_csv(..., low_memory=False)
This luxury is however not available when concatenating multiple dataframes using pd.concat.
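One reading of the remark about pd.concat is that frames read separately can each infer a different type for the same column, and concatenating them then produces a column holding mixed Python types. The sketch below illustrates that reading and one possible workaround, which is to cast the column to str in every frame before concatenating. The frames `a` and `b` and the column names `id` and `v` are hypothetical.

```python
import pandas as pd

# Two hypothetical frames, as if read from separate files:
# "id" was inferred as integer in one and as text in the other.
a = pd.DataFrame({"id": [1, 2], "v": [0.1, 0.2]})
b = pd.DataFrame({"id": ["x3", "x4"], "v": [0.3, 0.4]})

# Concatenating as-is leaves "id" holding both int and str values.
mixed = pd.concat([a, b], ignore_index=True)
print(sorted({type(v).__name__ for v in mixed["id"]}))

# Casting the column to str in each frame first gives a uniform column.
frames = [f.astype({"id": str}) for f in (a, b)]
clean = pd.concat(frames, ignore_index=True)
print(list(clean["id"]))
```

Casting before the concat keeps the fix local to the one column that differs, and leaves the numeric column `v` untouched.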

