Original source: http://stackoverflow.com/questions/13690122/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas read_csv and UTF-16
Asked by Brian Keegan
I have a CSV text file encoded in UTF-16 (so as to preserve Unicode characters when others use Excel) but when doing a read_csv with Pandas 0.9.0, I get this cryptic error:
df = pd.read_csv('data.txt',encoding='utf-16',sep='\t',header=0)
df.head()
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-18-85da1383cd9e> in <module>()
----> 1 df = pd.read_csv('candidates-spanish.txt',encoding='utf-16',sep='\t',header=0)
2 df.head()
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in read_csv(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, keep_default_na, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze, **kwds)
248 kdict['delimiter'] = sep
249
--> 250 return _read(TextParser, filepath_or_buffer, kdict)
251
252 @Appender(_read_table_doc)
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(cls, filepath_or_buffer, kwds)
198 return parser
199
--> 200 return parser.get_chunk()
201
202 @Appender(_read_csv_doc)
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
853 elif not self._has_complex_date_col:
854 index = self._get_simple_index(alldata, columns)
--> 855 index = self._agg_index(index)
856
857 elif self._has_complex_date_col:
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _agg_index(self, index, try_parse_dates)
980 arr, _ = _convert_types(arr, col_na_values)
981 arrays.append(arr)
--> 982 index = MultiIndex.from_arrays(arrays, names=self.index_name)
983 return index
984
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in from_arrays(cls, arrays, sortorder, names)
1570
1571 return MultiIndex(levels=levels, labels=labels,
-> 1572 sortorder=sortorder, names=names)
1573
1574 @classmethod
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in __new__(cls, levels, labels, sortorder, names)
1254 assert(len(levels) == len(labels))
1255 if len(levels) == 0:
-> 1256 raise Exception('Must pass non-zero number of levels/labels')
1257
1258 if len(levels) == 1:
Exception: Must pass non-zero number of levels/labels
Reading the data in line by line with csv.reader, based on this example, implies that my data is not incorrectly formatted:
from io import BytesIO
import csv
with open('data.txt','rb') as f:
    r = f.read().decode('utf-16').encode('utf-8')
    for l in csv.reader(BytesIO(r),delimiter='\t'):
        print l
['Country', 'State/City', 'Title', 'Date', 'Catalogue', 'Wikipedia Election Page', 'Wikipedia Individual Page', 'Electoral Institution in Country', 'Twitter', 'CANDIDATE NAME 1', 'CANDIDATE NAME 2']
['Venezuela', 'N/A', 'President', '10/7/12', 'Hugo Rafael Chavez Frias', 'Hugo Ch\xc3\xa1vez', 'Hugo Ch\xc3\xa1vez', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez Fr\xc3\xadas', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez']
['Venezuela', 'N/A', 'President', '10/7/12', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles R.', 'Henrique Capriles', '']
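For readers on Python 3, the same line-by-line sanity check can be sketched without the byte round-trip, since open() can decode UTF-16 directly. This is only an illustration: the file name and the sample rows are made up, not the asker's data.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for the asker's file.
rows = [
    ["Country", "Title", "CANDIDATE NAME 1"],
    ["Venezuela", "President", "Hugo Chávez"],
]

# Write a small tab-separated file encoded as UTF-16 (Python prepends a BOM).
path = os.path.join(tempfile.mkdtemp(), "sample_utf16.txt")
with open(path, "w", encoding="utf-16", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Read it back in text mode; the csv docs recommend newline="" for csv files.
with open(path, encoding="utf-16", newline="") as f:
    parsed = list(csv.reader(f, delimiter="\t"))

for row in parsed:
    print(row)
```

If the rows come back intact here, the file itself is probably well formed and the problem lies in how the parser handles the encoding.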
Is there some pre-processing, an additional option in read_csv, or something else that needs to be done before pandas.read_csv can read a UTF-16 file? Thanks!
Accepted answer by Chang She
This is a bug, I think because the csv reader was passing back an extra empty line at the beginning. It worked for me on Python 2.7.3 and pandas 0.9.1 if I do:
In [36]: pd.read_csv(BytesIO(fh.read().decode('UTF-16').encode('UTF-8')), sep='\t', header=0)
Out[36]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns:
Country 43 non-null values
State/City 43 non-null values
Title 43 non-null values
Date 43 non-null values
Catalogue 43 non-null values
Wikipedia Election Page 43 non-null values
Wikipedia Individual Page 43 non-null values
Electoral Institution in Country 43 non-null values
Twitter 43 non-null values
CANDIDATE NAME 1 43 non-null values
CANDIDATE NAME 2 16 non-null values
dtypes: object(11)
I reported the bug here: https://github.com/pydata/pandas/issues/2418. On GitHub master it unfortunately causes a segfault in the C parser. We'll fix it.
Now, interestingly: https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful ;)
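The decode-then-re-encode workaround from the accepted answer can be sketched in Python 3 with the standard library alone. The payload below is hypothetical sample data, and the variable names are illustrative. The point it shows is that decoding with the "utf-16" codec reads the byte order mark to pick the byte order and does not leave it in the resulting text.

```python
import codecs
import io

# Hypothetical UTF-16 payload: a BOM followed by little-endian text.
sample = "Country\tName\nVenezuela\tHugo Chávez\n"
raw_utf16 = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")

# The "utf-16" codec consumes the BOM while decoding.
text = raw_utf16.decode("utf-16")

# Re-encode as UTF-8 and wrap the bytes in a file-like object,
# mirroring the BytesIO(...) argument used in the answer.
utf8_buffer = io.BytesIO(text.encode("utf-8"))

print(repr(text.splitlines()[0]))
```

A buffer built this way is what the answer passes to pd.read_csv together with sep='\t'.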
Answer by avances123
Python 3:
import pandas as pd
with open('data.txt', encoding='UTF-16') as f:
    df = pd.read_csv(f)
Answer by locojay
from StringIO import StringIO
import pandas as pd
a = ['Venezuela', 'N/A', 'President', '10/7/12', 'Hugo Rafael Chavez Frias', 'Hugo Ch\xc3\xa1vez', 'Hugo Ch\xc3\xa1vez', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez Fr\xc3\xadas', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez']
pd.read_csv(StringIO('\t'.join(a)), delimiter='\t')
This works here. Can you upload the head of your data so I can test?

