Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original URL and author information. StackOverflow original: http://stackoverflow.com/questions/17245415/

Time: 2020-08-19 00:51:27  Source: igfitidea

Read and Write CSV files including unicode with Python 2.7

Tags: python, csv, python-2.7, unicode, export

Asked by Ruxuan Ouyang

I am new to Python, and I have a question about how to use Python to read and write CSV files. My file contains text in languages such as German and French. With my code, the file is read correctly in Python, but when I write it to a new CSV file, the unicode text turns into strange characters.

The data is like:
enter image description here

And my code is:

import csv

f=open('xxx.csv','rb')
reader=csv.reader(f)

wt=open('lll.csv','wb')
writer=csv.writer(wt,quoting=csv.QUOTE_ALL)

wt.close()
f.close()

And the result is like:
enter image description here

What should I do to solve the problem?

Answered by Mark Tolonen

There is an example at the end of the csv module documentation that demonstrates how to deal with Unicode. The code below is copied directly from that example. Note that the strings read or written will be Unicode strings. Don't pass a byte string to UnicodeWriter.writerows, for example.

import csv,codecs,cStringIO

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        '''writerow(unicode) -> None
        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

with open('xxx.csv','rb') as fin, open('lll.csv','wb') as fout:
    reader = UnicodeReader(fin)
    writer = UnicodeWriter(fout,quoting=csv.QUOTE_ALL)
    for line in reader:
        writer.writerow(line)

Input (UTF-8 encoded):

American,美国人
French,法国人
German,德国人

Output:

"American","美国人"
"French","法国人"
"German","德国人"

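A note on the utf-8-sig default used by the classes above: the following is a minimal sketch, assuming only the standard codecs module, of what that codec does to the byte stream. The sample text is an arbitrary illustration, not data from the question. The codec prepends the UTF-8 byte order mark (BOM) when encoding and strips it when decoding, and some spreadsheet programs appear to use that mark as a hint that the file is UTF-8.

```python
import codecs

text = u'French,法国人'

# Encoding with utf-8-sig prepends the 3-byte UTF-8 BOM to the payload.
with_bom = text.encode('utf-8-sig')
plain = text.encode('utf-8')

starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)
same_payload = with_bom[len(codecs.BOM_UTF8):] == plain

# Decoding with utf-8-sig strips the BOM again;
# decoding with plain utf-8 keeps it as the character U+FEFF.
round_trip = with_bom.decode('utf-8-sig')
kept_bom = with_bom.decode('utf-8')

print(starts_with_bom)
print(same_payload)
print(round_trip == text)
print(kept_bom[0] == u'\ufeff')
```

If the reader used plain utf-8 on a file that starts with a BOM, the first cell of the first row would carry that extra U+FEFF character.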
Answered by dawg

Make sure you encode and decode as appropriate.

This example will roundtrip some example text in utf-8 to a csv file and back out to demonstrate:

# -*- coding: utf-8 -*-
import csv

tests={'German': [u'Straße',u'auslösen',u'zerstören'],
       'French': [u'français',u'américaine',u'épais'],
       'Chinese': [u'中國的',u'英語',u'美國人']}

with open('/tmp/utf.csv','w') as fout:
    writer=csv.writer(fout)    
    writer.writerows([tests.keys()])
    for row in zip(*tests.values()):
        row=[s.encode('utf-8') for s in row]
        writer.writerows([row])

with open('/tmp/utf.csv','r') as fin:
    reader=csv.reader(fin)
    for row in reader:
        temp=list(row)
        fmt=u'{:<15}'*len(temp)
        print fmt.format(*[s.decode('utf-8') for s in temp])

Prints:

German         Chinese        French
Straße         中國的            français
auslösen       英語             américaine
zerstören      美國人            épais

Answered by Oz123

Another alternative:

Use the code from the unicodecsv package ...

https://pypi.python.org/pypi/unicodecsv/

>>> import unicodecsv as csv
>>> from io import BytesIO
>>> f = BytesIO()
>>> w = csv.writer(f, encoding='utf-8')
>>> _ = w.writerow((u'é', u'ñ'))
>>> _ = f.seek(0)
>>> r = csv.reader(f, encoding='utf-8')
>>> next(r) == [u'é', u'ñ']
True

This module is API compatible with the STDLIB csv module.

Answered by tozCSS

I had the very same issue. The answer is that you are already doing it right; the problem lies with MS Excel. Try opening the file with another editor and you will see that your encoding already succeeded. To make MS Excel happy, move from UTF-8 to UTF-16. This should work:

import codecs
import csv
import StringIO

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel_tab, encoding="utf-16", **kwds):
        # Redirect output to a queue
        self.queue = StringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f

        # Force BOM
        if encoding=="utf-16":
            f.write(codecs.BOM_UTF16)

        self.encoding = encoding

    def writerow(self, row):
        # Modified from original: now using unicode(s) to deal with e.g. ints
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = data.encode(self.encoding)

        # strip BOM
        if self.encoding == "utf-16":
            data = data[2:]

        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
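The "Force BOM" and "strip BOM" steps above can be illustrated in isolation. This is a minimal sketch, assuming only the standard codecs module and two made-up tab-separated rows: every call to encode("utf-16") emits its own 2-byte BOM, so the writer puts the BOM at the start of the file once and drops it from each encoded row.

```python
import codecs

# Two hypothetical tab-separated rows, encoded one at a time as the writer does.
row1 = u'American\t美国人\r\n'.encode('utf-16')
row2 = u'French\t法国人\r\n'.encode('utf-16')

# Each separately encoded chunk starts with the platform's UTF-16 BOM.
both_have_bom = (row1.startswith(codecs.BOM_UTF16)
                 and row2.startswith(codecs.BOM_UTF16))

# Naive concatenation leaves a stray U+FEFF in the middle of the text.
naive = (row1 + row2).decode('utf-16')
stray_bom_inside = u'\ufeff' in naive

# Writing the BOM once and stripping it from every row gives clean text.
file_bytes = codecs.BOM_UTF16 + row1[2:] + row2[2:]
clean = file_bytes.decode('utf-16')

print(both_have_bom)
print(stray_bom_inside)
print(u'\ufeff' in clean)
```

The slice [2:] relies on the UTF-16 BOM being exactly two bytes long, which is why the class only strips it when the encoding is "utf-16".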

Answered by Joe S

I couldn't comment on Mark's answer above, but I made one modification that fixes the error raised when cell data is not unicode, e.g. float or int data. I replaced the writerow line in the UnicodeWriter class with "self.writer.writerow([s.encode("utf-8") if type(s)==types.UnicodeType else s for s in row])", so that the class became:

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        '''writerow(unicode) -> None
        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") if type(s)==types.UnicodeType else s for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

You will also need to "import types".
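The per-cell check can also be tried on its own. The sketch below is an illustration rather than the answer's exact code: encode_cell is a hypothetical helper name, and it uses isinstance against type(u'') instead of types.UnicodeType so that the same snippet also runs on Python 3, where the unicode type is simply str.

```python
# unicode on Python 2, str on Python 3
text_type = type(u'')

def encode_cell(value, encoding='utf-8'):
    """Hypothetical helper: encode text cells, pass other types through."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value

# A made-up row mixing text with int, float and None cells.
row = [u'German', u'德国人', 3, 2.5, None]
encoded = [encode_cell(cell) for cell in row]

print(encoded)
```

Only the text cells become byte strings; the numbers and None are handed to csv.writer unchanged, which formats them itself.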

Answered by weaming

Because str in Python 2 is actually bytes, if you want to write unicode to CSV you must encode unicode to str using the utf-8 encoding.

def py2_unicode_to_str(u):
    # unicode only exists in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')

Use class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds):

  • py2
    • The csvfile: open(fp, 'w')
    • pass keys and values as bytes encoded with utf-8
      • writer.writerow({py2_unicode_to_str(k): py2_unicode_to_str(v) for k,v in row.items()})
  • py3
    • The csvfile: open(fp, 'w')
    • pass a normal dict containing str as row to writer.writerow(row)
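For the py3 case, a minimal sketch, assuming Python 3: the field names, sample values and temporary file path are made up for illustration. The explicit encoding='utf-8' and newline='' arguments are additions not found in the list above; they are there so the result does not depend on the platform's default encoding and so the csv module controls the line endings.

```python
import csv
import os
import tempfile

# Hypothetical sample data: plain str keys and values, no manual encoding.
rows = [{'name': 'Python中国', 'language': '德国人'}]
path = os.path.join(tempfile.mkdtemp(), 'example.csv')

with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'language'])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back with the same encoding to check the round trip.
with open(path, 'r', newline='', encoding='utf-8') as f:
    read_back = [dict(row) for row in csv.DictReader(f)]

print(read_back == rows)
```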

Final code

# -*- coding: utf-8 -*-
import csv
import sys

is_py2 = sys.version_info[0] == 2

def py2_unicode_to_str(u):
    # unicode only exists in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')

with open('file.csv', 'w') as f:
    if is_py2:
        # a list of rows, each row a dict mapping field name to value
        data = [{u'Python中国': u'Python中国', u'Python中国2': u'Python中国2'}]

        # just one more line to handle this
        data = [{py2_unicode_to_str(k): py2_unicode_to_str(v) for k, v in row.items()} for row in data]
    else:
        data = [{'Python中国': 'Python中国', 'Python中国2': 'Python中国2'}]

    # the field names are the keys of the first row
    fields = list(data[0])
    writer = csv.DictWriter(f, fieldnames=fields)

    for row in data:
        writer.writerow(row)

Conclusion

In python3, just use the unicode str.

In Python 2, use unicode to handle text, and use str when I/O occurs.
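A minimal sketch of that rule, assuming Python 3 to run it as written (on Python 2 the file would also need a coding declaration for the non-ASCII literal). The in-memory streams and sample text are illustrative stand-ins for real files: bytes are decoded once when they enter the program, all processing happens on text, and the text is encoded once on the way out.

```python
import io

# Bytes as they would arrive from a file opened in binary mode.
raw_in = u'Straße,français\n'.encode('utf-8')
source = io.BytesIO(raw_in)
sink = io.BytesIO()

# Decode once, at the input boundary.
text = source.read().decode('utf-8')

# Work on text: lengths and splits count characters, not bytes.
fields = text.strip().split(u',')
joined = u';'.join(fields) + u'\n'

# Encode once, at the output boundary.
sink.write(joined.encode('utf-8'))

print(len(raw_in), len(text))
print(fields)
```

The byte length and the character length differ because ß and ç each take two bytes in UTF-8, which is the kind of mismatch that appears when text is processed while still encoded.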
