Read .csv file from URL into Python 3.x - _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

Note: this page is based on a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/18897029/

python, url, csv, python-3.x

Asked by Chris

I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from the National Library of Medicine FTP site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.

I have tried the following code:

import csv
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]

It results in the following error:

Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
data = [row for row in csvfile]
File "<pyshell#4>", line 1, in <listcomp>
data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

I presume I should be working with strings, not bytes? Any help with this simple problem, and an explanation as to what is going wrong, would be greatly appreciated.

Accepted answer by Diego Herranz

The problem is that urllib returns bytes. As a proof, you can try downloading the csv file with your browser and opening it as a regular file, and the problem is gone.

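For instance, a minimal sketch assuming the file has already been saved locally as file_list.csv (a hypothetical path) and can therefore be opened in text mode:

import csv

# Assumes the csv was downloaded beforehand, e.g. with a browser
with open('file_list.csv', newline='', encoding='utf-8') as f:  # text mode: rows come back as strings
    data = list(csv.reader(f))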

A similar problem was addressed here.

It can be solved by decoding the bytes to strings with the appropriate encoding. For example:

import csv
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
# Decode with the appropriate encoding; splitlines() is needed so that
# csv.reader iterates over lines rather than over individual characters.
csvfile = csv.reader(ftpstream.read().decode('utf-8').splitlines())
data = [row for row in csvfile]

The last line could also be data = list(csvfile), which can be easier to read.

By the way, since the csv file is very big, this approach can be slow and memory-consuming. Maybe it would be preferable to use a generator.

EDIT: Using codecs, as proposed by Steven Rumbalski, so it's not necessary to read the whole file before decoding. Memory consumption is reduced and speed is increased.

import csv
import urllib.request
import codecs

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
    print(line)  # do something with line

Note that the list is not created either, for the same reason.

Answer by HennyH

urlopen will return a urllib.response.addinfourl instance for an FTP request.

For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object which can work as a context manager...

>>> urllib2.urlopen(url)
<addinfourl at 48868168L whose fp = <addclosehook at 48777416L whose fp = <socket._fileobject object at 0x0000000002E52B88>>>
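
Since the returned object can work as a context manager, the stream can also be wrapped in a with statement so the connection is closed automatically. A minimal sketch, assuming Python 3's urllib.request:

import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
with urllib.request.urlopen(url) as ftpstream:
    first_line = ftpstream.readline()  # still bytes; the stream is closed on exit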

At this point ftpstream is a file-like object; using .read() would return the contents, but csv.reader requires an iterable of strings in this case:

Defining a generator like so:

def to_lines(f):
    line = f.readline()
    while line:
        yield line.decode('utf-8')  # decode each bytes line so csv.reader gets strings
        line = f.readline()

We can create our csv reader like so:

reader = csv.reader(to_lines(ftpstream))

And with a URL:

url = "http://pic.dhe.ibm.com/infocenter/tivihelp/v41r1/topic/com.ibm.ismsaas.doc/reference/CIsImportMinimumSample.csv"

The code:

for row in reader: print(row)

Prints

>>> 
['simpleci']
['SCI.APPSERVER']
['SRM_SaaS_ES', 'MXCIImport', 'AddChange', 'EN']
['CI_CINUM']
['unique_identifier1']
['unique_identifier2']
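
As a side note, in Python 3 the helper generator isn't strictly necessary, because the response returned by urlopen can itself be iterated line by line. A minimal sketch (the decode step is still required, for the same bytes-vs-strings reason as in the accepted answer):

import csv
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
# Each iteration yields one raw bytes line; decode it before csv parsing
reader = csv.reader(line.decode('utf-8') for line in ftpstream)
for row in reader:
    print(row)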

Answer by Irvin H.

Even though there is already an accepted answer, I thought I'd add to the body of knowledge by showing how I achieved something similar using the requests package (which is sometimes seen as an alternative to urllib.request).

The basis of using codecs.iterdecode() to solve the original problem is still the same as in the accepted answer.

import codecs
from contextlib import closing
import csv
import requests

# Note: requests does not support ftp:// URLs; the same path is also served over HTTPS
url = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'))
    for row in reader:
        print(row)

Here we also see the use of streaming, provided through the requests package, in order to avoid having to load the entire file over the network into memory first (which could take a long time if the file is large).

I thought it might be useful since it helped me, as I was using requests rather than urllib.request in Python 3.6.

Some of the ideas (e.g. using closing()) are picked up from this similar post.
