Original link: http://stackoverflow.com/questions/18897029/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Read .csv file from URL into Python 3.x - _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Asked by Chris
I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from the National Library of Medicine FTP site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.
I have tried the following code:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]
It results in the following error:
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    data = [row for row in csvfile]
  File "<pyshell#4>", line 1, in <listcomp>
    data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
I presume I should be working with strings, not bytes? Any help with this simple problem, and an explanation of what is going wrong, would be greatly appreciated.
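To make the presumption concrete (a small illustration added here, not part of the original question): csv.reader accepts any iterable that yields str lines, and feeding it bytes reproduces exactly this error:

import csv

# an iterable of str lines parses fine
print(list(csv.reader(["a,b,c", "1,2,3"])))   # [['a', 'b', 'c'], ['1', '2', '3']]

# an iterable of bytes raises the error from the traceback above:
# list(csv.reader([b"a,b,c"]))   # _csv.Error: iterator should return strings, not bytes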
Accepted answer by Diego Herranz
The problem is that urllib returns bytes. As proof, you can try downloading the csv file with your browser and opening it as a regular file, and the problem is gone.
A similar problem was addressed here. It can be solved by decoding the bytes to strings with the appropriate encoding. For example:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream.read().decode('utf-8').splitlines())  # decode with the appropriate encoding; splitlines() is needed so csv.reader iterates lines rather than single characters
data = [row for row in csvfile]
The last line could also be data = list(csvfile), which can be easier to read.
By the way, since the csv file is very big, this approach can be slow and memory-consuming. Maybe it would be preferable to use a generator.
EDIT: Using codecs, as proposed by Steven Rumbalski, it's not necessary to read the whole file in order to decode it. Memory consumption is reduced and speed is increased.
import csv
import urllib.request
import codecs
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
    print(line)  # do something with line
Note that the list is not created either, for the same reason.
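A further alternative (not from the original answer) is io.TextIOWrapper, which wraps a binary stream so iteration yields decoded str lines; a minimal sketch, assuming the response object exposes the usual buffered binary interface:

import csv
import io
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)

# TextIOWrapper decodes incrementally, so the whole file is never held in memory
csvfile = csv.reader(io.TextIOWrapper(ftpstream, encoding='utf-8'))
for line in csvfile:
    print(line)  # do something with line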
Answered by HennyH
urlopen will return a urllib.response.addinfourl instance for an ftp request.
For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object which can work as context manager...
>>> urllib2.urlopen(url)
<addinfourl at 48868168L whose fp = <addclosehook at 48777416L whose fp = <socket._fileobject object at 0x0000000002E52B88>>>
At this point ftpstream is a file-like object; using .read() would return the contents, but csv.reader requires an iterable in this case:
Defining a generator like so:
def to_lines(f):
    # decode each line so csv.reader receives str, not bytes (required in Python 3)
    line = f.readline()
    while line:
        yield line.decode('utf-8')
        line = f.readline()
We can create our csv reader like so:
reader = csv.reader(to_lines(ftpstream))
And with a url
url = "http://pic.dhe.ibm.com/infocenter/tivihelp/v41r1/topic/com.ibm.ismsaas.doc/reference/CIsImportMinimumSample.csv"
The code:
for row in reader:
    print(row)
Prints
>>>
['simpleci']
['SCI.APPSERVER']
['SRM_SaaS_ES', 'MXCIImport', 'AddChange', 'EN']
['CI_CINUM']
['unique_identifier1']
['unique_identifier2']
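Since Python file-like objects are already iterable line by line, the helper generator could arguably be replaced by a generator expression that also handles the bytes-to-str decoding; a sketch under that assumption:

import csv
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)

# iterating a file-like object yields one (bytes) line at a time
reader = csv.reader(line.decode('utf-8') for line in ftpstream)
for row in reader:
    print(row)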
Answered by Irvin H.
Even though there is already an accepted answer, I thought I'd add to the body of knowledge by showing how I achieved something similar using the requests package (which is sometimes seen as an alternative to urllib.request).
The basis of using codecs.iterdecode() to solve the original problem is still the same as in the accepted answer.
import codecs
from contextlib import closing
import csv
import requests

# requests has no ftp:// connection adapter, so this assumes the same
# path is also available over HTTPS on ftp.ncbi.nlm.nih.gov
url = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'))
    for row in reader:
        print(row)
Here we also see the use of streaming, provided through the requests package, in order to avoid having to load the entire file over the network into memory first (which could take a long time if the file is large).
I thought it might be useful since it helped me, as I was using requests rather than urllib.request in Python 3.6.
Some of the ideas (e.g. using closing()) are taken from this similar post.
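One closing note: in newer versions of requests, Response objects are context managers themselves, so the closing() wrapper may be unnecessary; a sketch assuming such a version:

import codecs
import csv
import requests

url = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"  # assumed HTTPS mirror of the ftp path

# Response supports the with-statement directly in recent requests releases
with requests.get(url, stream=True) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'))
    for row in reader:
        print(row)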