Use python requests to download CSV
Disclaimer: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/35371043/
Asked by viviwill
Here is my code:
import csv
import requests
with requests.Session() as s:
    s.post(url, data=payload)
    download = s.get('url that directly download a csv report')
This gives me access to the csv file. I tried different methods to deal with the download:
This will give the csv file in one string:
print download.content
This prints the first row and returns an error: _csv.Error: new-line character seen in unquoted field
cr = csv.reader(download, dialect=csv.excel_tab)
for row in cr:
    print row
This will print a letter in each row and won't print the whole thing:
cr = csv.reader(download.content, dialect=csv.excel_tab)
for row in cr:
    print row
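A quick way to see why the last attempt printed one letter per row: csv.reader iterates whatever it is given, and iterating a plain string yields individual characters, each treated as a one-character line. A minimal demonstration (Python 3 shown here, though the behavior is the same in 2.x):

```python
import csv

# Iterating a plain string yields single characters, so csv.reader
# treats each character as its own one-field row:
print(list(csv.reader("abc")))                    # [['a'], ['b'], ['c']]

# Splitting into lines first gives csv.reader real rows to parse:
print(list(csv.reader("a,b\nc,d".splitlines())))  # [['a', 'b'], ['c', 'd']]
```

This is why the answers below split the response content into lines (or wrap it in a file-like object) before handing it to the csv module.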
My question is: what's the most efficient way to read a csv file in this situation, and how do I download it?
Thanks
Accepted answer by HEADLESS_0NE
This should help:
import csv
import requests
CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
with requests.Session() as s:
    download = s.get(CSV_URL)

    decoded_content = download.content.decode('utf-8')

    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    for row in my_list:
        print(row)
Output sample:
['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type', 'sale_date', 'price', 'latitude', 'longitude']
['3526 HIGH ST', 'SACRAMENTO', '95838', 'CA', '2', '1', '836', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '59222', '38.631913', '-121.434879']
['51 OMAHA CT', 'SACRAMENTO', '95823', 'CA', '3', '1', '1167', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68212', '38.478902', '-121.431028']
['2796 BRANCH ST', 'SACRAMENTO', '95815', 'CA', '2', '1', '796', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68880', '38.618305', '-121.443839']
['2805 JANETTE WAY', 'SACRAMENTO', '95815', 'CA', '2', '1', '852', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '69307', '38.616835', '-121.439146']
[...]
Related question with answer: https://stackoverflow.com/a/33079644/295246
Edit: Other answers are useful if you need to download large files (i.e. stream=True).
Answered by Ares Ou
From a little searching, I understand that the file should be opened in universal newline mode, which you cannot do directly with the response content (I guess).
To finish the task, you can either save the downloaded content to a temporary file, or process it in memory.
Save as file:
import requests
import csv
import os

temp_file_name = 'temp_csv.csv'
url = 'http://url.to/file.csv'

download = requests.get(url)
with open(temp_file_name, 'w') as temp_file:
    temp_file.write(download.content)  # write(), not writelines(): content is a single string

with open(temp_file_name, 'rU') as temp_file:
    csv_reader = csv.reader(temp_file, dialect=csv.excel_tab)
    for line in csv_reader:
        print line

# delete the temp file after processing
os.remove(temp_file_name)
In memory:
(To be updated)
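The answerer never filled in the in-memory variant. A minimal sketch of what it might look like, using io.StringIO as a file-like wrapper (the sample text below stands in for the decoded response body, and the tab dialect matches the code above):

```python
import csv
import io

# stand-in for the decoded response body (download.content.decode(...))
text = 'col1\tcol2\nval1\tval2\n'

# wrap the text in a file-like object so csv.reader can consume it directly
csv_reader = csv.reader(io.StringIO(text), dialect=csv.excel_tab)
for line in csv_reader:
    print(line)  # ['col1', 'col2'] then ['val1', 'val2']
```

This avoids the temporary file entirely at the cost of holding the whole body in memory.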
Answered by aheld
You can update the accepted answer to use the iter_lines method of requests if the file is very large:
import csv
import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'

with requests.Session() as s:
    download = s.get(CSV_URL, stream=True)  # stream=True so iter_lines() does not load the whole body first
    # decode_unicode=True makes iter_lines() yield str, so no extra .decode() is needed
    line_iterator = download.iter_lines(decode_unicode=True)
    cr = csv.reader(line_iterator, delimiter=',')
    for row in cr:
        print(row)
Answered by The Aelfinn
To simplify these answers, and increase performance when downloading a large file, the below may work a bit more efficiently.
import requests
from contextlib import closing
import csv

url = "http://download-and-process-csv-efficiently/python.csv"

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(r.iter_lines(), delimiter=',', quotechar='"')
    for row in reader:
        print row
By setting stream=True in the GET request, when we pass r.iter_lines() to csv.reader(), we are passing a generator to csv.reader(). By doing so, we enable csv.reader() to lazily iterate over each line in the response with for row in reader.
This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
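The key point is that csv.reader accepts any iterable of lines, including a generator, so rows are only parsed as the generator is consumed. A small illustration with an in-memory generator standing in for r.iter_lines():

```python
import csv

# a generator standing in for r.iter_lines(): lines are produced lazily
lines = (s for s in ['x,y', '1,2', '3,4'])
reader = csv.reader(lines, delimiter=',')

# nothing is parsed up front; each next() pulls one line from the generator
print(next(reader))  # ['x', 'y']
print(next(reader))  # ['1', '2']
```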
Answered by Antti Haapala
You can also use DictReader to iterate dictionaries of {'columnname': 'value', ...}:
import csv
import requests

response = requests.get('http://example.test/foo.csv')
# decode each line so csv.DictReader gets str rather than bytes on Python 3
reader = csv.DictReader(line.decode('utf-8') for line in response.iter_lines())
for record in reader:
    print(record)
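For illustration, with an in-memory list standing in for the (decoded) lines of the response: DictReader takes the first line as the header and maps each following row onto it.

```python
import csv

# an in-memory list standing in for the decoded lines of the response
lines = ['city,zip', 'SACRAMENTO,95838', 'SACRAMENTO,95823']
for record in csv.DictReader(lines):
    print(record['city'], record['zip'])  # SACRAMENTO 95838, then SACRAMENTO 95823
```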
Answered by wescpy
I like the answers from The Aelfinn and aheld. I can improve them only by shortening them a bit more, removing superfluous pieces, using a real data source, making them 2.x & 3.x-compatible, and maintaining the high level of memory-efficiency seen elsewhere:
import csv
import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'

with requests.get(CSV_URL, stream=True) as r:
    lines = (line.decode('utf-8') for line in r.iter_lines())
    for row in csv.reader(lines):
        print(row)
Too bad 3.x is less flexible CSV-wise, because the iterator must emit Unicode strings (while requests yields bytes); the 2.x-only version, for row in csv.reader(r.iter_lines()):, is more Pythonic (shorter and easier to read). Anyhow, note the 2.x/3.x solution above won't handle the situation described by the OP where a NEWLINE is found unquoted in the data read.
For the part of the OP's question regarding downloading (vs. processing) the actual CSV file, here's another script that does that: 2.x & 3.x-compatible, minimal, readable, and memory-efficient:
import os
import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'

with open(os.path.split(CSV_URL)[1], 'wb') as f, \
        requests.get(CSV_URL, stream=True) as r:
    for line in r.iter_lines():
        f.write(line + b'\n')  # iter_lines() strips newlines, so add them back
Answered by aamir23
The following approach worked well for me. I also did not need to use the csv.reader() or csv.writer() functions, which I feel makes the code cleaner. The code is compatible with Python 2 and Python 3.
from six.moves import urllib

DOWNLOAD_URL = "https://raw.githubusercontent.com/gjreda/gregreda.com/master/content/notebooks/data/city-of-chicago-salaries.csv"
DOWNLOAD_PATH = "datasets/city-of-chicago-salaries.csv"  # forward slash avoids backslash-escape issues

urllib.request.urlretrieve(DOWNLOAD_URL, DOWNLOAD_PATH)
Note: six is a package that helps in writing code compatible with both Python 2 and Python 3. For additional details regarding six, see: What does from six.moves import urllib do in Python?
Answered by Michal Skop
I use this code (I use Python 3):
import csv
import io
import requests

url = "http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv"
r = requests.get(url)
r.encoding = 'utf-8'  # useful if encoding is not sent (or not sent properly) by the server

csvio = io.StringIO(r.text, newline="")
data = []
for row in csv.DictReader(csvio):
    data.append(row)
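The newline="" argument matters for data like the OP's, where a field may contain an embedded newline: it tells io.StringIO not to translate newlines, leaving them for the csv module to handle. A small illustration (the sample text stands in for r.text):

```python
import csv
import io

# stand-in for r.text: a quoted field containing an embedded newline
text = 'name,notes\nalice,"line1\nline2"\n'

data = []
for row in csv.DictReader(io.StringIO(text, newline="")):
    data.append(row)

# the newline survives inside the quoted field instead of breaking the row
print(repr(data[0]['notes']))  # 'line1\nline2'
```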