Original source: http://stackoverflow.com/questions/45107839/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Python - Login and download specific file from website
Question
My attempt to log into a website and download a specific file has hit a wall.
Specifically, I am logging into this website: http://www.gaez.iiasa.ac.at/w/ctrl?_flow=Vwr&_view=Welcome&fieldmain=main_lr_lco_cult&idPS=0&idAS=0&idFS=0
so that I can select specific variables and parameters before I download the file and save it as an Excel or CSV file.
In particular, I want to toggle the highlighted inputs, then select the type of crop, water supply, input level, time period, and geographic areas, and finally download the file via the 'Visualization and Download' button.
For example, I would like to get the data for Wheat (Crop), rain-fed (Water Supply), High (Input Level), 1961-1990 (Time Period, Baseline), United States of America (Geographic Areas). Then I want to save it as an Excel file.
This is my code so far:
# Import library
import requests
# Define url, username, and password
url = 'http://www.gaez.iiasa.ac.at/w/ctrl?_flow=Vwr&_view=Welcome&fieldmain=main_lr_lco_cult&idPS=0&idAS=0&idFS=0'
user, password = 'Username', 'Password'
resp = requests.get(url, auth=(user, password))
Perhaps I'm too deep in the trenches of the entire process to see an easy, viable solution, but any help is greatly appreciated.
Accepted answer by Radosław Załuska
The website you linked uses an HTTP POST based login form. In your code you have:
resp = requests.get(url, auth=(user, password))
which uses HTTP Basic authentication (http://docs.python-requests.org/en/master/user/authentication/#basic-authentication) rather than the site's login form.
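For illustration, here is roughly what `auth=(user, password)` adds to a request: a single `Authorization` header in the HTTP Basic scheme. A form-based login does not read that header, which is why the original call does not log you in. The helper name `basic_auth_header` is made up for this sketch, and it uses only the standard library rather than requests itself.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic 'Authorization' header.

    Hypothetical helper for illustration; requests builds an equivalent
    header internally when you pass auth=(user, password).
    """
    # The credentials are joined with a colon and base64-encoded.
    # This is an encoding, not encryption.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


header = basic_auth_header("Username", "Password")
print(header)
```

Because the header is only base64-encoded, anyone who sees it can recover the credentials.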
To log in to this site you need two things:
- a persistent session cookie
- an HTTP POST request to the login form URL
First, create a session object that will hold the cookies sent by the server (http://docs.python-requests.org/en/master/user/advanced/#session-objects):
s = requests.Session()
Next, visit the site using a GET request. The server will respond with a cookie for your session, and the session object will store it.
s.get(site_url)
The final step is to log in to the site. You can use Firebug or the Chrome Developer Console (depending on which browser you use) to examine which fields need to be sent (go to the Network tab).
s.post(site_url, data={'_username': 'user', '_password': 'pass'})
These two fields (_username, _password) seem to be valid for your site, but when I examined the data that was sent during the POST request, there were more fields. I don't know whether they are necessary.
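If the extra fields turn out to be required, one possible approach is to read the hidden inputs from the login page and send them along with the credentials. The sketch below uses only the standard library; the class and function names are made up, and `SAMPLE_HTML` is an invented stand-in, not the real markup of the site.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenInputCollector()
    parser.feed(html)
    return parser.fields


# Invented example markup; inspect the real login page for the actual fields.
SAMPLE_HTML = (
    '<form method="post">'
    '<input type="hidden" name="_flow" value="Vwr">'
    '<input type="text" name="_username">'
    '<input type="password" name="_password">'
    '<input type="hidden" name="token" value="abc">'
    "</form>"
)

# Merge the hidden fields with the credentials to form the POST payload.
payload = {**hidden_fields(SAMPLE_HTML), "_username": "user", "_password": "pass"}
```

In practice you would pass the text of the login page response to `hidden_fields` and send the resulting `payload` as the `data` argument of the POST.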
After that you will be authenticated. The next step is to request the URL of the file you would like to download.
s.get(file_url)
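Put together, the steps might look like the sketch below. It assumes the login form accepts just `_username` and `_password` (the site may need more fields) and that you already know the file URL. The function name, the timeout value, and the output path are illustrative choices, not something the site dictates.

```python
import requests


def login_and_download(session: requests.Session, site_url, file_url,
                       username, password, out_path):
    """Log in with a form POST, then save the file at file_url to out_path."""
    # 1. Initial GET so the server can set its session cookie.
    session.get(site_url, timeout=30).raise_for_status()

    # 2. POST the login form. Field names follow the answer above;
    #    the real form may require additional fields.
    login = session.post(
        site_url,
        data={"_username": username, "_password": password},
        timeout=30,
    )
    login.raise_for_status()

    # 3. Fetch the file with the same (now authenticated) session
    #    and write the raw bytes to disk.
    resp = session.get(file_url, timeout=30)
    resp.raise_for_status()
    with open(out_path, "wb") as fh:
        fh.write(resp.content)
    return out_path


# Usage (placeholders, not real values):
#   with requests.Session() as s:
#       login_and_download(s, site_url, file_url, "user", "pass", "download.csv")
```

A 200 status only says the server answered. It is worth checking the saved file, since a failed login may still return a normal HTML page.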
The link you provided contains a query string with various options that are probably related to the options you want highlighted. You can use it to download the file with the desired options.
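To see which options the query string carries, and to build a variant of the URL, something like the following should work with the standard library alone. The function names are made up, and the override value `idPS=1` is purely hypothetical: I do not know what values the site accepts or what each parameter means.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

URL = (
    "http://www.gaez.iiasa.ac.at/w/ctrl?_flow=Vwr&_view=Welcome"
    "&fieldmain=main_lr_lco_cult&idPS=0&idAS=0&idFS=0"
)


def query_options(url):
    """Return the query-string parameters of url as a dict."""
    return dict(parse_qsl(urlsplit(url).query))


def with_options(url, overrides):
    """Return url with the given query parameters added or replaced."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


options = query_options(URL)
# Hypothetical override, only to show the mechanics.
modified_url = with_options(URL, {"idPS": "1"})
```

With requests you can also pass such a dict as `params=` to `s.get(...)` instead of building the URL by hand.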
Warning Note
Note that this site does not use a secure HTTPS connection. Any credentials you provide will travel over the internet unencrypted and could be seen by someone who should not see them.