How to save a "complete webpage", rather than just the basic HTML, using Python
Disclaimer: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/14516590/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
How to save "complete webpage" not just basic html using Python
Asked by
I am using the following code to save a webpage with Python:
import urllib
import sys
from bs4 import BeautifulSoup

url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
# downloads only the raw HTML of the page into test.html (Python 2 urllib)
f = urllib.urlretrieve(url, 'test.html')
Problem: This code saves the HTML as basic HTML only, without the JavaScript, images, etc. I want to save the webpage complete (like the option we have in a browser).
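For reference, the snippet above uses the Python 2 urllib API; a rough Python 3 equivalent, which has the same limitation, would look like this:
import urllib.request

url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
# still downloads only the raw HTML, without scripts, images or stylesheets
urllib.request.urlretrieve(url, 'test.html')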
Update: I am now using the following code to save all the JS/image/CSS files of the webpage so that it can be saved as a complete webpage, but my output HTML is still being saved as basic HTML:
import pycurl
import StringIO  # Python 2 in-memory buffer

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html")
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)  # collect the response body in the buffer
c.setopt(pycurl.FOLLOWLOCATION, 1)       # follow HTTP redirects
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()

html = b.getvalue()
#print html

# write the downloaded HTML to disk
fh = open("file.html", "w")
fh.write(html)
fh.close()
Accepted answer by root
Try emulating your browser with selenium. This script will pop up the Save As dialog for the webpage. You will still have to figure out how to emulate pressing Enter for the download to start, as the file dialog is out of selenium's reach (how you do it is also OS dependent).
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

br = webdriver.Firefox()
br.get('http://www.google.com/')

# send Ctrl+S to the browser window to open the Save As dialog
save_me = ActionChains(br).key_down(Keys.CONTROL)\
          .key_down('s').key_up(Keys.CONTROL).key_up('s')
save_me.perform()
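If you want to confirm that dialog automatically, one rough OS-level option is to send an Enter keystroke, for example with pyautogui (a sketch only; it assumes the pyautogui package is installed and the Save As dialog has focus):
import time
import pyautogui  # OS-level keyboard control, outside the browser

time.sleep(2)             # give the Save As dialog time to appear
pyautogui.press('enter')  # accept the default filename so the download starts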
Also, I think following @Amber's suggestion of grabbing the linked resources may be simpler, and thus a better solution. Still, I think using selenium is a good starting point, as br.page_source will get you the entire DOM along with the dynamic content generated by javascript.
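As a rough Python 3 sketch of that combined approach (the file names and asset handling here are illustrative only): render the page with selenium, save page_source, and fetch the linked images/scripts/stylesheets with requests and BeautifulSoup.
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from selenium import webdriver

url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
br = webdriver.Firefox()
br.get(url)
html = br.page_source  # full DOM, including javascript-generated content
br.quit()

with open('page.html', 'w', encoding='utf-8') as fh:
    fh.write(html)

# grab the linked resources (the @Amber approach): images, scripts and stylesheets
os.makedirs('assets', exist_ok=True)
soup = BeautifulSoup(html, 'html.parser')
tags = (soup.find_all('img', src=True)
        + soup.find_all('script', src=True)
        + soup.find_all('link', href=True))
for tag in tags:
    resource_url = urljoin(url, tag.get('src') or tag.get('href'))
    if not resource_url.startswith(('http://', 'https://')):
        continue  # skip data:, file: and similar schemes
    name = os.path.basename(resource_url.split('?')[0]) or 'resource'
    try:
        resp = requests.get(resource_url, timeout=10)
        with open(os.path.join('assets', name), 'wb') as out:
            out.write(resp.content)
    except requests.RequestException:
        pass  # best effort; ignore assets that fail to download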
Answer by rajatomar788
You can easily do that with the simple Python library pywebcopy.
For the current version (5.0.1):
from pywebcopy import save_webpage

url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'
kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

# downloads the page together with its html/css/js assets into download_folder
save_webpage(url, download_folder, **kwargs)
You will have the HTML, CSS and JS all in your download_folder, working completely like the original site.
Answer by Rich Lysakowski PhD
To get the script above by @rajatomar788 to run, I had to install all of the following packages first:
pip install pywebcopy
pip install pyquery
pip install w3lib
pip install parse
pip install lxml
After that it worked with a few errors, but I did get the folder filled with the files that make up the webpage.
webpage - INFO - Starting save_assets Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
webpage - Level 100 - Queueing download of <89> asset files.
Exception in thread <Element(LinkTag, file:///++resource++images/favicon2.ico)>:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\threading.py", line 917, in _bootstrap_inner
self.run()
File "C:\ProgramData\Anaconda3\lib\threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 312, in run
super(LinkTag, self).run()
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 58, in run
self.download_file()
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 107, in download_file
req = SESSION.get(url, stream=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\configs.py", line 244, in get
return super(AccessAwareSession, self).get(url, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 640, in send
adapter = self.get_adapter(url=request.url)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 731, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///++resource++images/favicon2.ico'
webpage - INFO - Starting save_html Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
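The InvalidSchema error above comes from an asset URL with a file:// scheme, which requests cannot fetch over the network. If you download assets yourself, a small guard like this avoids it (a sketch; the helper name is just illustrative):
from urllib.parse import urlparse
import requests

def fetch_asset(url, timeout=10):
    # requests only has connection adapters for http/https, so skip file:, data:, etc.
    if urlparse(url).scheme not in ('http', 'https'):
        return None  # e.g. 'file:///++resource++images/favicon2.ico'
    return requests.get(url, stream=True, timeout=timeout)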

