How to save a "complete webpage", rather than just the basic HTML, using Python
Disclaimer: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/14516590/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
How to save "complete webpage" not just basic html using Python
Asked by
I am using the following code to save a webpage with Python:
import urllib
import sys
from bs4 import BeautifulSoup

url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
# downloads only the raw HTML of the page into test.html (Python 2 urllib)
f = urllib.urlretrieve(url, 'test.html')
Problem: This code saves the HTML as basic HTML only, without the JavaScript, images, etc. I want to save the webpage complete (like the option we have in a browser).
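For reference, the snippet above uses the Python 2 urllib API; a rough Python 3 equivalent, which has the same limitation, would look like this:
import urllib.request

url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
# still downloads only the raw HTML, without scripts, images or stylesheets
urllib.request.urlretrieve(url, 'test.html')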
Update: I am now using the following code to save all the JS/image/CSS files of the webpage so that it can be saved as a complete webpage, but my output HTML is still being saved as basic HTML:
import pycurl
import StringIO  # Python 2 in-memory buffer

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html")
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)  # collect the response body in the buffer
c.setopt(pycurl.FOLLOWLOCATION, 1)       # follow HTTP redirects
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()

html = b.getvalue()
#print html

# write the downloaded HTML to disk
fh = open("file.html", "w")
fh.write(html)
fh.close()
Accepted answer by root
Try emulating your browser with selenium. This script will pop up the Save As dialog for the webpage. You will still have to figure out how to emulate pressing Enter for the download to start, as the file dialog is out of selenium's reach (how you do it is also OS dependent).
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

br = webdriver.Firefox()
br.get('http://www.google.com/')

# send Ctrl+S to the browser window to open the Save As dialog
save_me = ActionChains(br).key_down(Keys.CONTROL)\
          .key_down('s').key_up(Keys.CONTROL).key_up('s')
save_me.perform()
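If you want to confirm that dialog automatically, one rough OS-level option is to send an Enter keystroke, for example with pyautogui (a sketch only; it assumes the pyautogui package is installed and the Save As dialog has focus):
import time
import pyautogui  # OS-level keyboard control, outside the browser

time.sleep(2)             # give the Save As dialog time to appear
pyautogui.press('enter')  # accept the default filename so the download starts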
Also, I think following @Amber's suggestion of grabbing the linked resources may be simpler, and thus a better solution. Still, I think using selenium is a good starting point, as br.page_source will get you the entire DOM along with the dynamic content generated by javascript.
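As a rough Python 3 sketch of that combined approach (the file names and asset handling here are illustrative only): render the page with selenium, save page_source, and fetch the linked images/scripts/stylesheets with requests and BeautifulSoup.
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from selenium import webdriver

url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
br = webdriver.Firefox()
br.get(url)
html = br.page_source  # full DOM, including javascript-generated content
br.quit()

with open('page.html', 'w', encoding='utf-8') as fh:
    fh.write(html)

# grab the linked resources (the @Amber approach): images, scripts and stylesheets
os.makedirs('assets', exist_ok=True)
soup = BeautifulSoup(html, 'html.parser')
tags = (soup.find_all('img', src=True)
        + soup.find_all('script', src=True)
        + soup.find_all('link', href=True))
for tag in tags:
    resource_url = urljoin(url, tag.get('src') or tag.get('href'))
    if not resource_url.startswith(('http://', 'https://')):
        continue  # skip data:, file: and similar schemes
    name = os.path.basename(resource_url.split('?')[0]) or 'resource'
    try:
        resp = requests.get(resource_url, timeout=10)
        with open(os.path.join('assets', name), 'wb') as out:
            out.write(resp.content)
    except requests.RequestException:
        pass  # best effort; ignore assets that fail to download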
Answer by rajatomar788
You can easily do that with the simple Python library pywebcopy.
For the current version (5.0.1):
from pywebcopy import save_webpage

url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'
kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

# downloads the page together with its html/css/js assets into download_folder
save_webpage(url, download_folder, **kwargs)
You will have the HTML, CSS and JS all in your download_folder, working completely like the original site.
Answer by Rich Lysakowski PhD
To get the script above by @rajatomar788 to run, I had to install all of the following packages first:
pip install pywebcopy
pip install pyquery
pip install w3lib
pip install parse
pip install lxml
After that it worked with a few errors, but I did get the folder filled with the files that make up the webpage.
webpage - INFO - Starting save_assets Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
webpage - Level 100 - Queueing download of <89> asset files.
Exception in thread <Element(LinkTag, file:///++resource++images/favicon2.ico)>:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\threading.py", line 917, in _bootstrap_inner
self.run()
File "C:\ProgramData\Anaconda3\lib\threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 312, in run
super(LinkTag, self).run()
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 58, in run
self.download_file()
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 107, in download_file
req = SESSION.get(url, stream=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\configs.py", line 244, in get
return super(AccessAwareSession, self).get(url, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 640, in send
adapter = self.get_adapter(url=request.url)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 731, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///++resource++images/favicon2.ico'
webpage - INFO - Starting save_html Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
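The InvalidSchema error above comes from an asset URL with a file:// scheme, which requests cannot fetch over the network. If you download assets yourself, a small guard like this avoids it (a sketch; the helper name is just illustrative):
from urllib.parse import urlparse
import requests

def fetch_asset(url, timeout=10):
    # requests only has connection adapters for http/https, so skip file:, data:, etc.
    if urlparse(url).scheme not in ('http', 'https'):
        return None  # e.g. 'file:///++resource++images/favicon2.ico'
    return requests.get(url, stream=True, timeout=timeout)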

