What's the best way to save a complete webpage on a linux server?

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4769433/

Tags: linux, curl, save, webpage, wget

Asked by Tomas

I need to archive complete pages including any linked images etc. on my linux server. Looking for the best solution. Is there a way to save all assets and then relink them all to work in the same directory?

I've thought about using curl, but I'm unsure of how to do all of this. Also, will I maybe need PHP-DOM?

Is there a way to use firefox on the server and copy the temp files after the address has been loaded or similar?

Any and all input welcome.

Edit:

It seems as though wget is 'not' going to work as the files need to be rendered. I have firefox installed on the server, is there a way to load the url in firefox and then grab the temp files and clear the temp files after?

Answered by Arnaud Le Blanc

wget can do that, for example:

wget -r http://example.com/

This will mirror the whole example.com site.

Some interesting options are:

-D example.com: do not follow links to other domains
--html-extension: renames pages with text/html content-type to .html

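For example, a sketch combining the recursive download with these two options (example.com is just a placeholder for the site being archived):

# -D limits which domains may be followed (it does not enable host spanning -H by itself),
# and --html-extension renames pages served as text/html to end in .html
wget -r -D example.com --html-extension http://example.com/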

Manual: http://www.gnu.org/software/wget/manual/

Answered by meder omuraliev

wget -r http://yoursite.com

Should be sufficient and grab images/media. There are plenty of options you can feed it.

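As a sketch of a fuller invocation with options that commonly come up when mirroring a site (yoursite.com is a placeholder; the wait options are only there to be gentle on the server):

# -p grabs page requisites, -k rewrites links for local browsing,
# and the wait options throttle the crawl a little
wget -r -p -k --wait=1 --random-wait http://yoursite.com/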

Note: I believe neither wget nor any other program supports downloading images specified through CSS - so you may need to do that yourself manually.

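If you do end up fetching CSS-referenced images by hand, a rough sketch along these lines could serve as a starting point (mirror/ and yoursite.com are assumed placeholders, and it naively treats every url() path as relative to the site root):

# list url(...) references in the mirrored CSS files and fetch each one once
find mirror/ -name '*.css' -print0 \
  | xargs -0 grep -hoE 'url\([^)]+\)' \
  | sed -E "s/^url\(['\"]?//; s/['\"]?\)$//" \
  | sort -u \
  | while read -r asset; do
      wget -nc -P mirror/css-assets "http://yoursite.com/$asset"
    done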

Here may be some useful arguments: http://www.linuxjournal.com/content/downloading-entire-web-site-wget

Answered by thkala

If all the content in the web page was static, you could get around this issue with something like wget:

$ wget -r -l 10 -p http://my.web.page.com/

or some variation thereof.

Since you also have dynamic pages, you cannot in general archive such a web page using wget or any simple HTTP client. A proper archive needs to incorporate the contents of the backend database and any server-side scripts. That means that the only way to do this properly is to copy the backing server-side files. That includes at least the HTTP server document root and any database files.

EDIT:

As a work-around, you could modify your webpage so that a suitably privileged user could download all the server-side files, as well as a text-mode dump of the backing database (e.g. an SQL dump). You should take extreme care to avoid opening any security holes through this archiving system.

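As a very rough sketch of what such a server-side backup might look like, assuming an Apache-style document root and a MySQL backend (the paths, user and database name are placeholders):

# archive the document root and take a text-mode dump of the database
tar czf site-files-$(date +%F).tar.gz /var/www/html
mysqldump -u backup_user -p my_database > site-db-$(date +%F).sql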

If you are using a virtual hosting provider, most of them provide some kind of Web interface that allows backing up the whole site. If you use an actual server, there is a large number of backup solutions you could install, including a few Web-based ones for hosted sites.

Answered by SuB

Use the following command:

wget -E -k -p http://yoursite.com

Use -E to adjust extensions. Use -k to convert links so the page loads from your local copy. Use -p to download all objects required by the page.

Please note that this command does not download other pages hyperlinked from the specified page. It means that this command only downloads the objects required to load the specified page properly.

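If some of the page's requisites (images, stylesheets, scripts) are served from other hosts such as a CDN, a common variant adds -H so those can be fetched as well (yoursite.com is again a placeholder):

# -H lets page requisites hosted on other domains be downloaded too
wget -E -H -k -p http://yoursite.com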