Scrape An Entire Website
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/9265172/
Asked by Dale Fraser
I'm looking for recommendations for a program to scrape and download an entire corporate website.
The site is powered by a CMS that has stopped working; getting it fixed is expensive, and we are able to redevelop the website.
So I would like to just get the entire website as plain html / css / image content and do minor updates to it as needed until the new site comes along.
Any recommendations?
Accepted answer by p.campbell
Consider HTTrack. It's a free and easy-to-use offline browser utility.
It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.
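For reference, a minimal command-line invocation is sketched below; it assumes HTTrack is already installed and uses website.com purely as a placeholder. The -O switch sets the local output directory, and the trailing filter keeps the crawl on the original domain.

httrack "http://website.com/" -O ./website-mirror "+*.website.com/*"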
Answered by Abhijeet Rastogi
Answered by Tyler McGinnis
None of the above got exactly what I needed (the whole site and all assets). This worked though.
First, follow this tutorial to get wget on OSX.
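If Homebrew happens to be installed, getting wget on OSX is usually just one command (shown here as a convenience, not the only route):

brew install wget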
Then run this:
wget --recursive --html-extension --page-requisites --convert-links http://website.com
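If that command starts wandering onto other hosts, a slightly stricter variant of the same idea (again with website.com as a placeholder) pins the crawl to the original domain and stops it climbing above the starting directory:

wget --recursive --html-extension --page-requisites --convert-links --domains website.com --no-parent http://website.com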
Answered by T0ny lombardi
I know this is super old and I just wanted to put my 2 cents in.
wget -m -k -K -E -l 7 -t 6 -w 5 http://www.website.com
A little clarification regarding each of the switches:
-m
Essentially, this means “mirror the site”, and it recursively grabs pages & images as it spiders through the site. It checks the timestamp, so if you run wget a 2nd time with this switch, it will only update files/pages that are newer than the previous time.
-k
This will modify links in the html to point to local files. If, instead of using things like page2.html as links throughout your site, you were actually using full URLs like http://www.website.com/page2.html, you'll probably need/want this. I turn it on just to be on the safe side; chances are at least 1 link will cause a problem otherwise.
-K
The option above (lowercase k) edits the html. If you want the "untouched" version as well, use this switch and it will save both the changed version and the original. It's just good practice in case something is awry and you want to compare both versions. You can always delete the one you didn't want later.
-E
This saves HTML & CSS with “proper extensions”. Careful with this one – if your site didn't have .html extensions on every page, this will add it. However, if your site already has every file named with something like “.htm” you'll now end up with “.htm.html”.
-l 7
By default, the -m we used above will recurse/spider through the entire site. Usually that's ok. But sometimes your site will have an infinite loop, in which case wget will download forever. Think of the typical website.com/products/jellybeans/sort-by-/name/price/name/price/name/price example. It's somewhat rare nowadays; most sites behave well and won't do this, but to be on the safe side, figure out the most clicks it should possibly take to get from the main page to any real page on the website, pad it a little (it would suck if you used a value of 7 and found out an hour later that your site was 8 levels deep!) and use that number. Of course, if you know your site has a structure that will behave, there's nothing wrong with omitting this and having the comfort of knowing that the 1 hidden page on your site that was 50 levels deep was actually found.
-t 6
If trying to access/download a certain page or file fails, this sets the number of retries before it gives up on that file and moves on. You usually do want it to eventually give up (set it to 0 if you want it to try forever), but you also don't want it to give up if the site was just being wonky for a second or two. I find 6 to be reasonable.
-w 5
This tells wget to wait a few seconds (5 seconds in this case) before grabbing the next file. It's often critical to use something here (at least 1 second). Let me explain. By default, wget will grab pages as fast as it possibly can. This can easily be multiple requests per second, which has the potential to put huge load on the server (particularly if the site is written in PHP, makes MySQL accesses on each request, and doesn't utilize a cache). If the website is on shared hosting, that load can get someone kicked off their host. Even on a VPS it can bring some sites to their knees. And even if the site itself survives, being bombarded with an insane number of requests within a few seconds can look like a DOS attack, which could very well get your IP auto-blocked. If you don't know for certain that the site can handle a massive influx of traffic, use the -w # switch. 5 is usually quite safe. Even 1 is probably ok most of the time. But use something.
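As a readability aside, the same command can also be written with wget's long-form option names; the mapping below assumes a reasonably recent wget, where -E is spelled --adjust-extension (older builds call it --html-extension):

wget --mirror --convert-links --backup-converted --adjust-extension --level=7 --tries=6 --wait=5 http://www.website.com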
Answered by seanbreeden
The best way is to scrape it with wget as suggested in @Abhijeet Rastogi's answer. If you aren't familiar with it, then Blackwidow is a decent scraper. I've used it in the past. http://www.sbl.net/