Original question: http://stackoverflow.com/questions/5779623/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): Stack Overflow
How to scrape all content from a website?
Asked by cklingdesigns
I develop websites, and sometimes clients already have a website but need it totally revamped, while most of the content and images need to stay the same. I'm looking for software, even a paid or desktop application, that will easily allow me to enter a URL and scrape all the content to a designated folder on my local machine. Any help would be much appreciated.
Answered by k to the z
HTTrack will work just fine for you. It is an offline browser that will pull down websites. You can configure it as you wish. Obviously it will not pull down PHP, since PHP is server-side code; the only things you can pull down are the HTML, the JavaScript, and any images pushed to the browser.
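For reference, a minimal HTTrack command-line invocation might look like the following; the URL, output directory, and filter pattern are placeholders, and on Windows HTTrack also ships with a GUI (WinHTTrack):

httrack "http://www.example.com/" -O ./mirror "+*.example.com/*" -v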
Answered by John Cartwright
file_put_contents('/some/directory/scrape_content.html', file_get_contents('http://google.com'));
Save your money for charity.
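Note that this one-liner only saves the HTML of a single page; it does not follow links or download images. A slightly safer sketch of the same idea, with the URL and paths as placeholders:

// Fetch one page's HTML; this does not follow links or download assets.
$html = file_get_contents('http://example.com/');
if ($html === false) {
    die('Could not fetch the page');
}
file_put_contents('/some/directory/scrape_content.html', $html);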
Answered by Tony Lukasavage
By content, do you mean the entire page contents? Because you can just "Save As..." the whole page, with most of the included media.
Firefox, in Tools -> Page Info -> Media, includes a listing of every piece of media on the page that you can download.
Answered by Marc B
Don't bother with PHP for something like this. You can use wget to grab an entire site trivially. However, be aware that it won't parse things like CSS for you, so it won't grab any files referenced via (say) background-image: URL('/images/pic.jpg'), but it will snag most everything else for you.
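As a concrete sketch, a typical wget mirroring invocation looks something like this (the URL is a placeholder): --mirror enables recursive download, --page-requisites pulls the images, CSS, and JS each page needs, and --convert-links rewrites links so the copy can be browsed locally:

wget --mirror --page-requisites --convert-links --adjust-extension --no-parent http://example.com/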
Answered by Klaus S.
This class can help you scrape the content: http://simplehtmldom.sourceforge.net/
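A short sketch of how that library is typically used, assuming you have downloaded simple_html_dom.php; the URL and the img selector are placeholders:

include 'simple_html_dom.php';

// Load the remote page into a DOM object.
$html = file_get_html('http://example.com/');

// Print the source URL of every image on the page.
foreach ($html->find('img') as $img) {
    echo $img->src . "\n";
}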
Answered by OguzKaganAslan
You can scrape websites with http://scrapy.org and get the content you want.
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
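Full Scrapy crawls are written as Python spiders, but as a quick sketch its command-line tool can already fetch a single page (the URL is a placeholder; install Scrapy first, e.g. with pip install scrapy):

scrapy fetch --nolog http://example.com/ > page.html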
Answered by jimy
You can achieve this with the browser's Save As option: in Firefox, go to File -> Save Page As, and all the images and JS will be saved in one folder.