Original question: http://stackoverflow.com/questions/5779623/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): Stack Overflow
How to scrape all content from a website?
Asked by cklingdesigns
I develop websites, and sometimes clients already have a website but need it totally revamped, while most of the content and images need to stay the same. I'm looking for software, even a paid or desktop application, that will easily allow me to enter a URL and scrape all the content to a designated folder on my local machine. Any help would be much appreciated.
Answered by k to the z
HTTrack will work just fine for you. It is an offline browser that will pull down websites. You can configure it as you wish. Obviously it will not pull down PHP, since PHP is server-side code; the only things you can pull down are the HTML, the JavaScript, and any images pushed to the browser.
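For reference, a minimal HTTrack command-line invocation might look like the following; the URL, output directory, and filter pattern are placeholders, and on Windows HTTrack also ships with a GUI (WinHTTrack):

httrack "http://www.example.com/" -O ./mirror "+*.example.com/*" -v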
Answered by John Cartwright
file_put_contents('/some/directory/scrape_content.html', file_get_contents('http://google.com'));
Save your money for charity.
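Note that this one-liner only saves the HTML of a single page; it does not follow links or download images. A slightly safer sketch of the same idea, with the URL and paths as placeholders:

// Fetch one page's HTML; this does not follow links or download assets.
$html = file_get_contents('http://example.com/');
if ($html === false) {
    die('Could not fetch the page');
}
file_put_contents('/some/directory/scrape_content.html', $html);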
Answered by Tony Lukasavage
By content, do you mean the entire page contents? Because you can just "Save As..." the whole page, with most of the included media.
Firefox, in Tools -> Page Info -> Media, includes a listing of every piece of media on the page that you can download.
Answered by Marc B
Don't bother with PHP for something like this. You can use wget to grab an entire site trivially. However, be aware that it won't parse things like CSS for you, so it won't grab any files referenced via (say) background-image: URL('/images/pic.jpg'), but it will snag most everything else for you.
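As a concrete sketch, a typical wget mirroring invocation looks something like this (the URL is a placeholder): --mirror enables recursive download, --page-requisites pulls the images, CSS, and JS each page needs, and --convert-links rewrites links so the copy can be browsed locally:

wget --mirror --page-requisites --convert-links --adjust-extension --no-parent http://example.com/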
Answered by Klaus S.
This class can help you scrape the content: http://simplehtmldom.sourceforge.net/
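A short sketch of how that library is typically used, assuming you have downloaded simple_html_dom.php; the URL and the img selector are placeholders:

include 'simple_html_dom.php';

// Load the remote page into a DOM object.
$html = file_get_html('http://example.com/');

// Print the source URL of every image on the page.
foreach ($html->find('img') as $img) {
    echo $img->src . "\n";
}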
Answered by OguzKaganAslan
You can scrape websites with http://scrapy.org and get the content you want.
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
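Full Scrapy crawls are written as Python spiders, but as a quick sketch its command-line tool can already fetch a single page (the URL is a placeholder; install Scrapy first, e.g. with pip install scrapy):

scrapy fetch --nolog http://example.com/ > page.html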
Answered by jimy
You can achieve this with the browser's Save As option: in Firefox, go to File -> Save Page As, and all the images and JS will be saved in one folder.