PHP: Scrape web page contents
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/584826/
Scrape web page contents
Asked by Sakthivel
I am developing a project in which I want to scrape the contents of a website in the background and extract a limited part of that content. For example, my page has "userid" and "password" fields; using those, I want to access my mail, scrape my inbox contents, and display them on my page.
I have done this using JavaScript alone. But when I click the sign-in button, the URL of my page (http://localhost/web/Login.html) changes to the URL of the site I am scraping (http://mail.in.com/mails/inbox.php?nomail=....). I want to scrape the details without my URL changing.
Answered by givp
Definitely go with PHP Simple HTML DOM Parser. It's fast, easy and super flexible. It basically sticks an entire HTML page in an object then you can access any element from that object.
As in the example on the official site, to get all images and links on the main Google page:
// Load the library (adjust the path to wherever simple_html_dom.php is installed)
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}
// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
Answered by John Slegers
The HTTP Request
First, you make an HTTP request to get the content of the page. There are several ways to do that.
fopen
The most basic way to send an HTTP request is to use fopen. A main advantage is that you can set how many bytes are read at a time, which can be useful when reading very large files. It's not the easiest thing to do correctly, though, and it's not recommended unless you're reading very large files and fear running into memory issues.
$fp = fopen("http://www.4wtech.com/csp/web/Employee/Login.csp", "rb");
if (false === $fp) {
    exit("Failed to open stream to URL");
}
$result = '';
// Read the response in 8192-byte chunks until the end of the stream
while (!feof($fp)) {
    $result .= fread($fp, 8192);
}
fclose($fp);
echo $result;
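When the page is large, the chunked read above can write each chunk straight to a file instead of concatenating everything into one string. The following is a minimal sketch of that idea, not part of the original answer: the helper name `copy_in_chunks` and the 8192-byte chunk size are assumptions, and in-memory streams stand in for the remote page and the target file so the example runs without network access.

```php
<?php
// Hypothetical helper: copy a readable stream to a writable one in
// fixed-size chunks, so at most $chunkSize bytes are held at a time.
function copy_in_chunks($source, $target, int $chunkSize = 8192): int
{
    $total = 0;
    while (!feof($source)) {
        $chunk = fread($source, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing left to read
        }
        $total += fwrite($target, $chunk);
    }
    return $total; // number of bytes written
}

// In-memory streams stand in for a handle returned by fopen() on a URL
// and for a local output file.
$source = fopen('php://memory', 'w+b');
fwrite($source, str_repeat('x', 20000));
rewind($source);

$target = fopen('php://memory', 'w+b');
$copied = copy_in_chunks($source, $target, 8192);

fclose($source);
fclose($target);
echo $copied, " bytes copied\n";
```

In real use, `$source` would be the handle returned by fopen on the URL and `$target` a handle opened on a local file in "wb" mode.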
file_get_contents
The easiest way is to use file_get_contents. It does more or less the same as fopen, but you have fewer options to choose from. A main advantage here is that it requires just one line of code.
$result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
echo $result;
sockets
If you need more control over which headers are sent to the server, you can use sockets, via fsockopen.
// fsockopen takes a host name only; the path belongs in the request line
$fp = fsockopen("www.4wtech.com", 80, $errno, $errstr, 30);
if (!$fp) {
    $result = "$errstr ($errno)<br />\n";
} else {
    $result = '';
    $out = "GET /csp/web/Employee/Login.csp HTTP/1.1\r\n";
    $out .= "Host: www.4wtech.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    // The result contains the raw response, headers included
    while (!feof($fp)) {
        $result .= fgets($fp, 128);
    }
    fclose($fp);
}
echo $result;
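The socket approach returns the raw response, with the status line and headers in front of the body. A minimal sketch of splitting such a string into its parts follows; the helper name `split_http_response` and the hand-written sample response are assumptions for illustration, and the sketch does not handle chunked transfer encoding or repeated headers.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status line,
// headers (lower-cased names) and body.
function split_http_response(string $raw): array
{
    // Headers and body are separated by the first empty line.
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $body = $parts[1] ?? '';

    $statusLine = array_shift($headerLines);
    $headers = array();
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed lines
        }
        list($name, $value) = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }
    return array('status' => $statusLine, 'headers' => $headers, 'body' => $body);
}

// Hand-written sample response, standing in for the socket output.
$raw = "HTTP/1.1 200 OK\r\n"
     . "Content-Type: text/html\r\n"
     . "Connection: close\r\n"
     . "\r\n"
     . "<html><body>Hello</body></html>";

$response = split_http_response($raw);
echo $response['status'], "\n";
echo $response['headers']['content-type'], "\n";
echo $response['body'], "\n";
```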
streams
Alternatively, you can use streams. Streams are similar to sockets and can be used in combination with both fopen and file_get_contents.
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);
$context = stream_context_create($opts);
$result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp', false, $context);
echo $result;
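The same context mechanism can send a POST request, which is what a login form like the one in the question would typically need. The following is a minimal sketch under stated assumptions: the field names `userid` and `password` come from the question, while the helper name `build_post_context`, the field values, the timeout, and the commented-out URL are placeholders.

```php
<?php
// Hypothetical helper: build a stream context that sends the given
// fields as a URL-encoded POST body.
function build_post_context(array $fields)
{
    $body = http_build_query($fields);
    $opts = array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . "Content-Length: " . strlen($body) . "\r\n",
            'content' => $body,
            'timeout' => 30, // seconds; an arbitrary choice
        ),
    );
    return stream_context_create($opts);
}

$context = build_post_context(array('userid' => 'demo', 'password' => 'secret'));

// Placeholder URL; uncomment to actually send the request:
// $result = file_get_contents('http://example.com/login', false, $context);

// Inspect what would be sent, without touching the network.
$options = stream_context_get_options($context);
echo $options['http']['method'], "\n";
echo $options['http']['content'], "\n";
```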
cURL
If your server supports cURL (it usually does), it is recommended to use cURL. A key advantage of cURL is that it relies on a popular C library commonly used in other programming languages. It also provides a convenient way to create request headers, and it auto-parses response headers, with a simple interface in case of errors.
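$defaults = array(
    CURLOPT_URL => "http://www.4wtech.com/csp/web/Employee/Login.csp",
    CURLOPT_HEADER => 0,
    // Return the response as a string instead of printing it directly
    CURLOPT_RETURNTRANSFER => true
);
// Per-request overrides; entries here take precedence over $defaults
$options = array();
$ch = curl_init();
curl_setopt_array($ch, ($options + $defaults));
if (false === ($result = curl_exec($ch))) {
    trigger_error(curl_error($ch));
}
curl_close($ch);
echo $result;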
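The `$options + $defaults` expression relies on PHP's array union operator: keys present in the left operand win, and keys that exist only on the right are added. A minimal sketch of a reusable fetch function built on that idea follows, assuming the cURL extension is loaded; `curl_options` and `http_get` are hypothetical names, and the timeout values and example URL are placeholders.

```php
<?php
// Hypothetical helper: merge per-request overrides with defaults.
function curl_options(string $url, array $overrides = array()): array
{
    $defaults = array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => 0,
        CURLOPT_RETURNTRANSFER => true, // return the body as a string
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 30,   // seconds; an arbitrary choice
    );
    // Array union: entries in $overrides take precedence.
    return $overrides + $defaults;
}

// Hypothetical helper: fetch a URL, throwing on transport errors.
function http_get(string $url, array $overrides = array()): string
{
    $ch = curl_init();
    curl_setopt_array($ch, curl_options($url, $overrides));
    $result = curl_exec($ch);
    if ($result === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $result;
}

// Building the options does not touch the network.
$options = curl_options('http://example.com/', array(CURLOPT_TIMEOUT => 5));
echo $options[CURLOPT_TIMEOUT], "\n";

// Placeholder URL; uncomment to actually send the request:
// $html = http_get('http://example.com/');
```

Throwing an exception instead of calling trigger_error is a design choice of this sketch; it lets the caller decide how to react to a failed request.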
Libraries
Alternatively, you can use one of many PHP libraries. I wouldn't recommend using a library, though, as it's likely to be overkill. In most cases, you're better off writing your own HTTP class using cURL under the hood.
The HTML parsing
PHP has a convenient way to load any HTML into a DOMDocument.
$pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
$doc = new DOMDocument();
$doc->loadHTML($pagecontent);
echo $doc->saveHTML();
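Once the page is loaded into a DOMDocument, DOMXPath can pull out specific elements. The following is a minimal sketch that collects link targets; the helper name `extract_links` and the inline sample markup (a stand-in for `$pagecontent`) are assumptions for illustration. It silences libxml warnings during parsing, since tags such as `<nav>` would otherwise produce them.

```php
<?php
// Hypothetical helper: return the href of every <a> element that has one.
function extract_links(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Inline sample markup, standing in for a fetched page.
$html = '<html><body><nav><a href="/inbox">Inbox</a></nav>'
      . '<p>See <a href="http://example.com/help">help</a>.</p>'
      . '<a name="top">no href</a></body></html>';

$links = extract_links($html);
print_r($links);
```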
Unfortunately, PHP support for HTML5 is limited. If you run into errors trying to parse your page content, consider using a third-party library. For that, I can recommend Masterminds/html5-php. Parsing an HTML file with this library is very similar to parsing an HTML file with DOMDocument.
use Masterminds\HTML5;
$pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
$html5 = new HTML5();
// Parse the fetched page content into a DOM
$dom = $html5->loadHTML($pagecontent);
echo $html5->saveHTML($dom);
Alternatively, you can use, e.g., my library PHPPowertools/DOM-Query. Under the hood, it uses a customized version of Masterminds/html5-php to parse an HTML5 string into a DomDocument, and symfony/DomCrawler to convert CSS selectors to XPath selectors. It always uses the same DomDocument, even when passing one object to another, to ensure decent performance.
namespace PowerTools;
// Get file content
$pagecontent = file_get_contents( 'http://www.4wtech.com/csp/web/Employee/Login.csp' );
// Define your DOMCrawler based on file string
$H = new DOM_Query( $pagecontent );
// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query( $H->select('body') );
// Passing a string (CSS selector)
$s = $H->select( 'div.foo' );
// Passing an element object (DOM Element)
$s = $H->select( $documentBody );
// Passing a DOM Query object
$s = $H->select( $H->select('p + p') );
// Select the body tag
$body = $H->select('body');
// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');
// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');
// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
    return $i . " - " . $val->attr('class');
});
// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');
// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');
// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));
// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});
// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();
// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');
// Wrap the site's footer within two new elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');
Answered by cruizer
You can use the cURL extension of PHP to make HTTP requests to another website from within your PHP page script. See the documentation here.
Of course, the downside here is that your site will respond slowly, because you will have to scrape the external website before you can present the full page/output to your user.
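One common way to soften that slowdown, not mentioned in the original answer, is to cache the scraped result for a while so that most page loads skip the external request. A minimal sketch, assuming a file-based cache: the helper name `cached_fetch`, the 300-second lifetime, and the temp-file location are all placeholders, and a closure stands in for the slow cURL call.

```php
<?php
// Hypothetical helper: return cached content if it is younger than
// $ttl seconds, otherwise call $fetch and store its result.
function cached_fetch(string $cacheFile, int $ttl, callable $fetch): string
{
    clearstatcache(true, $cacheFile); // avoid stale file metadata
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile); // still fresh
    }
    $content = $fetch();
    file_put_contents($cacheFile, $content, LOCK_EX);
    return $content;
}

$cacheFile = sys_get_temp_dir() . '/scrape_cache_' . uniqid() . '.html';
$calls = 0;
$fetch = function () use (&$calls) {
    $calls++;
    // Stand-in for a slow request to the external site.
    return '<html>inbox</html>';
};

$first  = cached_fetch($cacheFile, 300, $fetch); // runs $fetch
$second = cached_fetch($cacheFile, 300, $fetch); // served from the cache
echo $calls, "\n";

unlink($cacheFile); // clean up the demo cache file
```

Whether caching is acceptable depends on how fresh the scraped data must be; an inbox view, for instance, may only tolerate a short lifetime.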
Answered by Zee Rottis
Have you tried OutWit Hub? It's a whole scraping environment. You can let it try to guess the structure or develop your own scrapers. I really suggest you have a look at it. It made my life much simpler. ZR

