How to read a web page in PHP
Source: http://stackoverflow.com/questions/2259892/ (Stack Overflow, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.)
Asked by user1686
I'm trying to save some web pages to text files using PHP scripts.
How can I load a web page into a file buffer with PHP and remove HTML tags?
Answer by user1686
- The easy way: fopen() or file_get_contents() the URL: fopen("http://google.com/", "r")
- The smart way: Use the cURL library
- The other smart way: http_get() from PHP's http module
- The hard way: Craft an HTTP request and send it with fsockopen() or stream_socket_client()
- The C way: Send an HTTP request using sockets
- The stupid way: call an external tool such as wget or curl through system()
None of these is guaranteed to be available on your server though.
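Since availability varies by server, one option is to check what is usable before fetching. The following is a minimal sketch, not a definitive implementation: the function names availableFetchMethods() and fetchPage() and the 10-second default timeout are assumptions made for illustration. It prefers cURL and falls back to file_get_contents() when allow_url_fopen is enabled.

```php
<?php
// Report which of the fetching approaches are usable on this server.
// (Hypothetical helper name, chosen for this sketch.)
function availableFetchMethods(): array
{
    return [
        'url_fopen' => (bool) ini_get('allow_url_fopen'), // fopen()/file_get_contents() on URLs
        'curl'      => extension_loaded('curl'),          // the cURL library
        'sockets'   => function_exists('fsockopen'),      // hand-crafted HTTP requests
    ];
}

// Fetch a URL with cURL if present, otherwise with the stream wrappers.
// (Hypothetical helper name; the timeout default is an assumption.)
function fetchPage(string $url, int $timeout = 10): string
{
    $methods = availableFetchMethods();

    if ($methods['curl']) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true,  // follow redirects
            CURLOPT_TIMEOUT        => $timeout,
        ]);
        $body = curl_exec($ch);
        if ($body === false) {
            throw new RuntimeException('cURL failed: ' . curl_error($ch));
        }
        return $body;
    }

    if ($methods['url_fopen']) {
        $context = stream_context_create(['http' => ['timeout' => $timeout]]);
        $body = @file_get_contents($url, false, $context);
        if ($body === false) {
            throw new RuntimeException("Could not fetch $url");
        }
        return $body;
    }

    throw new RuntimeException('No supported fetch method is available');
}
```

Calling availableFetchMethods() on its own is a quick way to see which of the options in the list apply to a given host.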
Answer by ghostdog74
One way:
$url = "http://www.brothersoft.com/publisher/xtracomponents.html";
$page = file_get_contents($url);
$outfile = "xtracomponents.html";
file_put_contents($outfile, $page);
The code above is just an example and lacks any(!) error checking and handling.
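A version with basic error checking might look like the sketch below. The function name savePage() and the choice to throw RuntimeException are assumptions for illustration, not part of the original answer; the fetching and saving calls are the same file_get_contents() and file_put_contents() used above.

```php
<?php
// Download $url and write it to $outfile, failing loudly on either step.
// Returns the number of bytes written. (Hypothetical helper name.)
function savePage(string $url, string $outfile): int
{
    // @ suppresses the PHP warning; the failure is reported via the exception
    $page = @file_get_contents($url);
    if ($page === false) {
        $err = error_get_last();
        throw new RuntimeException(
            "Could not read $url: " . ($err['message'] ?? 'unknown error')
        );
    }

    $bytes = @file_put_contents($outfile, $page);
    if ($bytes === false) {
        throw new RuntimeException("Could not write $outfile");
    }

    return $bytes;
}
```

Usage would be along the lines of savePage($url, $outfile) wrapped in a try/catch, so that a failed download does not leave the script continuing with an empty file.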
Answer by Tim Yates
As the other answers have said, either standard PHP stream functions or cURL is your best bet for retrieving the HTML. As for removing the tags, here are a couple of approaches:
Option #1: Use the Tidy extension, if available on your server, to walk through the document tree recursively and return the text from the nodes. Something like this:
function textFromHtml(TidyNode $node) {
    if ($node->isText()) {
        return $node->value;
    } elseif ($node->hasChildren()) {
        $childText = '';
        foreach ($node->child as $child) {
            $childText .= textFromHtml($child);
        }
        return $childText;
    }
    return '';
}
You might want something more sophisticated than that, e.g., a version that replaces <br /> tags (where $node->name == 'br') with newlines, but this will do for a start.
Then, load the text of the HTML into a Tidy object and call your function on the body node. If you have the contents in a string, use:
$tidy = new tidy();
$tidy->parseString($contents);
$text = textFromHtml($tidy->body());
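Putting the two Tidy snippets together, a variant that also handles the <br /> case mentioned above could be sketched as follows. The names textFromHtmlWithBreaks() and htmlToText() are hypothetical, the 'utf8' encoding argument is an assumption about the input, and the code only runs where the Tidy extension is installed.

```php
<?php
// Variant of textFromHtml() that turns <br /> elements into newlines.
// (Hypothetical name; same recursive walk as the original function.)
function textFromHtmlWithBreaks(tidyNode $node): string
{
    if ($node->isText()) {
        return $node->value;
    }
    if ($node->name === 'br') {
        return "\n";
    }
    $text = '';
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $text .= textFromHtmlWithBreaks($child);
        }
    }
    return $text;
}

// Parse an HTML string with Tidy and return the text of its body.
// (Hypothetical wrapper; assumes the input is UTF-8.)
function htmlToText(string $contents): string
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The Tidy extension is not available');
    }
    $tidy = new tidy();
    $tidy->parseString($contents, [], 'utf8');
    $body = $tidy->body();

    return $body === null ? '' : textFromHtmlWithBreaks($body);
}
```

The exact whitespace in the result depends on how Tidy normalizes the document, so the output may still need trimming.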
Option #2: Use regexes to strip everything between < and >. You could (and probably should) develop a more sophisticated regex that, for example, matches only valid HTML start or end tags. Any errors in the syntax of the page, like a stray angle bracket in body text, could mean garbage output if you aren't careful. This is why Tidy is so nice (it is specifically designed to clean up bad pages), but it might not be available.
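A minimal sketch of the regex approach is shown below. The function name stripTagsWithRegex() is hypothetical, and the extra steps (dropping script/style contents, decoding entities, collapsing whitespace) are additions assumed here for usable output rather than something the answer prescribes.

```php
<?php
// Strip HTML tags with regular expressions. (Hypothetical helper name.)
function stripTagsWithRegex(string $html): string
{
    // Drop <script> and <style> elements together with their contents
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);

    // Remove everything between < and >
    $text = preg_replace('/<[^>]*>/', '', $text);

    // Turn entities such as &amp; back into plain characters
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace left behind by the removed markup
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

The stray-bracket problem is easy to reproduce: stripTagsWithRegex('a < b and c > d') returns 'a d', because the text between the two brackets is treated as a tag. PHP's built-in strip_tags() function is a ready-made alternative for simple cases.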
Answer by Kemo
I strongly recommend taking a look at the SimpleHTML DOM class:
SimpleHTML DOM Parser at SourceForge
With it you can search the DOM tree using CSS selectors, as with jQuery's $() function or Prototype's $$() function.
Although it works with file_get_contents() to get the content of a web page, you can also pass it just the HTML, fetched with a cURL class of your own (if you need to log in, etc.).

