PHP: str_get_html does not load a valid HTML string
Original source: http://stackoverflow.com/questions/14172467/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
str_get_html is not loading a valid html string
Asked by Dani
I receive an html string using curl:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html_string = curl_exec($ch);
When I echo it, I see perfectly good HTML, which is what I need for parsing.
But when I pass this string to the HTML DOM Parser function str_get_html($html_string), it does not load it (the call returns false).
I tried saving it to a file and opening that file with file_get_html, but the same thing occurs.
What can be the cause of this? As I said, the html looks perfectly fine when I echo it.
Thanks a lot.
The code itself:
$html = file_get_html("http://www.bgu.co.il/tremp.aspx");
$v = $html->find('input[id=__VIEWSTATE]');
$viewState = $v[0]->attr['value'];
$e = $html->find('input[id=__EVENTVALIDATION]');
$event = $e[0]->attr['value'];
$html->clear();
unset($html);
$body = " A_STRING_THAT_CONTAINS_SOME_DATA ";
$ch = curl_init("http://www.bgu.co.il/tremp.aspx");
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html_string = curl_exec($ch);
$file_handle = fopen("file.txt", "w");
fwrite($file_handle, $html_string);
fclose($file_handle);
curl_close($ch);
$html = str_get_html($html_string);
Answered by twxia
The page your curl request fetches seems to contain many elements (it is a large file).
I was parsing a string (file) as large as yours and ran into the same problem.
After reading the source code, I found the cause, and the fix below works for me.
simple_html_dom.php limits the size of the string it will read:
// get html dom from string
function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
if (empty($str) || strlen($str) > MAX_FILE_SIZE)
{
$dom->clear();
return false;
}
$dom->load($str, $lowercase, $stripRN);
return $dom;
}
You need to change the default size below (it is defined at the top of simple_html_dom.php).
Maybe change it to 100000000? It's up to you.
define('MAX_FILE_SIZE', 6000000);
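To see which of the two checks in the quoted guard is responsible for a false return, the same conditions can be repeated in a small diagnostic. This is only a sketch: diagnose_html_input and the limit value used here are hypothetical and are not part of simple_html_dom.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: repeats the two checks
// from str_get_html() and names the one that would make it return false.
function diagnose_html_input(string $str, int $maxSize): string
{
    if (empty($str)) {
        return 'empty input';
    }
    if (strlen($str) > $maxSize) {
        return 'input exceeds limit';
    }
    return 'ok';
}

// Assumed values, for illustration only.
$limit = 16;
echo diagnose_html_input('<p>short</p>', $limit), PHP_EOL;
echo diagnose_html_input(str_repeat('<p>x</p>', 10), $limit), PHP_EOL;
```

In a real script, one would presumably pass the curl response ($html_string) and the library's MAX_FILE_SIZE constant after including simple_html_dom.php, then compare the reported reason against strlen($html_string).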
Answered by florian h
Did you check whether the HTML is encoded in a way that HTML DOM Parser doesn't expect? For example, with HTML entities such as &lt;html&gt; instead of <html>? That would still be displayed as correct HTML in your browser but wouldn't parse.
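One way to test this idea is to look for entity-escaped tags in the response and decode them before parsing. The sketch below uses PHP's built-in html_entity_decode; the helper looks_entity_encoded and the sample string are hypothetical, and the heuristic is deliberately rough.

```php
<?php
// Hypothetical helper: guesses whether markup arrived entity-escaped,
// i.e. it contains escaped tags such as &lt;html&gt; but no literal tags.
function looks_entity_encoded(string $html): bool
{
    return preg_match('/&lt;\/?[a-z][^&]*&gt;/i', $html) === 1
        && preg_match('/<[a-z!\/]/i', $html) === 0;
}

// Sample input, assumed for illustration.
$escaped = '&lt;html&gt;&lt;body&gt;Hi&lt;/body&gt;&lt;/html&gt;';
if (looks_entity_encoded($escaped)) {
    // Turn &lt; and &gt; back into literal angle brackets.
    $escaped = html_entity_decode($escaped, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
echo $escaped, PHP_EOL;
```

If the check never fires on the real response, entity encoding is probably not the cause and the size limit is the more likely explanation.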
Answered by FerCa
I assume that you are using curl + str_get_html, instead of simply calling file_get_html with the URL, because of the POST parameters you need to send.
You can use the W3C validator (http://validator.w3.org/#validate_by_input+with_options) to validate the returned HTML. Once you are sure the result is 100% valid HTML, you can report a bug here: http://sourceforge.net/p/simplehtmldom/bugs/.

