Why can't PHP cURL fetch the content of a web page?

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/814149/


curl not working for getting a web page content, why?

php, curl, screen-scraping, web-scraping

Asked by Sotheby it

I am using a cURL script to go to a link and get its content for further manipulation. Here are the link and the cURL script:

<?php
$url = 'http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi?serviceName=WebNSOR&templateName=detail.htm&requestingHandler=WebNSORDetailHandler&ID=368343543';

// cURL script to get the content of the given URL
$ch = curl_init();

// set the target URL
curl_setopt($ch, CURLOPT_URL, $url);

// request as if Firefox
curl_setopt($ch, CURLOPT_HTTPHEADER, array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15"));

curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
?>

But the website is not accepting the request when it comes from the script: the result is a "user exception" page. If I simply paste the URL into a browser, the page opens perfectly fine.

Please help; what am I doing wrong here?

Thanks and regards


Answered by Alan Storm

I ran the following program/script and the page was downloaded correctly. This most likely means the server you're running your script from can't reach the server at "criminaljustice.state.ny.us". Either your server is misconfigured, or their server is explicitly blocking you, which is a common result of aggressive screen scraping.

<?php
$url = 'http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi?serviceName=WebNSOR&templateName=detail.htm&requestingHandler=WebNSORDetailHandler&ID=368343543';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") ); 
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;

Additional troubleshooting tip: if you have shell access to the machine your PHP script runs from, run the following command:

curl -I 'http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi?serviceName=WebNSOR&templateName=detail.htm&requestingHandler=WebNSORDetailHandler&ID=368343543'

This will output the response headers, which may contain some clue as to why your request is failing.

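If you don't have shell access, a similar check can be done from PHP itself. Below is a minimal sketch (reusing the $url from the question) that asks cURL for the headers only and reads the HTTP status code via curl_getinfo(); the options used are standard PHP cURL constants, not anything taken from the original answer:

<?php
$url = 'http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi?serviceName=WebNSOR&templateName=detail.htm&requestingHandler=WebNSORDetailHandler&ID=368343543';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);         // include response headers in the returned string
curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD-style request: headers only, no body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the result instead of printing it

$headers = curl_exec($ch);

if ($headers === false) {
    // a transport-level failure (DNS, timeout, connection refused, ...)
    echo 'cURL error: ' . curl_error($ch) . "\n";
} else {
    echo 'HTTP status: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";
    echo $headers;
}
curl_close($ch);
?>

A transport error points at connectivity from your server; a 403 or similar status points at the remote side blocking the request.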

Answered by Sotheby it

I had the same issue, which ended up being the CURLOPT_FOLLOWLOCATION option not being set. I thought cURL would set it to true by default, but apparently not. Once I set it, the full site came through with no problem.
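For reference, redirect-following is indeed off by default in PHP's cURL. A minimal sketch of enabling it (the CURLOPT_MAXREDIRS cap is just a common safety limit, not part of the original answer):

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP 3xx redirects (disabled by default)
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // give up after a handful of redirects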

Answered by xkcd150

For the user agent, I think you want to use the CURLOPT_USERAGENT constant:

curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");

Answered by alex

Is the user agent meant to be in an array like that? I haven't seen it done like that before.


Try just using a plain string, i.e.


curl_setopt($ch, CURLOPT_HTTPHEADER, 'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15');