PHP file_get_contents returns 403 Forbidden

Question

Asked by absk

I am trying to build a site scraper. I wrote it on my local machine, where it works fine. When I run the same code on my server, I get a 403 Forbidden error. I am using the PHP Simple HTML DOM Parser. The error on the server is:

Warning: file_get_contents(http://example.com/viewProperty.html?id=7715888) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /home/scraping/simple_html_dom.php on line 40

The line of code triggering it is:

$url="http://www.example.com/viewProperty.html?id=".$id;

$html=file_get_html($url);

I have checked php.ini on the server and allow_url_fopen is On. Using curl might be a possible solution, but I need to know where I am going wrong.

Answer 1

Accepted answer by Pekka

This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.

It could be that it blocks PHP scripts to prevent scraping, or that it has blocked your IP because you made too many requests.

You should probably talk to the administrator of the remote server.
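To confirm that the 403 really comes from the remote server, one option is to read the status line of the response yourself. The following is only a minimal sketch, assuming the built-in http stream wrapper is in use; `parse_status_code` and `fetch_with_status` are hypothetical helper names, not part of PHP or Simple HTML DOM.

```php
<?php
// Hypothetical helper: extract the numeric status from response header lines
// such as "HTTP/1.1 403 Forbidden". After redirects there can be several
// status lines; the last one belongs to the final response.
function parse_status_code(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: fetch a URL and return [status, body].
// "ignore_errors" makes file_get_contents() return the body even for
// 4xx/5xx responses instead of failing with a warning.
function fetch_with_status(string $url): array
{
    $context = stream_context_create(['http' => ['ignore_errors' => true]]);
    $body = @file_get_contents($url, false, $context);
    // The http wrapper fills $http_response_header in the calling scope.
    $headers = $http_response_header ?? [];
    return [parse_status_code($headers), $body];
}
```

Calling `fetch_with_status($url)` from both the local machine and the server shows whether the remote site answers the two differently.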

Answer 2

Answered by Ikari

I know it's quite an old thread but thought of sharing some ideas.

If you get no content when accessing a web page, most likely the site does not want you to be able to get it. So how does it tell that a script, not a human, is trying to access the page? Generally, by the User-Agent header in the HTTP request sent to the server.

So to make the website think that the script accessing the page is also a human, you must change the User-Agent header of the request. Most web servers will likely allow your request if you set the User-Agent header to a value used by a common web browser.

Some common browser user agents are listed below:

Chrome: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
Firefox: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0
etc...

$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);

echo file_get_contents("https://www.google.com", false, $context);

This piece of code fakes the user agent and sends the request to https://www.google.com.
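Since the question calls file_get_html rather than file_get_contents, the same idea can be wrapped in a small helper and the context handed on to the parser. This is a minimal sketch: `browser_context` is a hypothetical helper name, and the commented-out call assumes that the Simple HTML DOM version in use accepts a stream context as its third argument, which may not hold for every release.

```php
<?php
// Hypothetical helper: build an HTTP stream context that sends a
// browser-like User-Agent and gives up after $timeout seconds.
function browser_context(string $userAgent, float $timeout = 10.0)
{
    return stream_context_create([
        'http' => [
            // "user_agent" is used when no User-Agent line is given in "header".
            'user_agent' => $userAgent,
            'timeout'    => $timeout,
        ],
    ]);
}

$context = browser_context(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
);

// Assumed usage with Simple HTML DOM (third argument is the context):
// $html = file_get_html($url, false, $context);
```

Setting a timeout alongside the user agent is optional, but it keeps a scraper from hanging on a server that accepts the connection and never answers.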

References:

stream_context_create

Cheers!

Answer 3

Answered by Dejan Marjanovic

You can change it like this in the parser class, from line 35 onward.

function curl_get_contents($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}

function file_get_html()
{
  $dom = new simple_html_dom;
  $args = func_get_args();
  $dom->load(call_user_func_array('curl_get_contents', $args), true);
  return $dom;
}

Have you tried another site?

Answer 4

Answered by Uma Shankar Goel

Server-to-server calls are basically PHP scripts calling another server. Because of this, many remote servers block calls from PHP scripts to prevent their sites from being copied. This can easily be overcome by making your script appear as if the request comes from a regular browser. You can use the following code.

$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);

echo file_get_contents("https://www.google.co.in", false, $context);

Answer 5

Answered by r0adtr1p

Write this in simple_html_dom.php. It worked for me.

function curl_get_contents($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the response body as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    // Run the request once; a second curl_exec() would send it again.
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // The original constructor call was:
    //$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    $dom = new simple_html_dom;
    // Only $url is needed by curl_get_contents(); the other parameters
    // keep the signature compatible with existing callers.
    $dom->load(curl_get_contents($url), true);
    return $dom;
}

Answer 6

Answered by Sergi

It seems that the remote server has some type of blocking. It may be based on the user agent. If that is the case, you can try using curl to simulate a web browser's user agent like this:

$url="http://www.example.com/viewProperty.html?id=".$id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);

Answer 7

Answered by CrookedCreek

I realize this is an old question, but...

I was just setting up my local sandbox on Linux with PHP 7 and ran across this. When scripts are run from the terminal, PHP uses the php.ini for the CLI. I found that the "user_agent" option was commented out there. I uncommented it and set it to a Mozilla user agent, and now it works.
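The same `user_agent` directive can also be set at runtime, which avoids editing php.ini for each environment. This is a minimal sketch; the agent string is only an example value, and the request at the end is left commented out.

```php
<?php
// Example value only; any browser-like string can be used here.
$agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0';

// "user_agent" is the php.ini directive the http:// wrapper falls back to
// for the User-Agent header when the stream context does not set one.
ini_set('user_agent', $agent);

// Read the setting back to confirm it took effect for this script run.
$current = ini_get('user_agent');

// Later HTTP reads in this script would now send that header, e.g.:
// $page = file_get_contents('http://www.example.com/');
```

The setting applies only to the current script run, so it does not change what other scripts on the same server send.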

Answer 8

Answered by Andrea Syd Coi

Did you check the permissions on your file? I set 777 on my file (on localhost, obviously) and that fixed the problem.

Answer 9

Answered by Daniel Renteria

You may also need some additional information in the context to make the website believe that the request comes from a human. What I did was open the website in a browser and copy any extra information that was sent in the HTTP request.

$context = stream_context_create(
    array(
        "http" => array(
            "method" => "GET",
            // Each header must be a single line; lines are joined with \r\n.
            "header" => implode("\r\n", array(
                "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
                "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
                "Accept-Language: es-ES,es;q=0.9,en;q=0.8,it;q=0.7",
                // "Accept-Encoding: gzip, deflate, br" is left out on purpose:
                // file_get_contents() does not decompress the response body.
            )) . "\r\n"
        )
    )
);
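If the Accept-Encoding header copied from the browser is sent as well, the server may answer with a compressed body, which file_get_contents returns as raw bytes. The following is a minimal sketch of undoing gzip compression, assuming the zlib extension is available; `decode_body` is a hypothetical helper name, and it handles only gzip, not deflate or br.

```php
<?php
// Hypothetical helper: decompress the body when the response headers say it
// was gzip-encoded; otherwise return it unchanged.
function decode_body(string $body, array $headers): string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Encoding:\s*gzip\s*$/i', $line)) {
            $decoded = gzdecode($body);
            // Fall back to the raw body if it was not valid gzip data.
            return $decoded === false ? $body : $decoded;
        }
    }
    return $body;
}
```

The header list would typically be the response headers of the request that produced the body. Leaving Accept-Encoding out of the request is the simpler option when compression is not needed.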

Answer 10

Answered by sac

If you use file_get_contents, use the code below:

$context  = stream_context_create(
  array(
    "http" => array(
      "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    )
));

If you use curl, pass only the value, without the "User-Agent:" prefix:

curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36');
