Notice: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/2019892/

Date: 2020-08-25 04:42:18  Source: igfitidea

Extract data from website via PHP

php regex curl html-parsing

Asked by Mike

I am trying to create a simple alert app for some friends.

Basically, I want to be able to extract the "price" and "stock availability" data from a web page like the following two:

I have built the e-mail and SMS alert part, but now I want to be able to get the quantity and price out of the web pages (those two or any others) so that I can compare the price and available quantity and alert us to place an order if a product is between some thresholds.

I have tried some regexes (found in some tutorials, but I am way too much of a n00b for this) but haven't managed to get this working. Any good tips or examples?
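
The threshold check described above could be sketched roughly as follows. The function names should_alert and parse_price, the threshold values, and the sample price string are all assumptions made for illustration; they are not part of the original app.

```php
<?php
// Hypothetical helper: decide whether a product is worth an alert.
// The parameter names and the meaning of the thresholds are assumptions.
function should_alert(float $price, int $quantity, float $maxPrice, int $minQuantity): bool
{
    // Alert only when the price is low enough and enough units are in stock.
    return $price <= $maxPrice && $quantity >= $minQuantity;
}

// Scraped prices usually arrive as strings such as "$1,299.95";
// strip everything except digits and the dot before converting.
function parse_price(string $raw): float
{
    return (float) preg_replace('/[^0-9.]/', '', $raw);
}

// Example values only: price "$24.95", 12 in stock, alert below 30.00 with at least 5 units.
if (should_alert(parse_price('$24.95'), 12, 30.00, 5)) {
    echo "Time to order\n";
}
```

Note that parse_price assumes a dot as the decimal separator; a site that formats prices as "24,95" would need different handling.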

Answered by Matteo Riva

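// Fetch the product page; file_get_contents() returns false on failure.
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
if ($content === false) {
    exit("Could not fetch the page\n");
}

// Non-greedy captures keep each match inside its own cell or attribute.
// preg_match() returns 0 when nothing matches, so fall back to 'n/a'.
$price = preg_match('#<tr><th>(.*?)</th> <td><b>price</b></td></tr>#', $content, $match) ? $match[1] : 'n/a';
$in_stock = preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match) ? $match[1] : 'n/a';

echo "Price: $price - Availability: $in_stock\n";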

Answered by troelskn

It's called screen scraping, in case you need to google for it.

I would suggest that you use a DOM parser and XPath expressions instead. Feed the HTML through HTML Tidy first, to ensure that it's valid markup.

For example:

$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html); // requires the tidy extension
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
  // A DOM node cannot be echoed directly; print its text content instead.
  echo $node->nodeValue, "\n";
}

Answered by Mike

Whatever you do: don't use regular expressions to parse HTML, or bad things will happen. Use a parser instead.
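
As a minimal sketch of the parser approach, the snippet below reads a price and a stock count with PHP's built-in DOM extension. The sample markup, the "pricing" class, and the quantity_on_hand field name are assumptions borrowed from the other answers; a real page will differ.

```php
<?php
// Sample markup is an assumption for illustration; real pages will differ.
$html = <<<'HTML'
<table class="pricing">
  <tr><th>price</th><td><b>$24.95</b></td></tr>
</table>
<form>
  <input name="quantity_on_hand" value="12" type="hidden">
</form>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// string(...) yields the text of the first matching node, or '' when nothing matches.
$price = trim($xpath->evaluate('string(//table[@class="pricing"]//td)'));
$inStock = (int) $xpath->evaluate('string(//input[@name="quantity_on_hand"]/@value)');

echo "Price: $price - Availability: $inStock\n";
```

The attributes of the input element appear here in a different order than a fixed regex pattern would expect, yet the XPath query still finds the value. That tolerance to small markup changes is the main reason to prefer a parser.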

Answered by Pekka

You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.

The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).

Answered by Viet

First, this question goes into too much detail. Second, extracting data from a website might not be legitimate. However, I have some hints:

  1. Use Firebug or the Chrome/Safari Inspector to explore the HTML content and the pattern of the information you are interested in.

  2. Test your regex to see if it matches. You may need to do this several times (multi-pass parsing/extraction).

  3. Write a client with cURL or, even simpler, use file_get_contents (note that some hosts disable loading URLs with file_get_contents).
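
The fetching step in hint 3 might look something like the sketch below. The helper name fetch_page, its null-on-failure convention, and the timeout default are assumptions for illustration. It uses cURL when the extension is loaded and falls back to file_get_contents otherwise, which only works for http(s) URLs when allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: fetch a URL with cURL when available,
// otherwise with file_get_contents(). Returns the body, or null on failure.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true, // follow redirects
            CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
            CURLOPT_TIMEOUT        => $timeoutSeconds,
        ]);
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        // Treat transport errors and HTTP error statuses as failures.
        return ($body === false || $status >= 400) ? null : $body;
    }

    // Fallback: needs allow_url_fopen enabled for http(s) URLs.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}
```

A caller would pass the product page URL to fetch_page, check the result for null, and then hand the returned HTML to whichever extraction step it prefers.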

Personally, I'd rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of regex. Why? Because XHTML is not a regular language and XPath is very flexible. You can also learn XSLT to transform it.

Good luck!

Answered by Viet

This is the simplest method to extract data from a website. I found that all of my data is contained within a single tag (<h3> in the code below), so I prepared this one.

<?php
// Requires the PHP Simple HTML DOM Parser library (simple_html_dom.php).
include('simple_html_dom.php');

// Put the URL of the page you want to scrape in $page.
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();

// Load the web page into $html for further operations.
$html->load_file($page);

// Collect every <h3> element; change the selector passed to find() as required.
$headings = array();
foreach ($html->find('h3') as $element) {
    $headings[] = $element;
}

// Each $out is one matched element; echoing it prints its HTML.
foreach ($headings as $out) {
    echo $out;
}
?>