PHP: how to extract links and titles from a .html page?

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4423272/

Date: 2020-08-25 13:02:28  Source: igfitidea

Tags: php, html, string, hyperlink, web-crawler

Asked by Toni Michel Caubet

For my website, I'd like to add a new feature.

I would like users to be able to upload their bookmarks backup file (from any browser, if possible) so I can import it into their profile and they don't have to insert all the bookmarks manually...

The only part I'm missing is extracting the title and URL from the uploaded file. Can anyone give me a clue where to start or what to read?

I used the search option, and "How to extract data from a raw HTML file?" is the question most closely related to mine, but it doesn't cover this.

I really don't mind whether it uses jQuery or PHP.

Thank you very much.

Answer by Toni Michel Caubet

Thank you everyone, I GOT IT!

The final code:

$html = file_get_contents('bookmarks.html');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ suppresses the warnings that loadHTML()
//emits when the $html string contains malformed markup.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their text and URLs
foreach ($links as $link){
    //Show the anchor text, then the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}

This shows the anchor text and the href of every link in a .html file.

Again, thanks a lot.
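For the original goal of saving bookmarks to a profile, it may be handier to collect the pairs in an array than to echo them. The following is only a minimal sketch of that idea: the function name `extractBookmarks`, the array keys and the sample markup are assumptions for illustration, not part of the answer above. It also uses `libxml_use_internal_errors()` in place of the `@` operator to keep parse warnings quiet.

```php
<?php
// Hypothetical helper (not from the original answer): returns title/URL
// pairs instead of echoing them, so they can be stored afterwards.
function extractBookmarks(string $html): array
{
    $dom = new DOMDocument;

    // Collect libxml parse warnings internally instead of silencing with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bookmarks = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = trim($link->getAttribute('href'));
        if ($href === '') {
            continue; // skip anchors without a target
        }
        $bookmarks[] = ['title' => trim($link->nodeValue), 'url' => $href];
    }
    return $bookmarks;
}

// Sample input loosely modelled on a browser bookmarks export (assumed format).
$sample = '<DL><p><DT><A HREF="https://www.php.net/">PHP</A>'
        . '<DT><A HREF="https://example.com/docs">Docs</A></DL>';

print_r(extractBookmarks($sample));
```

In a real upload handler the string would come from `file_get_contents()` on the uploaded file, as in the code above.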

Answer by Matthew

This is probably sufficient:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node)
{
  echo $node->nodeValue.': '.$node->getAttribute("href")."\n";
}

Answer by Simon Groenewolt

Assuming the stored links are in an HTML file, the best solution is probably to use an HTML parser such as PHP Simple HTML DOM Parser (never tried it myself). (The other option is to search using basic string search or regexp, and you should probably never use regexp to parse HTML.)

After reading the HTML file with the parser, use its functions to find the a tags.

From the tutorial:

// Find all links ($html is assumed to hold the document already loaded by the parser)
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

Answer by Adrian Cid Almaguer

This is an example. In your case, you can load the file like this:

$content = file_get_contents('bookmarks.html');

Run this:

<?php

$content = '<html>

<title>Random Website I am Crawling</title>

<body>

Click <a href="http://clicklink.com">here</a> for foobar

Another site is http://foobar.com

</body>

</html>';

// Single-quoted strings keep PHP from interpolating $_ inside the character classes
$regex = '((https?|ftp)\:\/\/)?'; // SCHEME
$regex .= '([a-z0-9+!*(),;?&=$_.-]+(\:[a-z0-9+!*(),;?&=$_.-]+)?@)?'; // User and Pass
$regex .= '([a-z0-9-.]*)\.([a-z]{2,4})'; // Host or IP
$regex .= '(\:[0-9]{2,5})?'; // Port
$regex .= '(\/([a-z0-9+$_-]\.?)+)*\/?'; // Path
$regex .= '(\?[a-z+&$_.-][a-z0-9;:@&%=+\/$_.-]*)?'; // GET Query
$regex .= '(#[a-z_.-][a-z0-9+$_.-]*)?'; // Anchor


$matches = array(); //create array
$pattern = "/$regex/";

preg_match_all($pattern, $content, $matches); 

print_r(array_values(array_unique($matches[0])));
echo "<br><br>";
echo implode("<br>", array_values(array_unique($matches[0])));

Output:

Array
(
    [0] => http://clicklink.com
    [1] => http://foobar.com
)

http://clicklink.com

http://foobar.com


Answer by Raghavendra

$html = file_get_contents('your file path');

$dom = new DOMDocument;
@$dom->loadHTML($html);

$styles = $dom->getElementsByTagName('link');
$links = $dom->getElementsByTagName('a');
$scripts = $dom->getElementsByTagName('script');

foreach ($styles as $style) {
    if ($style->getAttribute('href') != "#") {
        echo $style->getAttribute('href');
        echo '<br>';
    }
}

foreach ($links as $link) {
    if ($link->getAttribute('href') != "#") {
        echo $link->getAttribute('href');
        echo '<br>';
    }
}

foreach ($scripts as $script) {
    echo $script->getAttribute('src');
    echo '<br>';
}

Answer by Tom Gould

I wanted to create a CSV of link paths and their text from HTML pages so I could rip menus etc. from sites.

In this example you specify the domain you are interested in, so you don't get off-site links, and it then produces one CSV per document.

/**
 * Extracts links to the given domain from the files and creates CSVs of the links
 */


$LinkExtractor = new LinkExtractor('https://www.example.co.uk');

$LinkExtractor->extract(__DIR__ . '/hamburger.htm');
$LinkExtractor->extract(__DIR__ . '/navbar.htm');
$LinkExtractor->extract(__DIR__ . '/footer.htm');

class LinkExtractor {
    public $domain;

    public function __construct($domain) {
      $this->domain = $domain;
    }

    public function extract($file) {
        $html = file_get_contents($file);
        //Create a new DOM document
        $dom = new DOMDocument;

        //Parse the HTML. The @ suppresses the warnings that loadHTML()
        //emits when the $html string contains malformed markup.
        @$dom->loadHTML($html);

        //Get all links. You could also use any other tag name here,
        //like 'img' or 'table', to extract other tags.
        $links = $dom->getElementsByTagName('a');

        $results = [];
        //Iterate over the extracted links and keep those on the given domain
        foreach ($links as $link){
            //Extract and put the matching links in an array for the CSV
            $href = $link->getAttribute('href');
            $parts = parse_url($href);
            //Relative links have no host, so check it exists before comparing
            if (!empty($parts['path']) && !empty($parts['host']) && strpos($this->domain, $parts['host']) !== false) {
                $results[$parts['path']] = [$parts['path'], $link->nodeValue];
            }
        }

        asort($results);
        // Make the CSV
        $fp = fopen($file .'.csv', 'w');
        foreach ($results as $fields) {
            fputcsv($fp, $fields);
        }
        fclose($fp);
    }
}
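The domain check is the part that decides which links end up in the CSV, so here it is on its own as a rough sketch. The function name `isOnDomain` is hypothetical, and two choices are assumptions of this sketch rather than behaviour of the class above: hosts are compared exactly (the class uses a substring match via `strpos`), and relative links are treated as on-site.

```php
<?php
// Hypothetical helper: true when $href points at the same host as $domain.
function isOnDomain(string $href, string $domain): bool
{
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // seriously malformed URL
    }
    if (empty($parts['host'])) {
        // No host: a relative link ("/about", "#top") counts as on-site,
        // while scheme-only links (mailto:, javascript:) do not.
        return empty($parts['scheme']);
    }
    $siteHost = (string) parse_url($domain, PHP_URL_HOST);

    // Host names are case-insensitive, so compare without regard to case.
    return strcasecmp($parts['host'], $siteHost) === 0;
}

$site = 'https://www.example.co.uk';
var_dump(isOnDomain('https://www.example.co.uk/menu', $site));
var_dump(isOnDomain('https://twitter.com/example', $site));
var_dump(isOnDomain('/about', $site));
```

An exact comparison avoids accepting a host that merely happens to be a substring of the configured domain; whether that stricter rule is wanted depends on the site.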

Answer by Maniruzzaman Akash

Here is some work I did for one of my clients, wrapped in a function so it can be used anywhere.

function getValidUrlsFrompage($source)
  {
    $links = [];
    $content = file_get_contents($source);
    $content = strip_tags($content, "<a>");
    $subString = preg_split("/<\/a>/", $content);
    foreach ($subString as $val) {
      if (strpos($val, "<a href=") !== FALSE) {
        $val = preg_replace("/.*<a\s+href=\"/sm", "", $val);
        $val = preg_replace("/\".*/", "", $val);
        $val = trim($val);
      }
      if (strlen($val) > 0 && filter_var($val, FILTER_VALIDATE_URL)) {
        if (!in_array($val, $links)) {
          $links[] = $val;
        }
      }
    }
    return $links;
  }

Use it like this:

$links = getValidUrlsFrompage("https://www.w3resource.com/");

The expected output is an array of URLs (the listing below has 101 entries, indexed 0 to 100):

Array ( [0] => https://www.w3resource.com [1] => https://www.w3resource.com/html/HTML-tutorials.php [2] => https://www.w3resource.com/css/CSS-tutorials.php [3] => https://www.w3resource.com/javascript/javascript.php [4] => https://www.w3resource.com/html5/introduction.php [5] => https://www.w3resource.com/schema.org/introduction.php [6] => https://www.w3resource.com/phpjs/use-php-functions-in-javascript.php [7] => https://www.w3resource.com/twitter-bootstrap/tutorial.php [8] => https://www.w3resource.com/responsive-web-design/overview.php [9] => https://www.w3resource.com/zurb-foundation3/introduction.php [10] => https://www.w3resource.com/pure/ [11] => https://www.w3resource.com/html5-canvas/ [12] => https://www.w3resource.com/course/javascript-course.html [13] => https://www.w3resource.com/icon/ [14] => https://www.w3resource.com/linux-system-administration/installation.php [15] => https://www.w3resource.com/linux-system-administration/linux-commands-introduction.php [16] => https://www.w3resource.com/php/php-home.php [17] => https://www.w3resource.com/python/python-tutorial.php [18] => https://www.w3resource.com/java-tutorial/ [19] => https://www.w3resource.com/node.js/node.js-tutorials.php [20] => https://www.w3resource.com/ruby/ [21] => https://www.w3resource.com/c-programming/programming-in-c.php [22] => https://www.w3resource.com/sql/tutorials.php [23] => https://www.w3resource.com/mysql/mysql-tutorials.php [24] => https://w3resource.com/PostgreSQL/tutorial.php [25] => https://www.w3resource.com/sqlite/ [26] => https://www.w3resource.com/mongodb/nosql.php [27] => https://www.w3resource.com/API/google-plus/tutorial.php [28] => https://www.w3resource.com/API/youtube/tutorial.php [29] => https://www.w3resource.com/API/google-maps/index.php [30] => https://www.w3resource.com/API/flickr/tutorial.php [31] => https://www.w3resource.com/API/last.fm/tutorial.php [32] => https://www.w3resource.com/API/twitter-rest-api/ [33] => https://www.w3resource.com/xml/xml.php 
[34] => https://www.w3resource.com/JSON/introduction.php [35] => https://www.w3resource.com/ajax/introduction.php [36] => https://www.w3resource.com/html-css-exercise/index.php [37] => https://www.w3resource.com/javascript-exercises/ [38] => https://www.w3resource.com/jquery-exercises/ [39] => https://www.w3resource.com/jquery-ui-exercises/ [40] => https://www.w3resource.com/coffeescript-exercises/ [41] => https://www.w3resource.com/php-exercises/ [42] => https://www.w3resource.com/python-exercises/ [43] => https://www.w3resource.com/c-programming-exercises/ [44] => https://www.w3resource.com/csharp-exercises/ [45] => https://www.w3resource.com/java-exercises/ [46] => https://www.w3resource.com/sql-exercises/ [47] => https://www.w3resource.com/oracle-exercises/ [48] => https://www.w3resource.com/mysql-exercises/ [49] => https://www.w3resource.com/sqlite-exercises/ [50] => https://www.w3resource.com/postgresql-exercises/ [51] => https://www.w3resource.com/mongodb-exercises/ [52] => https://www.w3resource.com/twitter-bootstrap/examples.php [53] => https://www.w3resource.com/euler-project/ [54] => https://w3resource.com/w3skills/html5-quiz/ [55] => https://w3resource.com/w3skills/php-fundamentals/ [56] => https://w3resource.com/w3skills/sql-beginner/ [57] => https://w3resource.com/w3skills/python-beginner-quiz/ [58] => https://w3resource.com/w3skills/mysql-basic-quiz/ [59] => https://w3resource.com/w3skills/javascript-basic-skill-test/ [60] => https://w3resource.com/w3skills/javascript-advanced-quiz/ [61] => https://w3resource.com/w3skills/javascript-quiz-part-iii/ [62] => https://w3resource.com/w3skills/mongodb-basic-quiz/ [63] => https://www.w3resource.com/form-template/ [64] => https://www.w3resource.com/slides/ [65] => https://www.w3resource.com/convert/number/binary-to-decimal.php [66] => https://www.w3resource.com/excel/ [67] => https://www.w3resource.com/video-tutorial/php/some-basics-of-php.php [68] => 
https://www.w3resource.com/video-tutorial/javascript/list-of-tutorial.php [69] => https://www.w3resource.com/web-development-tools/firebug-tutorials.php [70] => https://www.w3resource.com/web-development-tools/useful-web-development-tools.php [71] => https://www.facebook.com/w3resource [72] => https://twitter.com/w3resource [73] => https://plus.google.com/+W3resource [74] => https://in.linkedin.com/in/w3resource [75] => https://feeds.feedburner.com/W3resource [76] => https://www.w3resource.com/ruby-exercises/ [77] => https://www.w3resource.com/graphics/matplotlib/ [78] => https://www.w3resource.com/python-exercises/numpy/index.php [79] => https://www.w3resource.com/python-exercises/pandas/index.php [80] => https://w3resource.com/plsql-exercises/ [81] => https://w3resource.com/swift-programming-exercises/ [82] => https://www.w3resource.com/angular/getting-started-with-angular.php [83] => https://www.w3resource.com/react/react-js-overview.php [84] => https://www.w3resource.com/vue/installation.php [85] => https://www.w3resource.com/jest/jest-getting-started.php [86] => https://www.w3resource.com/numpy/ [87] => https://www.w3resource.com/php/composer/a-gentle-introduction-to-composer.php [88] => https://www.w3resource.com/php/PHPUnit/a-gentle-introduction-to-unit-test-and-testing.php [89] => https://www.w3resource.com/laravel/laravel-tutorial.php [90] => https://www.w3resource.com/oracle/index.php [91] => https://www.w3resource.com/redis/index.php [92] => https://www.w3resource.com/cpp-exercises/ [93] => https://www.w3resource.com/r-programming-exercises/ [94] => https://w3resource.com/w3skills/ [95] => https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US [96] => https://www.w3resource.com/privacy.php [97] => https://www.w3resource.com/about.php [98] => https://www.w3resource.com/contact.php [99] => https://www.w3resource.com/feedback.php [100] => https://www.w3resource.com/advertise.php )

Hope this helps someone. Here is a gist: https://gist.github.com/ManiruzzamanAkash/74cffb9ffdfc92f57bd9cf214cf13491
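Since another answer on this page advises against parsing HTML with regular expressions, the same idea can also be sketched with DOMDocument. This is a hypothetical variant, not the code from the gist: the name `getValidUrlsFromHtml` is made up, and it takes the HTML as a string so it can be tried without a network request.

```php
<?php
// Hypothetical DOM-based variant of the function above.
function getValidUrlsFromHtml(string $html): array
{
    $dom = new DOMDocument;
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $url = trim($a->getAttribute('href'));
        // Keep only URLs that pass PHP's validator, as the original does.
        if ($url !== '' && filter_var($url, FILTER_VALIDATE_URL) !== false) {
            $links[$url] = true; // using the URL as key removes duplicates
        }
    }
    return array_keys($links);
}

$html = '<a href="https://example.com/a">A</a>'
      . "<a class=\"x\" href='https://example.com/a'>duplicate</a>"
      . '<a href="/relative">relative link</a>';

print_r(getValidUrlsFromHtml($html));
```

Unlike the regex version, this does not depend on `href` being the first attribute or on double quotes around its value.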