
Declaration: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/2376798/

Date: 2020-11-04 00:29:59 | Source: igfitidea

How to write a python script to search a website html for matching links

Tags: python, scrape

Asked by GeminiDNK

I am not too familiar with Python and have to write a script to perform a host of functions. Basically, the module I still need is one that checks a website's HTML for matching links provided beforehand.


Answer by Nick Presta

Matching links to what? Their HREF attribute? The link display text? Perhaps something like:


# Python 2 / BeautifulSoup 3 code, as in the original answer.
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
import urllib2

# Fetch the raw HTML of the page.
doc = urllib2.urlopen("http://somesite.com").read()
# Restrict parsing to <a> tags whose href starts with "test".
links = SoupStrainer('a', href=re.compile(r'^test'))
soup = [str(elm) for elm in BeautifulSoup(doc, parseOnlyThese=links)]
for elm in soup:
    print elm

That will grab the HTML content of somesite.com and then parse it using BeautifulSoup, looking only for links whose HREF attribute starts with "test". It then builds a list of these links and prints them out.


You can modify this to do just about anything; see the BeautifulSoup documentation for details.

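For reference, that answer targets Python 2 and BeautifulSoup 3. A minimal sketch of the same technique in Python 3 with BeautifulSoup 4, assuming the third-party requests and beautifulsoup4 packages and keeping the placeholder URL and "test" prefix from the answer:

import re
import requests  # third-party HTTP client
from bs4 import BeautifulSoup, SoupStrainer  # pip install beautifulsoup4

# Fetch the page (somesite.com is a placeholder, as above).
html = requests.get("http://somesite.com").text

# Parse only <a> tags whose href starts with "test".
links = SoupStrainer("a", href=re.compile(r"^test"))
soup = BeautifulSoup(html, "html.parser", parse_only=links)

for a in soup.find_all("a"):
    print(a.get("href"), a.get_text())

Note that BeautifulSoup 4 renames parseOnlyThese to parse_only and moves the import to the bs4 package.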

Answer by ghostdog74

Generally, you use urllib or urllib2 (htmllib, etc.) for web programming in Python. You could also use mechanize, curl, etc. Then, for processing the HTML and extracting the links, you would want to use a parser like BeautifulSoup.

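If installing third-party packages is not an option, the link extraction can also be done with nothing but the standard library. A minimal Python 3 sketch using urllib.request and html.parser; the URL and the wanted set are hypothetical placeholders:

import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect the href value of every <a> tag encountered.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = urllib.request.urlopen("http://somesite.com").read().decode("utf-8", "replace")
parser = LinkCollector()
parser.feed(html)

# Match against a list of links provided beforehand, as the question asks.
wanted = {"test/page1.html", "test/page2.html"}  # hypothetical targets
print([link for link in parser.links if link in wanted])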

Answer by Frederic Bazin

Try Scrapy, the most comprehensive web extraction framework.


http://scrapy.org

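A minimal Scrapy spider sketch for the same task, assuming the scrapy package; the domain and the "test" prefix are placeholders carried over from the answers above:

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["http://somesite.com"]  # placeholder

    def parse(self, response):
        # Yield every href that starts with "test".
        for href in response.css("a::attr(href)").getall():
            if href.startswith("test"):
                yield {"href": href}

Saved as, say, link_spider.py, it can be run with scrapy runspider link_spider.py -o links.json.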