
Declaration: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/2376798/

Date: 2020-11-04 00:29:59 | Source: igfitidea

How to write a python script to search a website html for matching links

Tags: python, scrape

Asked by GeminiDNK

I am not too familiar with Python and have to write a script to perform a host of functions. Basically, the module I still need is one that checks a website's HTML for matching links provided beforehand.


Answer by Nick Presta

Matching links to what? Their HREF attribute? The link display text? Perhaps something like:


# Python 2 / BeautifulSoup 3 code, as in the original answer.
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
import urllib2

# Fetch the raw HTML of the page.
doc = urllib2.urlopen("http://somesite.com").read()
# Restrict parsing to <a> tags whose href starts with "test".
links = SoupStrainer('a', href=re.compile(r'^test'))
soup = [str(elm) for elm in BeautifulSoup(doc, parseOnlyThese=links)]
for elm in soup:
    print elm

That will grab the HTML content of somesite.com and then parse it using BeautifulSoup, looking only for links whose HREF attribute starts with "test". It then builds a list of these links and prints them out.


You can modify this to do just about anything; see the BeautifulSoup documentation for details.

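For reference, that answer targets Python 2 and BeautifulSoup 3. A minimal sketch of the same technique in Python 3 with BeautifulSoup 4, assuming the third-party requests and beautifulsoup4 packages and keeping the placeholder URL and "test" prefix from the answer:

import re
import requests  # third-party HTTP client
from bs4 import BeautifulSoup, SoupStrainer  # pip install beautifulsoup4

# Fetch the page (somesite.com is a placeholder, as above).
html = requests.get("http://somesite.com").text

# Parse only <a> tags whose href starts with "test".
links = SoupStrainer("a", href=re.compile(r"^test"))
soup = BeautifulSoup(html, "html.parser", parse_only=links)

for a in soup.find_all("a"):
    print(a.get("href"), a.get_text())

Note that BeautifulSoup 4 renames parseOnlyThese to parse_only and moves the import to the bs4 package.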

Answer by ghostdog74

Generally, you use urllib or urllib2 (htmllib, etc.) for web programming in Python. You could also use mechanize, curl, etc. Then, for processing the HTML and extracting the links, you would want to use a parser like BeautifulSoup.

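If installing third-party packages is not an option, the link extraction can also be done with nothing but the standard library. A minimal Python 3 sketch using urllib.request and html.parser; the URL and the wanted set are hypothetical placeholders:

import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect the href value of every <a> tag encountered.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = urllib.request.urlopen("http://somesite.com").read().decode("utf-8", "replace")
parser = LinkCollector()
parser.feed(html)

# Match against a list of links provided beforehand, as the question asks.
wanted = {"test/page1.html", "test/page2.html"}  # hypothetical targets
print([link for link in parser.links if link in wanted])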

Answer by Frederic Bazin

Try Scrapy, the most comprehensive web extraction framework.


http://scrapy.org

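A minimal Scrapy spider sketch for the same task, assuming the scrapy package; the domain and the "test" prefix are placeholders carried over from the answers above:

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["http://somesite.com"]  # placeholder

    def parse(self, response):
        # Yield every href that starts with "test".
        for href in response.css("a::attr(href)").getall():
            if href.startswith("test"):
                yield {"href": href}

Saved as, say, link_spider.py, it can be run with scrapy runspider link_spider.py -o links.json.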