How to write a Python script to search a website's HTML for matching links
Original source: http://stackoverflow.com/questions/2376798/
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Asked by GeminiDNK
I am not too familiar with Python and have to write a script to perform a host of functions. The piece I still need is a way to check a website's HTML for links that match ones provided beforehand.
Answered by Nick Presta
Matching what about the links? Their HREF attribute? The link display text? Perhaps something like:
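# Python 2 code using BeautifulSoup 3
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
import urllib2
# Download the raw HTML of the page
doc = urllib2.urlopen("http://somesite.com").read()
# Restrict parsing to <a> tags whose href starts with "test"
links = SoupStrainer('a', href=re.compile(r'^test'))
# Parse only those tags and keep each one as a string
soup = [str(elm) for elm in BeautifulSoup(doc, parseOnlyThese=links)]
for elm in soup:
    print elm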
That will grab the HTML content of somesite.com and then parse it using BeautifulSoup, looking only for links whose HREF attribute starts with "test". It then builds a list of these links and prints them out.
You can adapt this to other kinds of matching by consulting the BeautifulSoup documentation.
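For readers on Python 3 without BeautifulSoup installed, the same idea (keep only links whose HREF matches a pattern) can be sketched with the standard library's html.parser. This is a substitute technique, not the answer's own code: the names LinkCollector, find_links, and SAMPLE_HTML are made up for illustration, and the sample page stands in for a real download.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that match a compiled regex."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None
        href = dict(attrs).get("href")
        if href and self.pattern.search(href):
            self.matches.append(href)


def find_links(html, pattern):
    """Return the href values in `html` that match `pattern`."""
    collector = LinkCollector(re.compile(pattern))
    collector.feed(html)
    collector.close()
    return collector.matches


# Hypothetical sample page, used instead of a live download
SAMPLE_HTML = """
<a href="test/page1.html">One</a>
<a href="other.html">Two</a>
<a href="test2.html">Three</a>
<a name="anchor-without-href">Four</a>
"""

print(find_links(SAMPLE_HTML, r"^test"))
```

Unlike the BeautifulSoup version, this returns the matching href strings rather than the full tag markup; matching on the link's display text would need a handle_data method as well.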
Answered by ghostdog74
Generally, you use urllib, urllib2 (htmllib, etc.) for web programming in Python. You could also use mechanize, curl, etc. Then, for processing HTML and getting links, you would want to use a parser like BeautifulSoup.
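The two stages described here, fetching a page and then parsing it for links, might look like the following in Python 3, where urllib2's functionality lives in urllib.request. This is only a sketch under assumptions: it uses the standard library's html.parser in place of BeautifulSoup, the helper names (HrefParser, fetch_html, find_wanted_links) are invented for illustration, and the page content and wanted list are hypothetical so that the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class HrefParser(HTMLParser):
    """Record every href found on an <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_html(url):
    """Download a page and decode it (assumes UTF-8, which may not hold)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_wanted_links(html, base_url, wanted):
    """Return the entries of `wanted` that the page links to.

    Relative hrefs are resolved against `base_url` before comparing.
    """
    parser = HrefParser()
    parser.feed(html)
    parser.close()
    found = {urljoin(base_url, href) for href in parser.hrefs}
    return [link for link in wanted if link in found]


# Hypothetical inputs; a real run would call fetch_html(base_url) instead
base_url = "http://somesite.com/"
page = '<a href="/about">About</a> <a href="http://somesite.com/news">News</a>'
wanted = ["http://somesite.com/about", "http://somesite.com/jobs"]

print(find_wanted_links(page, base_url, wanted))
```

Resolving relative links with urljoin before comparing means a link written as /about on the page still matches a full URL in the list provided beforehand.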
Answered by Frederic Bazin
Try Scrapy, the most comprehensive web extraction framework.