Python + BeautifulSoup: How to get 'href' attribute of 'a' element?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/43814754/
Python + BeautifulSoup: How to get ‘href’ attribute of ‘a’ element?
Asked by t.m.adam
I have the following:
html = '''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
        <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
        Down
    </div>
</div>'''
And would like to get just the text of href, which is /file-one/additional. So I did:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
link_text = “”
for a in soup.find_all(‘a', href=True, text=True):
    link_text = a[‘href']
print “Link: “ + link_text
But it just prints a blank, nothing, just "Link: ". So I tested it out on another site but with a different HTML, and it worked.
What could I be doing wrong? Or is there a possibility that the site is intentionally programmed not to return the href?
Thank you in advance and will be sure to upvote/accept answer!
Answered by t.m.adam
The 'a' tag in your html does not have any text directly, but it contains an 'h3' tag that has text. This means that text is None, and .find_all() fails to select the tag. Generally, do not use the text parameter if a tag contains any other html elements except text content.
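To make the difference concrete, here is a minimal sketch (using the question's html, retyped with straight quotes): the text=True filter matches nothing because the 'a' tag's .string is None, while a plain href=True search still finds the tag and its .text carries the nested h3's content.

from bs4 import BeautifulSoup

html = '''<div class="file-one">
    <a href="/file-one/additional" class="file-link">
        <h3 class="file-name">File One</h3>
    </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# The <a> tag's .string is None (its only content is the nested <h3>),
# so the text=True filter rejects it and nothing is returned:
print(soup.find_all('a', href=True, text=True))   # []

# Without the text filter the tag is found, and .text exposes the h3's text:
a = soup.find('a', href=True)
print(a['href'])        # /file-one/additional
print(a.text.strip())   # File One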
You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.
soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
Or you could use a list comprehension, if you prefer one-liners.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
Or you could pass a lambda to .find_all().
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
If you want to collect all links whether they have text or not, just select all 'a' tags that have a 'href' attribute. Anchor tags usually have links but that's not a requirement, so I think it's best to use the href argument.
Using .find_all():
links = [a['href'] for a in soup.find_all('a', href=True)]
Using .select() with CSS selectors:
links = [a['href'] for a in soup.select('a[href]')]
Answered by whackamadoodle3000
First of all, use a different text editor that doesn't use curly quotes.
Second, remove the text=True flag from the soup.find_all call.
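Putting both fixes together, the question's snippet would look roughly like this (a sketch, assuming Python 3, which is why print is called as a function):

from bs4 import BeautifulSoup

# Same markup as in the question, but typed with plain straight quotes.
html = '''<div class="file-one">
    <a href="/file-one/additional" class="file-link">
        <h3 class="file-name">File One</h3>
    </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

link_text = ""
for a in soup.find_all('a', href=True):   # text=True removed
    link_text = a['href']

print("Link: " + link_text)   # Link: /file-one/additional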
Answered by Rakshit Vats
You can also use attrs to get the href attribute with a regex search:
import re

soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
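Run against the question's markup, this returns the same link; here is a quick self-contained check (assuming the html is typed with straight quotes):

import re
from bs4 import BeautifulSoup

html = '<div class="file-one"><a href="/file-one/additional" class="file-link"><h3 class="file-name">File One</h3></a></div>'
soup = BeautifulSoup(html, 'html.parser')

# The pattern matches a "/" followed by letters, so it picks up the link.
print(soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href'])
# /file-one/additional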