Python + BeautifulSoup: How to get 'href' attribute of 'a' element?

Note: this content is taken from StackOverflow under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43814754/



python, html, web-scraping, beautifulsoup

Asked by t.m.adam

I have the following:


  html =
  '''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
      <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
      Down
    </div>
  </div>'''

And would like to get just the text of href, which is /file-one/additional. So I did:


from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = “”

for a in soup.find_all(‘a', href=True, text=True):
    link_text = a[‘href']

print “Link: “ + link_text

But it just prints a blank, nothing. Just Link:. So I tested it out on another site but with a different HTML, and it worked.


What could I be doing wrong? Or is it possible that the site is intentionally programmed not to return the href?


Thank you in advance and will be sure to upvote/accept answer!


Answer by t.m.adam

The 'a' tag in your html does not have any text directly, but it contains an 'h3' tag that has text. This means that text is None, and .find_all() fails to select the tag. Generally, do not use the text parameter if a tag contains any other html elements except text content.

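To see this concretely, here is a minimal check (a sketch using the question's html, rewritten with straight quotes) showing that the text=True filter skips the 'a' tag, while dropping the filter finds it:

from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# The <a> tag's direct children are whitespace and the <h3> tag, so it has
# no string of its own and the text=True filter skips it.
print(soup.find_all('a', href=True, text=True))  # []

# Without text=True the same tag is found.
print(soup.find_all('a', href=True))  # [<a class="file-link" href="/file-one/additional">...</a>]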

You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.


soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.


links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().


tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)


If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have links but that's not a requirement, so I think it's best to use the href argument.


Using .find_all().


links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.


links = [a['href'] for a in soup.select('a[href]')]

Answer by whackamadoodle3000

  1. First of all, use a different text editor that doesn't use curly quotes.

  2. Second, remove the text=True flag from the soup.find_all call (a corrected sketch follows below).

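Putting both fixes together, a minimal sketch of the corrected script (assuming the question's html, rewritten with straight quotes and a Python 3 print call):

from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">
    Down
  </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

link_text = ""
# No text=True filter, so the <a> tag is matched even though its text
# lives inside the nested <h3>.
for a in soup.find_all('a', href=True):
    link_text = a['href']

print("Link: " + link_text)  # Link: /file-one/additional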

Answer by Rakshit Vats

You can also use attrs to get the href attribute, with a regex search:


import re

soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
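With the question's html, this expression evaluates to '/file-one/additional'.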