Python + BeautifulSoup: How to get 'href' attribute of 'a' element?

Note: this content is taken from StackOverflow under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43814754/



python, html, web-scraping, beautifulsoup

Asked by t.m.adam

I have the following:


  html =
  '''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
      <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
      Down
    </div>
  </div>'''

And would like to get just the text of href, which is /file-one/additional. So I did:


from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = “”

for a in soup.find_all(‘a', href=True, text=True):
    link_text = a[‘href']

print “Link: “ + link_text

But it just prints a blank, nothing. Just Link:. So I tested it out on another site but with a different HTML, and it worked.


What could I be doing wrong? Or is it possible that the site is intentionally programmed not to return the href?


Thank you in advance and will be sure to upvote/accept answer!


Answer by t.m.adam

The 'a' tag in your html does not have any text directly, but it contains an 'h3' tag that has text. This means that text is None, and .find_all() fails to select the tag. Generally, do not use the text parameter if a tag contains any other html elements except text content.

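To see this concretely, here is a minimal check (a sketch using the question's html, rewritten with straight quotes) showing that the text=True filter skips the 'a' tag, while dropping the filter finds it:

from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# The <a> tag's direct children are whitespace and the <h3> tag, so it has
# no string of its own and the text=True filter skips it.
print(soup.find_all('a', href=True, text=True))  # []

# Without text=True the same tag is found.
print(soup.find_all('a', href=True))  # [<a class="file-link" href="/file-one/additional">...</a>]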

You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.


soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.


links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().


tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)


If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have links but that's not a requirement, so I think it's best to use the href argument.


Using .find_all().


links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.


links = [a['href'] for a in soup.select('a[href]')]

Answer by whackamadoodle3000

  1. First of all, use a different text editor that doesn't use curly quotes.

  2. Second, remove the text=True flag from the soup.find_all call (a corrected sketch follows below).

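Putting both fixes together, a minimal sketch of the corrected script (assuming the question's html, rewritten with straight quotes and a Python 3 print call):

from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">
    Down
  </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

link_text = ""
# No text=True filter, so the <a> tag is matched even though its text
# lives inside the nested <h3>.
for a in soup.find_all('a', href=True):
    link_text = a['href']

print("Link: " + link_text)  # Link: /file-one/additional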

Answer by Rakshit Vats

You can also use attrs to get the href attribute, with a regex search:


import re

soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
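With the question's html, this expression evaluates to '/file-one/additional'.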