Python BeautifulSoup - modifying all links in a piece of HTML?

Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/459981/

Time: 2020-11-03 20:10:14  Source: igfitidea

BeautifulSoup - modifying all links in a piece of HTML?

python beautifulsoup

Asked by Evan Fosmark

I need to be able to modify every single link in an HTML document. I know that I need to use the SoupStrainer, but I'm not 100% sure how to implement it. If someone could direct me to a good resource or provide a code example, it'd be very much appreciated.

Thanks.

Answered by Lusid

Maybe something like this would work? (I don't have a Python interpreter in front of me, unfortunately)

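# BeautifulSoup 3 (Python 2) import; with BeautifulSoup 4 this is `from bs4 import BeautifulSoup`
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<p>Blah blah blah <a href="http://google.com">Google</a></p>')
for a in soup.findAll('a'):
    # Rewrite each link's target in place
    a['href'] = a['href'].replace("google", "mysite")

result = str(soup)  # serialize the modified document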

Answered by Evan Fosmark

# Written for BeautifulSoup 3 on Python 2 (note the print statement)
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<p>Blah blah blah <a href="http://google.com">Google</a></p>')
for a in soup.findAll('a'):
    a['href'] = a['href'].replace("google", "mysite")
print str(soup)

This is Lusid's solution, but since he didn't have a Python interpreter in front of him, he wasn't able to test it and it had a few errors. I just wanted to post the working version. Thanks, Lusid!
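
For readers on Python 3 with the `bs4` package (BeautifulSoup 4), the same loop could be sketched roughly as follows. This is a minimal sketch, assuming `beautifulsoup4` is installed; the helper name `rewrite_links` and its parameters are hypothetical and not part of the original answers.

```python
from bs4 import BeautifulSoup


def rewrite_links(html, old, new):
    """Return html with `old` replaced by `new` in every <a href=...>.

    Hypothetical helper, for illustration only.
    """
    # Name the parser explicitly; bs4 warns when it has to guess one.
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags without an href attribute,
    # so reading a["href"] cannot raise KeyError.
    for a in soup.find_all("a", href=True):
        a["href"] = a["href"].replace(old, new)
    return str(soup)


html = '<p>Blah blah blah <a href="http://google.com">Google</a></p>'
result = rewrite_links(html, "google", "mysite")
print(result)
```

The `href=True` filter is a design choice rather than something the answers above require: anchors such as `<a name="top">` have no `href`, and reading `a['href']` on them would fail.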

Answered by Aziz Alto

I tried this and it worked; it is simpler than using a regexp to match each 'href':

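from bs4 import BeautifulSoup as bs
# htmltext holds the HTML to process; a sample value is assumed here
htmltext = '<p>Blah blah blah <a href="http://google.com">Google</a></p>'
soup = bs(htmltext, "html.parser")  # name the parser explicitly
for a in soup.find_all('a'):
    a['href'] = "mysite"  # overwrite every link target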

Check it out in the bs4 docs.
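
The question mentions `SoupStrainer`. In bs4, a `SoupStrainer` passed as `parse_only` restricts parsing to the matching tags, so the resulting soup holds only the links and not the surrounding document. The following is a minimal sketch of that behavior, assuming `beautifulsoup4` is installed and using made-up sample HTML; it suggests why `SoupStrainer` suits extracting links more than rewriting a whole document.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Sample input, invented for illustration.
html = (
    '<p>Blah <a href="http://google.com">Google</a> '
    'and <a name="top">no href</a></p>'
)

# Parse only <a> tags; everything else is dropped during parsing.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

# Collect the targets of the anchors that actually have an href.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# The enclosing <p> was never parsed, so it is absent from the soup.
has_paragraph = soup.find("p") is not None

print(hrefs)
print(has_paragraph)
```

To modify links and keep the rest of the HTML, parsing the full document and looping over `find_all('a')`, as in the answers above, is the more direct route.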
