Python BS4: Getting text in a tag
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/25251841/
BS4: Getting text in tag
Asked by Milano
I'm using beautiful soup. There is a tag like this:
<li><a href="example"> s.r.o., <small>small</small></a></li>
I want to get only the text directly inside the anchor <a> tag, without any text from the <small> tag in the output; i.e. " s.r.o.,"
I tried find('li').text[0] but it does not work.
Is there a command in BS4 which can do that?
Accepted answer by alecxe
One option would be to get the first element from the contents of the a element:
>>> from bs4 import BeautifulSoup
>>> data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
>>> soup = BeautifulSoup(data, 'html.parser')
>>> print(soup.find('a').contents[0])
s.r.o.,
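To make it clearer why find('li').text[0] from the question fails while contents[0] works, here is a minimal sketch using the same markup. The variable names (full_text, first_char, children, first_text) are only illustrative, and the explicit 'html.parser' argument is an assumption about which parser you want.

```python
from bs4 import BeautifulSoup

data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(data, "html.parser")
a = soup.find("a")

# .text joins the strings of ALL descendants, so <small> is included.
full_text = a.text

# Indexing a string with [0] returns a single character, not the first text node.
first_char = soup.find("li").text[0]

# .contents is the list of direct children: one text node, then the <small> tag.
children = a.contents

# The first child is the text node we want; strip() drops the surrounding spaces.
first_text = children[0].strip()

print(repr(full_text))   # ' s.r.o., small'
print(repr(first_char))  # ' '
print(first_text)        # s.r.o.,
```

Note that contents[0] only gives the wanted text when the anchor actually starts with a text node.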
Another one would be to find the small tag and get its previous sibling:
>>> print(soup.find('small').previous_sibling)
s.r.o.,
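The previous_sibling approach assumes that a <small> tag exists and that a text node sits right before it. A hedged sketch of a guarded version follows; text_before_small is a hypothetical helper name, not part of Beautiful Soup, and the fallbacks it uses are just one possible choice.

```python
from bs4 import BeautifulSoup, NavigableString

def text_before_small(a_tag):
    """Return the text immediately before <small> inside a_tag (hypothetical helper)."""
    small = a_tag.find("small")
    if small is None:
        # No <small> at all: fall back to the anchor's whole text.
        return a_tag.get_text(strip=True)
    prev = small.previous_sibling  # None when <small> is the first child
    if isinstance(prev, NavigableString):
        return prev.strip()
    return ""

data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(data, "html.parser")
result = text_before_small(soup.find("a"))
print(result)  # s.r.o.,
```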
Well, there are all sorts of alternative/crazy options also:
>>> print(next(soup.find('a').descendants))
s.r.o.,
>>> print(next(iter(soup.find('a'))))
s.r.o.,
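All of the options above pick the first child of the anchor. If the anchor may also have text after the <small> tag, one alternative (a sketch, not something from the original answer) is to keep every direct text node and skip nested tags. own_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, NavigableString

def own_text(tag):
    """Join the text nodes that are direct children of tag, ignoring nested tags."""
    pieces = [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)
    ]
    # Drop pieces that were whitespace only.
    return " ".join(piece for piece in pieces if piece)

data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(data, "html.parser")
only_anchor_text = own_text(soup.find("a"))
print(only_anchor_text)  # s.r.o.,
```

Unlike next(...) on an iterator, this does not raise StopIteration when the anchor is empty; it simply returns an empty string.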
Answer by Sumanth Lazarus
If you would like to loop over and print the contents of all anchor tags in an HTML string or web page (for a web page, fetch the HTML first with urlopen from urllib), this works:
from bs4 import BeautifulSoup
data = '<li><a href="example">s.r.o., <small>small</small></a></li> <li><a href="example">2nd</a></li> <li><a href="example">3rd</a></li>'
soup = BeautifulSoup(data, 'html.parser')
a_tag = soup('a')  # calling the soup is shorthand for soup.find_all('a')
for tag in a_tag:
    print(tag.contents[0])  # .contents is the list of direct children; [0] is the leading text inside <a>
Output:
s.r.o.,
2nd
3rd
a_tag is a list (a bs4 ResultSet) containing all anchor tags; collecting them in a list enables group editing (if more than one <a> tag is present).
>>> print(a_tag)
[<a href="example">s.r.o., <small>small</small></a>, <a href="example">2nd</a>, <a href="example">3rd</a>]
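As a rough sketch of how the loop could be wrapped for reuse, including the urlopen case mentioned in this answer: anchor_texts and anchor_texts_from_url are hypothetical helper names, and returning an empty string for anchors that do not start with text is an assumption, not behaviour from the original answer.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup, NavigableString

def anchor_texts(html):
    """Return the leading text of every <a> tag in html (str or bytes)."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for a in soup.find_all("a"):
        first = a.contents[0] if a.contents else None
        # Only keep the first child when it is a text node, not a nested tag.
        texts.append(first.strip() if isinstance(first, NavigableString) else "")
    return texts

def anchor_texts_from_url(url):
    """Fetch a page with urlopen and run anchor_texts on it (not called here)."""
    with urlopen(url) as response:
        return anchor_texts(response.read())

data = ('<li><a href="example">s.r.o., <small>small</small></a></li> '
        '<li><a href="example">2nd</a></li> <li><a href="example">3rd</a></li>')
texts = anchor_texts(data)
print(texts)  # ['s.r.o.,', '2nd', '3rd']
```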

