Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/16121001/

Time: 2020-08-18 21:48:31  Source: igfitidea

Suggestions on get_text() in BeautifulSoup

python beautifulsoup

Asked by user601836

I am using BeautifulSoup to parse some content from an HTML page.

I can extract the content I want from the HTML (i.e. the text contained in a span defined by the class myclass).

result = mycontent.find(attrs={'class':'myclass'})

I obtain this result:

<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>

If I try to extract the text using:

result.get_text()

I obtain:

Lorem ipsumdolor sit amet,consectetur...

As you can see, when the <br> tag is removed there is no spacing left between the pieces of text, and adjacent words are concatenated.

How can I solve this issue?

Accepted answer by Sean Vieira

If you are using bs4, you can use the strings generator and join the pieces yourself:

" ".join(result.strings)
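
To see why this works, here is a minimal self-contained sketch. It assumes the HTML from the question is inlined as a string (the variable name html is only illustrative) and that Python's built-in html.parser is used. Neither detail comes from the answer itself.

```python
from bs4 import BeautifulSoup

# Assumption: the span from the question, inlined instead of read from a page.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# .strings yields each text node in document order; the <br/> tags
# contain no text, so they contribute nothing.
pieces = list(result.strings)

# Joining with a space restores the separation that get_text() drops.
joined = " ".join(pieces)
print(joined)
```

Because the join character is chosen by the caller, the same pattern works with "\n" if the line breaks should be preserved as newlines.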

Answer by Floris

Use 'contents', then replace <br>?

Here is a full (working, tested) example:

# Python 2 code as originally posted (urllib2 and print statements).
from bs4 import BeautifulSoup
import urllib2

# Fetch and parse the test page.
url = "http://www.floris.us/SO/bstest.html"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

result = soup.find(attrs={'class': 'myclass'})
print "The result of soup.find:"
print result

print "\nresult.contents:"
print result.contents
print "\nresult.get_text():"
print result.get_text()

# A <br/> tag has no text, so its .string is None; give it a single space.
for r in result:
    if r.string is None:
        r.string = ' '

print "\nAfter replacing all the 'None' with ' ':"
print result.get_text()

Result:

The result of soup.find:
<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>

result.contents:
[u'Lorem ipsum', <br/>, u'dolor sit amet,', <br/>, u'consectetur...']

result.get_text():
Lorem ipsumdolor sit amet,consectetur...

After replacing all the 'None' with ' ':
Lorem ipsum dolor sit amet, consectetur...

This is more elaborate than Sean's very compact solution, but since I had said I would create and test a solution along the lines I had indicated when I could, I decided to follow through on my promise. You can see a little better what is going on here: the <br/> is its own element in the result.contents list, but when converted to a string there's "nothing left".
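
For readers on Python 3, the same idea can be sketched as follows. This is a variant under stated assumptions, not the answer's code: the HTML is inlined rather than fetched from the URL, html.parser is chosen as the parser, and each <br> is swapped for a space with replace_with() instead of by assigning to .string.

```python
from bs4 import BeautifulSoup

# Assumption: the span from the question, inlined instead of downloaded.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# find_all() returns a list-like snapshot, so it is safe to modify
# the tree while looping over it.
for br in result.find_all("br"):
    br.replace_with(" ")  # the <br/> element becomes a one-space text node

text = result.get_text()
print(text)
```

Note that this modifies the parsed tree in place, so it suits cases where the original <br> elements are not needed afterwards.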

Answer by explorer

result.get_text(separator=" ") should work.
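
A short sketch comparing the two calls, assuming the span from the question is inlined as a string and parsed with html.parser (both are illustrative choices, not part of the answer):

```python
from bs4 import BeautifulSoup

# Assumption: the span from the question, inlined for a self-contained test.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
result = BeautifulSoup(html, "html.parser").find(attrs={"class": "myclass"})

# Without a separator, adjacent text nodes are concatenated directly.
plain = result.get_text()

# With separator=" ", a space is placed between adjacent text nodes.
spaced = result.get_text(separator=" ")

print(plain)
print(spaced)
```

The separator is inserted between every pair of text nodes, not only where a <br> occurred, so markup such as nested inline tags will also produce spaces.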