Advice on get_text() in Python BeautifulSoup

Question

Asked by user601836

I am using BeautifulSoup to parse some content from an HTML page.

I can extract the content I want from the HTML (i.e. the text contained in a span defined by the class myclass).

result = mycontent.find(attrs={'class':'myclass'})

I obtain this result:

<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>

If I try to extract the text using:

result.get_text()

I obtain:

Lorem ipsumdolor sit amet,consectetur...

As you can see, when the <br/> tag is removed there is no spacing left between the pieces of text, and two words are concatenated.

How can I solve this issue?

Answer 1

Accepted answer by Sean Vieira

If you are using bs4, you can use strings:

" ".join(result.strings)
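As a minimal sketch of what this does, the snippet below parses the span from the question directly from a string instead of a downloaded page. The inline html value and the choice of the built-in "html.parser" are assumptions made for illustration; strings yields each text fragment separately, so joining them with a space restores the gaps left by the <br/> tags.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup from the question (assumed here instead of a real page).
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# .strings is a generator over the text fragments; the <br/> tags contribute nothing.
pieces = list(result.strings)
joined = " ".join(result.strings)

print(pieces)
print(joined)
```

Note that the separator is inserted between every pair of fragments, so text that already ends in whitespace would end up with a double space.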

Answer 2

Answer by Floris

Use 'contents', then replace the <br/>?

Here is a full (working, tested) example:

from bs4 import BeautifulSoup
import urllib2

url="http://www.floris.us/SO/bstest.html"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

result = soup.find(attrs={'class':'myclass'})
print "The result of soup.find:"
print result

print "\nresult.contents:"
print result.contents
print "\nresult.get_text():"
print result.get_text()
for r in result:
  if (r.string is None):
    r.string = ' '

print "\nAfter replacing all the 'None' with ' ':"
print result.get_text()

Result:

The result of soup.find:
<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>

result.contents:
[u'Lorem ipsum', <br/>, u'dolor sit amet,', <br/>, u'consectetur...']

result.get_text():
Lorem ipsumdolor sit amet,consectetur...

After replacing all the 'None' with ' ':
Lorem ipsum dolor sit amet, consectetur...

This is more elaborate than Sean's very compact solution, but since I had said I would create and test a solution along the lines I had indicated when I could, I decided to follow through on my promise. You can see a little better what is going on here: the <br/> is its own element in the result.contents list, but when converted to a string there is "nothing left".
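The example above is written for Python 2 (urllib2 and print statements). A rough Python 3 sketch of the same idea follows; it assumes the markup is supplied as an inline string rather than fetched from the URL, and it names "html.parser" explicitly. The variable names before and after are introduced here for illustration only.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup (an assumption; the original example downloads a page).
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

before = result.get_text()

# Each <br/> is a child Tag with no text, so its .string is None.
# Giving it a single space makes it contribute a separator to get_text().
for child in list(result.children):
    if child.string is None:
        child.string = " "

after = result.get_text()

print(before)
print(after)
```

This modifies the parsed tree in place, so it is best suited to cases where the tree is not needed in its original form afterwards.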

Answer 3

Answer by explorer

result.get_text(separator=" ") should work.
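A short sketch of this call on the markup from the question, again assuming an inline string and the built-in "html.parser". It also shows that the separator can be any string, for example a newline to keep the line breaks that the <br/> tags represented.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup from the question (assumed for illustration).
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# The separator is placed between adjacent text fragments.
spaced = result.get_text(separator=" ")
lines = result.get_text(separator="\n")

print(spaced)
print(lines)
```

get_text() also accepts strip=True, which trims whitespace from each fragment before joining; that can help when the source HTML contains indentation or newlines around the text.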
