Python: extract only the text from this element, not its children

Question

Asked by Dragon

I want to extract only the text from the top-most element of my soup; however, soup.text returns the text of all the child elements as well:

I have:

import BeautifulSoup
soup=BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
print soup.text

The output of this is yesno. I simply want 'yes'.

What's the best way of achieving this?

Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.

Answer 1

Accepted answer by jbochi

What about .find(text=True)?

>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').find(text=True)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').find(text=True)
u'no'

EDIT:

I think I understand what you want now. Try this:

>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
u'yes'

Answer 2

Answered by TigrisC

You could use contents (this only helps when the text is the first child of the element):

>>> print soup.html.contents[0]
yes

Or, to get all the text nodes directly under html, use findAll(text=True, recursive=False):

>>> soup = BeautifulSoup.BeautifulSOAP('<html>x<b>no</b>yes</html>')
>>> soup.html.findAll(text=True, recursive=False) 
[u'x', u'yes']

The same result, joined to form a single string:

>>> ''.join(soup.html.findAll(text=True, recursive=False)) 
u'xyes'
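
For readers on Python 3 with the bs4 package, the same idea can be sketched roughly as follows. This assumes bs4 4.4.0 or later, where string is the newer spelling of the text argument and find_all replaces findAll. The helper name direct_text is made up for illustration, and the html.parser backend is chosen only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Join the text nodes that are direct children of `tag`.

    `direct_text` is a hypothetical helper, not part of bs4.
    """
    # recursive=False restricts the search to direct children, so text
    # inside nested tags such as <b> is skipped.
    return "".join(tag.find_all(string=True, recursive=False))


for markup in ("<html>yes<b>no</b></html>",
               "<html><b>no</b>yes</html>",
               "<html>x<b>no</b>yes</html>"):
    soup = BeautifulSoup(markup, "html.parser")
    print(direct_text(soup.html))
```

Joining every direct text node, rather than taking only the first one with find, keeps the result independent of where the child tags sit inside the element.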

Answer 3

Answered by mzjn

You might want to look into lxml's soupparser module, which has support for XPath:

>>> from lxml.html.soupparser import fromstring
>>> s1 = '<html>yes<b>no</b></html>'
>>> s2 = '<html><b>no</b>yes</html>'
>>> soup1 = fromstring(s1)
>>> soup2 = fromstring(s2)
>>> soup1.xpath("text()")
['yes']
>>> soup2.xpath("text()")
['yes']
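
In lxml's tree model, text directly inside an element is stored in the element's .text (before the first child) and in each child's .tail (after that child), and those are the nodes that text() selects. The sketch below illustrates that model with the standard library's xml.etree.ElementTree, which exposes the same .text and .tail attributes, as a stand-in for lxml. It assumes well-formed markup, and own_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def own_text(elem):
    """Return the text directly inside `elem`, skipping descendants' text.

    `own_text` is a hypothetical helper name used only for this sketch.
    """
    parts = [elem.text or ""]            # text before the first child
    for child in elem:
        parts.append(child.tail or "")   # text that follows each child
    return "".join(parts)


print(own_text(ET.fromstring("<html>yes<b>no</b></html>")))   # yes
print(own_text(ET.fromstring("<html><b>no</b>yes</html>")))   # yes
print(own_text(ET.fromstring("<html>x<b>no</b>yes</html>")))  # xyes
```

Unlike soupparser, ElementTree's fromstring rejects malformed HTML, so this version is only a fit when the input is known to be well-formed.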

Answer 4

Answered by Horst Miller

This works for me in bs4:

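import bs4
# Name the parser explicitly so the result does not depend on which parsers are installed.
node = bs4.BeautifulSoup('<html><div>A<span>B</span>C</div></html>', 'html.parser').find('div')
# Keep only the plain text nodes that are direct children of the div.
print("".join(t for t in node.contents if type(t) is bs4.element.NavigableString))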

Output:

AC
