Python BeautifulSoup: extract text between elements
Original question: http://stackoverflow.com/questions/16835449/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by ???nq ???lo?
I am trying to extract "THIS IS MY TEXT" from the following HTML:
<html>
<body>
<table>
<td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</br>
</td>
</table>
</body>
</html>
I tried it this way:
soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text
But I get all the text between all nested Tags plus the comment.
Can anyone help me to just get "THIS IS MY TEXT" out of this?
Answered by Martijn Pieters
Use .children instead:
from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children
    if isinstance(child, NavigableString) and not isinstance(child, Comment))
Yes, this is a bit of a dance.
Output:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
...
THIS IS MY TEXT
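The snippet above is Python 2 (print statement, unicode). As a minimal sketch of the same idea for Python 3 with bs4, assuming the built-in "html.parser" backend and a trimmed copy of the question's HTML, the filter can be wrapped in a small helper. The name direct_text is hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the HTML from the question (an assumption for this sketch).
HTML = """
<table>
  <td class="MYCLASS">
    <!-- a comment -->
    <a href="xy">Text</a>
    <p>something</p>
    THIS IS MY TEXT
    <p>something else</p>
  </td>
</table>
"""


def direct_text(tag):
    """Join the strings that are direct children of tag, skipping comments."""
    parts = (
        str(child)
        for child in tag.children
        # Comment is a subclass of NavigableString, so it is excluded explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )
    return "".join(parts).strip()


soup = BeautifulSoup(HTML, "html.parser")
for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    print(direct_text(hit))
```

The final strip() is an addition in this sketch: the direct string children also include the whitespace between the nested tags, which would otherwise surround the result.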
Answered by TerryA
Answered by kiriloff
Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree consists of tags and NavigableStrings (such as THIS IS MY TEXT in the question). An example:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
To move down the parse tree you have contents and string.
- contents is an ordered list of the Tag and NavigableString objects contained within a page element.
- If a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0].
For the document above, this means you get:
soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
With several child nodes, you get for instance:
pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
So here you can work with contents and pick the child at the index you want.
You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance,
for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
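The walkthrough above uses BeautifulSoup 3 on Python 2. A minimal sketch of the same navigation with bs4 on Python 3 might look like the following. It assumes the "html.parser" backend, and the closing </p> tags are written out so that the result does not depend on how a particular parser repairs unclosed markup.

```python
from bs4 import BeautifulSoup

# Same document as above, with explicit </p> tags (an assumption of this sketch).
DOC = (
    "<html><head><title>Page title</title></head>"
    '<body><p id="firstpara">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara">This is paragraph <b>two</b>.</p>'
    "</body></html>"
)

soup = BeautifulSoup(DOC, "html.parser")

# A tag with exactly one string child exposes it as .string and .contents[0].
first_b = soup.b
only_child = first_b.string
same_child = first_b.contents[0]

# A tag with several children: index into .contents to pick one.
first_p = soup.p
children = first_p.contents

# Iterating over a tag walks its direct children.
body_ids = [child.get("id") for child in soup.body]

print(only_child, same_child)
print(children)
print(body_ids)
```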
Answered by alireza sanaee
The BeautifulSoup documentation provides an example of removing objects from a document using the extract method. In the following example the aim is to remove all comments from the document:
Removing Elements
Once you have a reference to an element, you can rip it out of the tree with the extract method. This code removes all the comments from a document:
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
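The quoted example targets BeautifulSoup 3. A rough bs4 equivalent for Python 3 is sketched below. It assumes the "html.parser" backend and a bs4 version that accepts the string= argument of find_all, and it closes the <a> and <b> tags explicitly rather than relying on parser repair.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3</b></a>",
    "html.parser",
)

# Collect every Comment node first, then detach each one from the tree.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

print(soup)
```

A plain for loop is used here instead of a list comprehension because extract() is called only for its side effect.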
Answered by Bennett Brown
Short answer: soup.findAll('p')[0].next.next
Real answer: You need an invariant reference point from which you can get to your target.
You mention in your comment to Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree following that invariant path.
For example, in the HTML you provide in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') finds all paragraph elements, soup.findAll('p')[0] is the first paragraph element.
In this case you could use soup.find('p'), but soup.findAll('p')[n] is more general, since your actual scenario may need the 5th paragraph or something like that.
The next attribute is the next parsed element in the tree, including children. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next returns your target for the HTML provided.
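The two hops described above can be sketched with bs4 on Python 3 as follows. This is a sketch under assumptions: it uses next_element, the bs4 name for this attribute, in place of the older .next spelling, the "html.parser" backend, and a trimmed copy of the question's HTML.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (an assumption for this sketch).
HTML = """
<table>
  <td class="MYCLASS">
    <!-- a comment -->
    <a href="xy">Text</a>
    <p>something</p>
    THIS IS MY TEXT
    <p>something else</p>
  </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Invariant reference point: the first <p> element.
first_p = soup.find_all("p")[0]

# First hop: the string inside the paragraph.
inside = first_p.next_element

# Second hop: the string that follows the closing </p>.
target = inside.next_element.strip()

print(inside)
print(target)
```

If the first paragraph were empty or contained nested tags, the second hop would land somewhere else, which is why the answer stresses choosing an invariant path for your actual documents.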
Answered by Gregory Kremler
With your own soup object:
soup.p.next_sibling.strip()
- you grab the <p> directly with soup.p* (this hinges on it being the first <p> in the parse tree)
- then use next_sibling on the tag object that soup.p returns, since the desired text is nested at the same level of the parse tree as the <p>
- .strip() is just a Python str method to remove leading and trailing whitespace
*otherwise just find the element using your choice of filter(s)
In the interpreter this looks something like:
In [4]: soup.p
Out[4]: <p>something</p>
In [5]: type(soup.p)
Out[5]: bs4.element.Tag
In [6]: soup.p.next_sibling
Out[6]: u'\n THIS IS MY TEXT\n '
In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString
In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'
In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode
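For the footnoted case, where the <p> you need is not the first one, a possible sketch with bs4 on Python 3 is to locate the anchor paragraph with a filter and then take its next sibling. The filter used here (matching the paragraph whose text is "something") is only an illustrative choice, and the HTML is a trimmed copy of the question's markup.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the HTML from the question (an assumption for this sketch).
HTML = """
<table>
  <td class="MYCLASS">
    <!-- a comment -->
    <a href="xy">Text</a>
    <p>something</p>
    THIS IS MY TEXT
    <p>something else</p>
  </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find the anchor <p> by its text instead of relying on it being the first <p>.
anchor = soup.find("p", string="something")

# The wanted text is the sibling that directly follows the anchor.
sibling = anchor.next_sibling

# Guard against the sibling being a tag or missing before stripping whitespace.
target = sibling.strip() if isinstance(sibling, NavigableString) else None

print(target)
```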
Answered by Naiswita
soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    hit = hit.text.strip()
    print hit
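Try this. It strips leading and trailing whitespace from the element's text before printing it. Note that hit.text still includes the text of the nested tags, so for the HTML in the question it prints more than just THIS IS MY TEXT.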

