Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/16835449/

Time: 2020-08-18 23:47:11  Source: igfitidea

Python BeautifulSoup extract text between element

python, beautifulsoup

Asked by ???nq ???lo?

I am trying to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

I tried it this way:

soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text

But this gives me the text of all the nested tags, plus the comment.

Can anyone help me get just "THIS IS MY TEXT" out of this?

Answered by Martijn Pieters

Use .children instead:

from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children 
    if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children 
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
... 




      THIS IS MY TEXT
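
For readers on Python 3 with bs4, the same filtering idea can be sketched roughly as follows. This is only a sketch, not the answerer's code: the helper name direct_text and the choice of the "html.parser" backend are assumptions made for illustration, and the HTML is copied from the question.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

HTML = """
<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>
"""


def direct_text(tag):
    """Join the strings that are direct children of tag (hypothetical helper).

    Nested tags are skipped because they are not NavigableString instances,
    and comments are skipped explicitly because Comment subclasses
    NavigableString.
    """
    parts = [
        str(child)
        for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    return "".join(parts).strip()


soup = BeautifulSoup(HTML, "html.parser")
for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    print(direct_text(hit))  # THIS IS MY TEXT
```

The final .strip() drops the whitespace-only strings that sit between the child tags, which is why the Python 2 output above starts with several blank lines.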

Answered by TerryA

You can use .contents:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print hit.contents[6].strip()
... 
THIS IS MY TEXT
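
To see where the index 6 comes from, it can help to list the direct children with their positions. The snippet below is a small sketch for Python 3 and bs4, assuming the "html.parser" backend and the HTML from the question; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

HTML = """
<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find("td", class_="MYCLASS")

# Print every direct child with its index and node type, so the position
# of the wanted string can be read off instead of guessed.
for index, child in enumerate(cell.contents):
    print(index, type(child).__name__, repr(str(child)))

# With this exact markup and parser, the target string sits at index 6.
target = cell.contents[6].strip()
print(target)
```

The index counts whitespace strings and the comment as well as tags, so it is likely to shift when the markup changes or a different parser is used.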

Answered by kiriloff

Learn more about how to navigate the parse tree in BeautifulSoup. A parse tree consists of Tags and NavigableStrings (such as THIS IS MY TEXT). An example:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

To move down the parse tree you have contents and string.

  • contents is an ordered list of the Tag and NavigableString objects contained within a page element

  • if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]

For the document above, this means you can get:

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

With several child nodes you get, for instance:

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

So here you can work with contents and pick out the item at the index you want.

You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance:

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
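
The snippets in this answer use the old BeautifulSoup 3 import and Python 2 syntax. A rough Python 3 / bs4 equivalent might look like the sketch below. It assumes the "html.parser" backend and, unlike the original doc, closes the <p> tags explicitly, because that backend does not close them for you; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Same sample page as above, but with the <p> tags closed explicitly.
doc = (
    "<html><head><title>Page title</title></head>"
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(doc, "html.parser")

# A tag with a single string child exposes it as .string and as .contents[0].
only_child = soup.b.string
first_item = soup.b.contents[0]
print(only_child, first_item)  # one one

# A tag with several children: .contents is an ordered list of nodes.
paragraph_children = [str(child) for child in soup.p.contents]
print(paragraph_children)  # ['This is paragraph ', '<b>one</b>', '.']

# Iterating over a tag walks its direct children.
body_children = [child.name for child in soup.body]
print(body_children)  # ['p', 'p']
```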

Answered by alireza sanaee

The BeautifulSoup documentation provides an example of removing objects from a document using the extract method. In the following example the aim is to remove all comments from the document:

Removing Elements

Once you have a reference to an element, you can rip it out of the tree with the extract method. This code removes all the comments from a document:

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                    <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
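
The quoted example is written for BeautifulSoup 3. With bs4 on Python 3 the same idea might be sketched like this, assuming the "html.parser" backend and a sample string with its tags closed explicitly:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3</b></a>",
    "html.parser",
)

# Collect every string node that is a Comment, then detach each one.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

print(soup)  # 1<a>2<b>3</b></a>
```

This only removes the comments. The text inside nested tags is still part of the tree, so it still shows up in .text afterwards.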

Answered by Bennett Brown

Short answer: soup.findAll('p')[0].next.next

Real answer: You need an invariant reference point from which you can get to your target.

You mention in your comment to Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree following that invariant path.

For example, in the HTML you provide in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') finds all paragraph elements, soup.findAll('p')[0] is the first paragraph element.

You could in this case use soup.find('p'), but soup.findAll('p')[n] is more general, since your actual scenario may need the 5th paragraph or something like that.

The next attribute is the next parsed element in the tree, including children. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next returns your target in the HTML provided.
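
The two hops can be checked step by step. The sketch below is for Python 3 and bs4, where the documented name for this attribute is next_element; it assumes the "html.parser" backend and the HTML from the question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

HTML = """
<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")
first_p = soup.find_all("p")[0]

# First hop: the next parsed element is the paragraph's own text.
inside = first_p.next_element
print(repr(str(inside)))  # 'something'

# Second hop: the element parsed after that text is the target string.
target = inside.next_element
print(target.strip())  # THIS IS MY TEXT
```

If the first paragraph contained nested tags, more hops would be needed, so this path is only invariant as long as the paragraph holds a single string.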

Answered by Gregory Kremler

With your own soup object:

soup.p.next_sibling.strip()

  1. you grab the <p> directly with soup.p* (this hinges on it being the first <p> in the parse tree)
  2. then use next_sibling on the tag object that soup.p returns, since the desired text is nested at the same level of the parse tree as the <p>
  3. .strip() is just a Python str method to remove leading and trailing whitespace

*otherwise just find the element using your choice of filter(s)

In the interpreter this looks something like:

In [4]: soup.p
Out[4]: <p>something</p>

In [5]: type(soup.p)
Out[5]: bs4.element.Tag

In [6]: soup.p.next_sibling
Out[6]: u'\n      THIS IS MY TEXT\n      '

In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString

In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'

In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode
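
If the <p> you want is not the first one in the whole document, one option is to anchor on the table cell first, as the footnote above suggests. A minimal Python 3 / bs4 sketch, assuming the "html.parser" backend and the HTML from the question; the variable names and the isinstance guard are additions for illustration:

```python
from bs4 import BeautifulSoup, NavigableString

HTML = """
<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Narrow the search to the cell, then take the first <p> inside it.
cell = soup.find("td", class_="MYCLASS")
first_p = cell.find("p")

# next_sibling can be a Tag or None, so check that it is a string first.
sibling = first_p.next_sibling
text = sibling.strip() if isinstance(sibling, NavigableString) else None
print(text)  # THIS IS MY TEXT
```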

Answered by Naiswita

soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
  hit = hit.text.strip()
  print hit

This will print: THIS IS MY TEXT. Try this.