Python: using BeautifulSoup to extract text without tags

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/23380171/

Date: 2020-08-19 02:53:28  Source: igfitidea

Tags: python, web-scraping, beautifulsoup

Asked by myloginid

My webpage looks like this:

<p>
  <strong class="offender">YOB:</strong> 1987<br/>
  <strong class="offender">RACE:</strong> WHITE<br/>
  <strong class="offender">GENDER:</strong> FEMALE<br/>
  <strong class="offender">HEIGHT:</strong> 5'05''<br/>
  <strong class="offender">WEIGHT:</strong> 118<br/>
  <strong class="offender">EYE COLOR:</strong> GREEN<br/>
  <strong class="offender">HAIR COLOR:</strong> BROWN<br/>
</p>

I want to extract the info for each individual and get YOB:1987, RACE:WHITE, etc...

What I tried is:

subc = soup.find_all('p')
subc1 = subc[1]
subc2 = subc1.find_all('strong')

But this gives me only the labels (YOB:, RACE:, etc.), not the values that follow them.

Is there a way to get the data in YOB:1987, RACE:WHITE format?

Accepted answer by shaktimaan

Just loop through all the <strong> tags and use next_sibling to get the text node that follows each one. Like this:

for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)

Demo:

from bs4 import BeautifulSoup

html = '''
<p>
  <strong class="offender">YOB:</strong> 1987<br />
  <strong class="offender">RACE:</strong> WHITE<br />
  <strong class="offender">GENDER:</strong> FEMALE<br />
  <strong class="offender">HEIGHT:</strong> 5'05''<br />
  <strong class="offender">WEIGHT:</strong> 118<br />
  <strong class="offender">EYE COLOR:</strong> GREEN<br />
  <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''

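# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')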

for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)

This gives you:

YOB:  1987
RACE:  WHITE
GENDER:  FEMALE
HEIGHT:  5'05''
WEIGHT:  118
EYE COLOR:  GREEN
HAIR COLOR:  BROWN
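
To get the compact YOB:1987 form the question asks for, the same loop can strip the whitespace and join each label to its value. This is only a minimal sketch: the shortened HTML sample and the `pairs` list are illustrative, and it assumes each value is a plain text node directly after its <strong> tag.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<p>
  <strong class="offender">YOB:</strong> 1987<br/>
  <strong class="offender">RACE:</strong> WHITE<br/>
  <strong class="offender">GENDER:</strong> FEMALE<br/>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []  # illustrative name, not from the original answer
for strong_tag in soup.find_all("strong"):
    sibling = strong_tag.next_sibling
    # next_sibling can be None or another tag; only use it when it is plain text
    value = sibling.strip() if isinstance(sibling, NavigableString) else ""
    pairs.append(strong_tag.get_text(strip=True) + value)

print(", ".join(pairs))
```

If a label has no text after it, the sketch keeps the bare label rather than raising an error.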

Answer by 0605002

I think you can get it using subc1.text.

>>> html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">GENDER:</strong> FEMALE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
    <strong class="offender">WEIGHT:</strong> 118<br />
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
    <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.text)


YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
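
If you would rather have one string per field than a single block of text, the text can be split into lines and trimmed. A rough sketch, assuming (as in the sample) that each field sits on its own line in the HTML source; the `lines` name is illustrative:

```python
from bs4 import BeautifulSoup

html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# <br> contributes no text, so the split relies on the newlines in the source
lines = [line.strip() for line in p.get_text().splitlines() if line.strip()]
print(lines)
```

If the markup were all on one line, this split would not separate the fields, and walking the tags would be the safer route.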

Or, if you want to explore the structure, you can use .contents:

>>> p = soup.find('p')
>>> from pprint import pprint
>>> pprint(p.contents)
['\n',
 <strong class="offender">YOB:</strong>,
 ' 1987',
 <br/>,
 '\n',
 <strong class="offender">RACE:</strong>,
 ' WHITE',
 <br/>,
 '\n',
 <strong class="offender">GENDER:</strong>,
 ' FEMALE',
 <br/>,
 '\n',
 <strong class="offender">HEIGHT:</strong>,
 " 5'05''",
 <br/>,
 '\n',
 <strong class="offender">WEIGHT:</strong>,
 ' 118',
 <br/>,
 '\n',
 <strong class="offender">EYE COLOR:</strong>,
 ' GREEN',
 <br/>,
 '\n',
 <strong class="offender">HAIR COLOR:</strong>,
 ' BROWN',
 <br/>,
 '\n']

Then pick the items you need out of the list, pairing each label with the text that follows it:

>>> data = dict(zip([x.text for x in p.contents[1::4]], [x.strip() for x in p.contents[2::4]]))
>>> pprint(data)
{'EYE COLOR:': 'GREEN',
 'GENDER:': 'FEMALE',
 'HAIR COLOR:': 'BROWN',
 'HEIGHT:': "5'05''",
 'RACE:': 'WHITE',
 'WEIGHT:': '118',
 'YOB:': '1987'}
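
The fixed stride of 4 depends on the exact whitespace and <br/> layout of the paragraph. As a possible alternative, here is a small sketch that uses stripped_strings, which skips whitespace-only text nodes, so labels and values simply alternate. It assumes every label has a non-empty value; a missing value would shift the pairing.

```python
from bs4 import BeautifulSoup

html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# stripped_strings yields: 'YOB:', '1987', 'RACE:', 'WHITE', ...
strings = list(p.stripped_strings)

# Even positions are labels, odd positions are the values that follow them
data = dict(zip(strings[0::2], strings[1::2]))
print(data)
```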