Python: how to get CData out of BeautifulSoup

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2032172/

Date: 2020-11-03 23:37:32  Source: igfitidea

How can I grab CData out of BeautifulSoup?

Tags: python, screen-scraping, beautifulsoup, cdata

Asked by hary wilke

I'm scraping a website that has a structure similar to the following. I'd like to be able to grab the info out of the CData block.

I'm using BeautifulSoup to pull other info off the page, so a solution that works with it would help keep my learning curve down, as I'm a Python novice. Specifically, I want to get at the two different types of data hidden in the CData statement. The first is just text; I'm pretty sure I can throw a regex at it and get what I need. For the second type, if I could drop the data that has HTML elements into its own BeautifulSoup, I could parse that.

I'm just learning Python and BeautifulSoup, so I'm struggling to find the magical incantation that will give me just the CData by itself.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">  
<head>  
<title>
   Cows and Sheep
  </title>
</head>
<body>
 <div id="main">
  <div id="main-precontents">
   <div id="main-contents" class="main-contents">
    <script type="text/javascript">
       //<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b><br></br>
       <!--ts-->
       get it now<table width="100%"><tr><td>NOW</td><th>NOW</th></tr></table><span>244 Cows</span><br></br>67 leather<br></br>68 Brains
       <!--yy-->
       <span class="q0">Cow Bonus: +9 Cow Power</span><br></br>Sheep Power 60 / 60<br></br>Sheep 88<br></br>Cow Level 555</td></tr></table>
       <!--?5695:5:40:45-->
       ';
        //]]>
      </script>
     </div>
     </div>
    </div>
 </body>
</html>

Answered by Alex Martelli

BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:

import BeautifulSoup

txt = '''<foobar>We have
       <![CDATA[some data here]]>
       and more.
       </foobar>'''

soup = BeautifulSoup.BeautifulSoup(txt)
for cd in soup.findAll(text=True):
  if isinstance(cd, BeautifulSoup.CData):
    print 'CData contents: %r' % cd

In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
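
As a minimal sketch of that subtree idea, here is a Python 3 version using the `bs4` package with the built-in `html.parser`. The sample markup, the `other` div, and the variable names are made up for illustration, and it assumes the CDATA section sits directly in the markup (not inside a `<script>` element, whose text may be treated as one opaque string):

```python
from bs4 import BeautifulSoup, CData

# Hypothetical sample: one CDATA section outside the target div, one inside it.
html = '''<div id="other"><![CDATA[ignored]]></div>
<div id="main-contents"><foobar>We have <![CDATA[some data here]]> and more.</foobar></div>'''

soup = BeautifulSoup(html, "html.parser")

# Restrict the search to the subtree rooted at the div with id "main-contents".
main = soup.find("div", id="main-contents")

# Walk every string node under that div and keep only the CData ones.
found = [node.strip() for node in main.find_all(string=True)
         if isinstance(node, CData)]

print(found)
```

Only the CDATA section under `main-contents` is collected; the one in the other div is never visited.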

Answered by iMath

One thing to be careful of when grabbing CData with BeautifulSoup: do not use the lxml parser.

By default, the lxml parser strips CDATA sections from the tree and replaces them with their plain text content.

Trying it with html.parser:

>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        aaaaaaaaaaaaa
    ]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>> 

Answered by RJ Regenold

You could try this:

from BeautifulSoup import BeautifulSoup

# source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
s = soup.findAll('script')
cdata = s[0].contents[0]

That should give you the contents of cdata.

Update

This may be a little cleaner:

from BeautifulSoup import BeautifulSoup
import re

# source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
cdata = soup.find(text=re.compile("CDATA"))

Just personal preference, but I like the bottom one a little better.
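
Once the script text is in hand, the two kinds of data the question mentions could be separated roughly as follows. This is only a sketch under assumptions: `script_text` is a shortened stand-in for the real script contents (obtained, for example, with one of the snippets above), the regular expressions are tailored to that sample, and it uses Python 3 with the `bs4` package.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the text of the <script> element in the question.
script_text = """
//<![CDATA[
var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};
_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};
_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b>
<span class="q0">Cow Bonus: +9 Cow Power</span></td></tr></table>';
//]]>
"""

# 1. Drop the CDATA markers, including the JavaScript comment slashes before them.
payload = re.sub(r"(//\s*)?<!\[CDATA\[|(//\s*)?\]\]>", "", script_text).strip()

# 2. Plain-text values: a regex is enough.
names = re.findall(r"(\w+name_enus):'([^']*)'", payload)

# 3. Embedded markup: pull out the quoted string and give it its own soup.
match = re.search(r"cowmeat_enus\s*=\s*'(.*?)';", payload, re.DOTALL)
fragment = BeautifulSoup(match.group(1), "html.parser")
bonus = fragment.find("span", class_="q0").get_text()

print(names)
print(bonus)
```

The regex in step 3 stops at the first `';` it meets, so it would need adjusting if the embedded HTML itself could contain that sequence.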

Answered by Aben

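import re
from bs4 import BeautifulSoup

# content is the markup to parse
soup = BeautifulSoup(content)
for x in soup.find_all('item'):
    # Remove the literal CDATA markers rather than individual characters
    print(re.sub(r'<!\[CDATA\[|\]\]>', '', x.string))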

Answered by newimprovement

For anyone using BeautifulSoup4, Alex Martelli's solution works, but with these changes:

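from bs4 import BeautifulSoup, CData

# txt is the sample markup from Alex Martelli's answer
soup = BeautifulSoup(txt, 'html.parser')  # html.parser keeps CData nodes
for cd in soup.findAll(text=True):
  if isinstance(cd, CData):
    print('CData contents: %r' % cd)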