Python BeautifulSoup soup.prettify() gives strange output
Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/20906416/
BeautifulSoup soup.prettify() gives strange output
Asked by aburak
I'm trying to parse a web site that I'm going to use later in my Django project. To do that, I'm using urllib2 and BeautifulSoup 4. However, I couldn't get what I want: the output of the BeautifulSoup object is weird. I tried different pages and it worked (the output was normal), so I thought the problem was this particular page. But when my friend tried to do the same thing, he got normal output. I couldn't manage to figure out the problem.
This is the website I'm going to parse: http://kafemud.bilkent.edu.tr/monu_tr.html
This is an example of the weird output after the command "soup.prettify()":
t d B G C O L O R = " # 9 9 0 4 0 4 " w i d t h = " 3 " > i m g S R C = " 1 p . g i f " A L T B O R D E R = " 0 " h e i g h t = " 1 " w i d t h = " 3 " > / t d > \n / t r > \n t r > \n t d c o l s p a n = " 3 " B G C O L O R = " # 9 9 0 4 0 4 " w i d t h = " 6 0 0 " h e i g h t = " 3 " > i m g s r c = " 1 p . g i f " w i d t h = " 6 0 0 " \n h e i g h t = " 1 " > / t d > \n / t r > \n / t a b l e > \n / c e n t e r > / d i v > \n \n p > &n b s p ; &n b s p ; &n b s p ; &n b s p ; / p > \n / b o d y > \n / h t m l >\n </p>\n </body>\n</html>'
Answered by Hooked
Here is a minimal example that does work for me, including the snippet of HTML that you have a problem with. It's hard to tell without your code, but my guess is that you did something like ' '.join(A.split()) somewhere (see the short join demo after the output below).
import urllib2, bs4

url = "http://kafemud.bilkent.edu.tr/monu_tr.html"
req = urllib2.urlopen(url)        # fetch the page (Python 2's urllib2)
raw = req.read()                  # raw, undecoded bytes of the response
soup = bs4.BeautifulSoup(raw)     # let bs4 pick the parser and detect the encoding
print soup.prettify().encode('utf-8')
Giving:
....
<td bgcolor="#990404" width="3">
<img alt="" border="0" src="1p.gif" width="3"/>
</td>
<td bgcolor="#FFFFFF" valign="TOP">
<div align="left">
<table align="left" border="0" cellpadding="10" cellspacing="0" valign="TOP" width="594">
<tr>
<td align="left" valign="top">
<table align="left" border="0" cellpadding="0" cellspacing="0" class="icerik" width="574">
....
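As for the ' '.join(...) guess above: purely as an illustration (A here is a hypothetical string, not the asker's actual code), joining over a raw string spaces out every single character, which matches the symptom in the question, whereas joining over split() tokens only collapses whitespace:

A = '<td BGCOLOR="#990404" width="3">'

print " ".join(A)          # -> < t d   B G C O L O R = " # 9 9 0 4 0 4 " ...  (every character spaced out)
print " ".join(A.split())  # -> <td BGCOLOR="#990404" width="3">  (whitespace merely collapsed)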
Answered by Tobias
Possibly you and your friend use different parsers. BeautifulSoup will use the parser it considers "best", and thus prefers lxml for speed reasons (if it is installed). If you are using a recent version of Python (and the latest version of its included parser), there are cases that are handled better by BeautifulSoup(text, 'html.parser'); this is the case, for example, when there are unescaped < characters (instead of &lt;) in the text content.
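A minimal sketch of pinning the parser explicitly, so you and your friend are guaranteed to compare the same thing (html.parser ships with Python; lxml is a separate install; the sample markup is made up to show a bare, unescaped < in text content):

import bs4

html = '<p>1 < 2 &amp; 3 > 2</p>'   # text content containing a bare, unescaped '<'

# Pin the parser instead of letting bs4 pick whichever it finds "best";
# the stdlib html.parser and the third-party lxml can build different trees
# from slightly malformed markup like this.
print bs4.BeautifulSoup(html, 'html.parser').prettify()
print bs4.BeautifulSoup(html, 'lxml').prettify()    # only works if lxml is installed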
Answered by Theo Emms
This looks like your XML is coming in with an encoding that BeautifulSoup isn't expecting. My guess is that your XML is in UTF-16 and BeautifulSoup is reading it as UTF-8. Python offers the .encode and .decode methods for switching between encodings. Something like
myXmlStr.decode("utf-16").encode("utf-8")   # decode from the real encoding, then re-encode as UTF-8
would probably solve your problem if the issue is the encoding of your incoming XML. I'm new to Beautiful Soup myself, but a quick Google search suggests that if the problem is instead the encoding of the output, prettify accepts an encoding parameter:
soup.prettify("utf-16")
Without more information I can't give you a clearer answer - but hopefully this points you in a helpful direction.
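Along the same lines, bs4 also lets you state the input encoding up front via its from_encoding argument instead of converting the string yourself. A minimal sketch, assuming (as above, purely a guess) that the page really is UTF-16, and using a hypothetical local file name:

import bs4

raw = open('page.html', 'rb').read()    # undecoded bytes of the document

# Tell BeautifulSoup what the bytes are encoded as instead of letting it guess;
# 'utf-16' here is only this answer's assumption about the input.
soup = bs4.BeautifulSoup(raw, 'html.parser', from_encoding='utf-16')
print soup.prettify().encode('utf-8')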

