How to convert a bs4.element.ResultSet to strings? Python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/20968562/

how to convert a bs4.element.ResultSet to strings? Python

python, beautifulsoup, runtime-error

Asked by samuraiexe

I have some simple code like:

    p = soup.find_all("p")
    paragraphs = []

    for x in p:
        paragraphs.append(str(x))

I am trying to convert a list I obtained from XML into strings. I want to keep each element with its original tags so I can reuse some of the text, which is why I am appending it like this. But the list contains over 6000 observations, so a recursion error occurs in the str call:

"RuntimeError: maximum recursion depth exceeded while calling a Python object"

“运行时错误:调用 Python 对象时超出了最大递归深度”

I read that you can raise the maximum recursion depth, but it's not wise to do so. My next idea was to split the conversion to strings into batches of 500, but I am sure that there has to be a better way to do this. Does anyone have any advice?
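
For reference, the recursion-limit change mentioned above (and rightly rejected) would look like the minimal sketch below; the value 5000 is an arbitrary assumption, and raising the limit only masks whatever is recursing too deeply:

    import sys

    print(sys.getrecursionlimit())  # the default limit is typically 1000
    sys.setrecursionlimit(5000)     # arbitrary higher value; setting this too
                                    # high risks overflowing the C stack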

Accepted answer by senshin

The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters <P, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's p[6053] for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.
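
If you want to pinpoint the offending index in your own copy of the file, a hedged diagnostic sketch (not from the original answer) is to convert each tag individually and catch the failure; catching RuntimeError covers older Pythons as well, since RecursionError (3.5+) is a subclass of it:

    # locate the first <p> whose string conversion exceeds the recursion limit
    for i, x in enumerate(soup.find_all('p')):
        try:
            str(x)
        except RuntimeError:
            print('first offending paragraph index:', i)
            break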

Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual <p> tags, try this:

    # boot out the last `<document>`, which contains the binary data
    soup.find_all('document')[-1].extract()

    p = soup.find_all('p')
    paragraphs = []
    for x in p:
        paragraphs.append(str(x))
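
A side note on the call above (not from the original answer): extract() detaches the tag from the tree and returns it. If you do not need the returned tag, decompose() destroys it in place and works just as well here:

    # equivalent cleanup that destroys the tag rather than returning it
    soup.find_all('document')[-1].decompose()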

Answered by mattcan

I believe the issue is that the BeautifulSoup object p is not built iteratively, so the method-call limit is reached before you can finish constructing p = soup.find_all('p'). Note that a RecursionError is similarly thrown when building soup.prettify().

For my solution, I used the re module to gather all <p>...</p> tags (see the code below). My final result was len(p) = 5571. This count is lower than yours because the regex conditions did not match any text within the binary graphic data.

    import re
    from urllib.request import urlopen

    url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'

    response = urlopen(url).read()

    # the inner group is non-capturing, so findall returns plain strings
    # (the text between <P and </P>) rather than tuples of groups
    p = re.findall(r'<P((?:.|\s)+?)</P>', str(response))  # (pattern, string)

    paragraphs = []
    for x in p:
        paragraphs.append(str(x))
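
One caveat with the snippet above (a note on the approach, not from the original answer): str(response) stringifies a bytes object, so the searched text begins with b' and contains escaped rather than real newlines. Decoding the bytes first is cleaner; a sketch, assuming the filing decodes as UTF-8:

    # decode the raw bytes so the regex runs over real text, not repr(bytes)
    text = response.decode('utf-8', errors='ignore')
    p = re.findall(r'<P((?:.|\s)+?)</P>', text)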