Original source: http://stackoverflow.com/questions/4362981/
Notice: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?
Asked by danneu
My source code looks like:
<h3>Header3 (Start here)</h3>
<ul>
<li>List items</li>
<li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
<li>List items</li>
<ul>
<li>Nested list items</li>
<li>Nested list items</li></ul>
<li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>
I'd like all the "li" tags following the first "h3" tag and stopping at the next "h2" tag, including all nested li tags.
firstH3 = soup.find('h3')
correctly finds the place I'd like to start.
firstH3 = soup.find('h3')  # Start here
uls = []
for nextSibling in firstH3.findNextSiblings():
    if nextSibling.name == 'h2':
        break
    if nextSibling.name == 'ul':
        uls.append(nextSibling)
gives me a list of ULs, each with LI contents that I need.
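For readers who want to run the sibling walk end to end, here is a minimal sketch using the sample markup from the question. It assumes Python 3 with the `bs4` package and the built-in `html.parser`, and it uses `find_next_siblings`, the current spelling of `findNextSiblings`. The extra list after the `h2` is my own addition, included only to show that the loop stops there.

```python
from bs4 import BeautifulSoup

# Sample markup from the question. The <ul> after the <h2> is an
# assumption added here to show that the walk stops at the <h2>.
html = """
<h3>Header3 (Start here)</h3>
<ul>
<li>List items</li>
<li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
<li>List items</li>
<ul>
<li>Nested list items</li>
<li>Nested list items</li></ul>
<li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>
<ul>
<li>Should not be collected</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
first_h3 = soup.find("h3")  # start here

uls = []
# With no arguments, find_next_siblings() yields only tag siblings,
# so the whitespace between elements is skipped.
for sibling in first_h3.find_next_siblings():
    if sibling.name == "h2":
        break  # stop at the next h2
    if sibling.name == "ul":
        uls.append(sibling)

print(len(uls))
```

Each entry in `uls` is a full tag object, so the nested lists remain reachable from it.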
EXCERPT OF THE "uls" LIST:
<ul>
...
<li><i><a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks">Agent Cody Banks</a></i> (2003)</li>
<li><i><a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London">Agent Cody Banks 2: Destination London</a></i> (2004)</li>
<li>Air Bud series:
<ul>
<li><i><a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup">Air Bud: World Pup</a></i> (2000)</li>
<li><i><a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch">Air Bud: Seventh Inning Fetch</a></i> (2002)</li>
<li><i><a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back">Air Bud: Spikes Back</a></i> (2003)</li>
<li><i><a href="/wiki/Air_Buddies" title="Air Buddies">Air Buddies</a></i> (2006)</li>
</ul>
</li>
<li><i><a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee">Akeelah and the Bee</a></i> (2006)</li>
...
</ul>
But I'm unsure of where to go from here. I'm a newbie programmer trying to jump into Python by building a script that scrapes http://en.wikipedia.org/wiki/2000s_in_film and extracts a list of "Movie Title (Year)".
Update:
Final Code:
lis = []
for ul in uls:
    for li in ul.findAll('li'):
        if li.find('ul'):
            break
        lis.append(li)
for li in lis:
    print li.text.encode("utf-8")
The if/break throws out the LIs that contain ULs, since their nested LIs would otherwise be duplicated.
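One caveat worth checking: as far as I can tell, `break` leaves the inner loop entirely, so any items that come after the first parent LI in the same UL would be dropped as well. If the intent is only to skip the parent LI, `continue` may be closer to it. Below is a hedged Python 3 sketch of that variant. It assumes the `bs4` package, and the markup is a simplified version of the excerpt above with the `<i>` and `<a>` wrappers removed.

```python
from bs4 import BeautifulSoup

# Simplified version of the "uls" excerpt (an assumption: the <i>/<a>
# wrappers are removed to keep the example short).
html = """
<ul>
<li>Agent Cody Banks (2003)</li>
<li>Air Bud series:
<ul>
<li>Air Bud: World Pup (2000)</li>
<li>Air Buddies (2006)</li>
</ul>
</li>
<li>Akeelah and the Bee (2006)</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
uls = [soup.find("ul")]

titles = []
for ul in uls:
    for li in ul.find_all("li"):
        if li.find("ul"):
            # Skip only the parent item; its nested items are visited
            # separately because find_all() descends into the tree.
            continue
        titles.append(li.get_text(strip=True))

for title in titles:
    print(title)
```

With `break` in place of `continue`, this sample would yield only the first title, because the loop ends at the "Air Bud series" item.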
Print output is now:
- 102 Dalmatians(2000)
- 10th & Wolf(2006)
- 11:14(2006)
- 12:08 East of Bucharest(2006)
- 13 Going on 30(2004)
- 1408(2007)
- ...
Thanks
Accepted answer by jfs
.findAll() works for nested li elements:
for ul in uls:
    for li in ul.findAll('li'):
        print(li)
Output:
<li>List items</li>
<li>Etc...</li>
<li>List items</li>
<li>Nested list items</li>
<li>Nested list items</li>
<li>List items</li>
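The nested items show up because the search descends into the whole subtree by default. As a small illustration, the sketch below compares the default with `recursive=False`, which looks only at direct children. It assumes Python 3 with the `bs4` package and the built-in `html.parser`, and it reuses the second list from the question.

```python
from bs4 import BeautifulSoup

# Second list from the question, with the nested <ul> placed directly
# inside the outer <ul> as in the original markup.
html = """
<ul>
<li>List items</li>
<ul>
<li>Nested list items</li>
<li>Nested list items</li></ul>
<li>List items</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul")

# Default: search all descendants, so nested <li> tags are included.
all_items = outer.find_all("li")

# recursive=False: only direct children of the outer <ul>.
direct_items = outer.find_all("li", recursive=False)

print(len(all_items), len(direct_items))
```

A different parser may restructure markup like this, so the counts could differ with `lxml` or `html5lib`.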
Answer by zachwill
A list comprehension could work, too.
lis = [li for ul in uls for li in ul.findAll('li')]

