Python BeautifulSoup: How to extract all <li> from a list of <ul>s that contains some nested <ul>s?

Question

Asked by danneu

My source code looks like:

<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>

I'd like all the "li" tags following the first "h3" tag and stopping at the next "h2" tag, including all nested li tags.

firstH3 = soup.find('h3')

correctly finds the place I'd like to start.

firstH3 = soup.find('h3') # Start here
uls = []
for nextSibling in firstH3.findNextSiblings():
    if nextSibling.name == 'h2':
        break
    if nextSibling.name == 'ul':
        uls.append(nextSibling)

gives me a list of ULs, each with LI contents that I need.

EXCERPT OF THE "uls" LIST:

<ul>
...
    <li><i><a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks">Agent Cody Banks</a></i> (2003)</li>
    <li><i><a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London">Agent Cody Banks 2: Destination London</a></i> (2004)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch">Air Bud: Seventh Inning Fetch</a></i> (2002)</li>
            <li><i><a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back">Air Bud: Spikes Back</a></i> (2003)</li>
            <li><i><a href="/wiki/Air_Buddies" title="Air Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee">Akeelah and the Bee</a></i> (2006)</li>
...
</ul>

But I'm unsure of where to go from here. I'm a newbie programmer trying to jump into Python by building a script that scrapes http://en.wikipedia.org/wiki/2000s_in_film and extracts a list of "Movie Title (Year)".

Update:

Final Code:

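lis = []
for ul in uls:
    for li in ul.findAll('li'):
        # Skip an LI that wraps a nested UL; its nested LIs are matched on their own.
        if li.find('ul'):
            continue
        lis.append(li)

for li in lis:
    print(li.text)

The if/continue skips the LIs that contain ULs, since their nested LIs are also returned by findAll() and their text would otherwise be duplicated. (A break here would also drop every remaining LI in that UL, so continue is used instead.)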

Print output is now:

102 Dalmatians(2000)
10th & Wolf(2006)
11:14(2006)
12:08 East of Bucharest(2006)
13 Going on 30(2004)
1408(2007)
...

Thanks

Answer 1

Accepted answer by jfs

.findAll() works for nested li elements:

for ul in uls:
    for li in ul.findAll('li'):
        print(li)

Output:

<li>List items</li>
<li>Etc...</li>
<li>List items</li>
<li>Nested list items</li>
<li>Nested list items</li>
<li>List items</li>
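For reference, here is a minimal end-to-end sketch that combines the sibling walk from the question with the nested lookup above. It assumes the bs4 package (where findAll and findNextSiblings are spelled find_all and find_next_siblings) and the built-in "html.parser"; the trailing <ul> after the <h2> is not in the question and is added only to show that the walk stops at the <h2>.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus one hypothetical <ul> after the <h2>
# to show that collection stops there.
html = """
<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>
<ul><li>Not collected</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")
first_h3 = soup.find("h3")  # start here

# Collect every <ul> sibling after the first <h3>, stopping at the next <h2>.
uls = []
for sibling in first_h3.find_next_siblings():
    if sibling.name == "h2":
        break
    if sibling.name == "ul":
        uls.append(sibling)

# find_all() searches all descendants, so nested <li> elements are included.
items = [li.get_text() for ul in uls for li in ul.find_all("li")]
print(items)
```

Because find_all() is recursive by default, no special handling is needed to reach the nested items in this sample.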

Answer 2

Answer by zachwill

A list comprehension could work, too.

lis = [li for ul in uls for li in ul.findAll('li')]
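When an LI wraps a nested UL, as in the "Air Bud series" excerpt from the question, the comprehension returns both the wrapper and its children, so the children's text appears twice. One possible variation, sketched here under the assumption that only leaf items are wanted, adds a filter to the comprehension. The markup is a shortened version of the excerpt, and the bs4 names (find_all, get_text) and "html.parser" are assumptions about the environment.

```python
from bs4 import BeautifulSoup

# Shortened version of the "uls" excerpt from the question (title attributes omitted).
html = """
<ul>
    <li><i><a href="/wiki/Agent_Cody_Banks">Agent Cody Banks</a></i> (2003)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee">Akeelah and the Bee</a></i> (2006)</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the top-level <ul>; the nested one is reached through find_all() below.
uls = soup.find_all("ul", recursive=False)

# Keep an <li> only if it does not itself contain a nested <ul>.
titles = [
    li.get_text()
    for ul in uls
    for li in ul.find_all("li")
    if li.find("ul") is None
]

for title in titles:
    print(title)
```

The wrapper item ("Air Bud series:") is dropped, while each film inside it is kept once.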
