How can I iterate through elements in an HTML tree using BeautifulSoup in Python and produce output that maintains the relative position of each element?

Source: Stack Overflow, http://stackoverflow.com/questions/13736554/ (licensed under CC BY-SA 4.0; attribute the original authors when reusing).

Tags: python, html-parsing, web-scraping, beautifulsoup, jsoup

Asked by Christian

I have this code, which does what I need using Jsoup in Java:

Elements htmlTree = doc.body().select("*");
Elements menuElements = new Elements();

for (Element element : htmlTree) {
    if (element.hasClass("header"))
        menuElements.add(element);
    if (element.hasClass("name"))
        menuElements.add(element);
    if (element.hasClass("quantity"))
        menuElements.add(element);
}

I want to do the same thing but in Python using BeautifulSoup. An example tree of the HTML I'm trying to scrape follows:

<div class="header"> content </div>
     <div class="name"> content </div>
     <div class="quantity"> content </div>
     <div class="name"> content </div>
     <div class="quantity"> content </div>
<div class="header"> content2 </div>
     <div class="name"> content2 </div>
     <div class="quantity"> content2 </div>
     <div class="name"> content2 </div>
     <div class="quantity"> content2 </div>

etc.

Basically, I want the output to preserve the relative position of each element. How would I go about doing that using Python and BeautifulSoup?

EDIT:

This is the Python code I have so far. It's very naive, but maybe it helps:

output = []

for e in soup :
  if e["class"] == "pickmenucolmenucat" :
    output.append(e)
  if e["class"] == "pickmenucoldispname" :
    output.append(e)
  if e["class"] == "pickmenucolportions" :
    output.append(e)

Accepted answer by jfs

To find all <div> elements whose class attribute is one of a given list of values:

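#!/usr/bin/env python
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

with open('input.xml', 'rb') as file:
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(file, 'html.parser')

# A list passed to class_ matches elements that have any of those classes;
# find_all returns the matches in document order.
elements = soup.find_all("div", class_="header name quantity".split())
print("\n".join("{} {}".format(el['class'], el.get_text()) for el in elements))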

Output

['header']  content 
['name']  content 
['quantity']  content 
['name']  content 
['quantity']  content 
['header']  content2 
['name']  content2 
['quantity']  content2 
['name']  content2 
['quantity']  content2 
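
For readers who want to try this without an input file, here is a minimal self-contained sketch of the same idea. The inline `HTML` string is a shortened copy of the question's sample, and the names `HTML`, `WANTED`, and `rows` are made up for illustration. It assumes `beautifulsoup4` is installed and uses the standard-library `html.parser` backend.

```python
from bs4 import BeautifulSoup  # assumes: pip install beautifulsoup4

# Shortened copy of the sample markup from the question (illustrative only).
HTML = """
<div class="header"> content </div>
<div class="name"> content </div>
<div class="quantity"> content </div>
<div class="header"> content2 </div>
<div class="name"> content2 </div>
<div class="quantity"> content2 </div>
"""

# Hypothetical name for the list of classes we want to keep.
WANTED = ["header", "name", "quantity"]

soup = BeautifulSoup(HTML, "html.parser")

# find_all walks the tree in document order, so the matches keep
# their positions relative to one another.
elements = soup.find_all("div", class_=WANTED)

# el["class"] is a list because class is a multi-valued attribute.
rows = [(el["class"][0], el.get_text(strip=True)) for el in elements]
for cls, text in rows:
    print(cls, text)
```

Because the matches come back in the order they appear in the markup, each header is followed by the name and quantity entries that belong to it.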

There are also other methods that allow you to search and traverse HTML elements.
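
As one possible illustration of those other methods, the sketch below uses `select()` with a CSS selector list and then groups each name and quantity under the header that precedes it. The grouping step and the `menu` structure are assumptions added for this example and are not part of the original answer. The inline `HTML` is again a shortened copy of the question's sample.

```python
from bs4 import BeautifulSoup  # assumes: pip install beautifulsoup4

# Shortened copy of the sample markup from the question (illustrative only).
HTML = """
<div class="header"> content </div>
<div class="name"> content </div>
<div class="quantity"> content </div>
<div class="header"> content2 </div>
<div class="name"> content2 </div>
<div class="quantity"> content2 </div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A comma-separated selector list matches elements that satisfy any of
# the selectors; the results come back in document order.
elements = soup.select("div.header, div.name, div.quantity")

# Hypothetical grouping: attach each name/quantity to the latest header.
menu = []
for el in elements:
    text = el.get_text(strip=True)
    if "header" in el["class"]:
        menu.append({"header": text, "items": []})
    elif menu:  # skip entries that appear before any header
        menu[-1]["items"].append((el["class"][0], text))

for section in menu:
    print(section["header"], section["items"])
```

The grouping relies only on document order, so it works for flat markup like the sample, where headers and their entries are siblings, not nested elements.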