Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/20522820/

Time: 2020-08-18 20:39:41  Source: igfitidea

How to get tbody from a table with Python Beautiful Soup?

Tags: python, web-scraping, beautifulsoup

Asked by JPC

I'm trying to scrape Year & Winners (first & second columns) from the "List of finals matches" table (the second table) at http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals. I'm using the code below:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())
soup.findAll('table')[0].tbody.findAll('tr')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

With the above code, I was able to get the first & third columns just fine. But when I use the same code with http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals, it cannot find tbody as an element of the table, even though I can see the tbody when I inspect the element.

url = "http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals"
soup = BeautifulSoup(urllib2.urlopen(url).read())

print soup.findAll('table')[2]

soup.findAll('table')[2].tbody.findAll('tr')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

Here's the error I got:

'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-150-fedd08c6da16> in <module>()
      7 # print soup.findAll('table')[2]
      8 
----> 9 soup.findAll('table')[2].tbody.findAll('tr')
     10 for row in soup.findAll('table')[0].tbody.findAll('tr'):
     11     first_column = row.findAll('th')[0].contents

AttributeError: 'NoneType' object has no attribute 'findAll'

'

Answered by Derek Litz

If you inspect the page with the browser's inspect tool, the browser inserts the tbody tags.

The source code may or may not contain them. I suggest looking at the source view if you really want to know.
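
One way to check the source without opening the browser is to search the raw HTML text for an explicit tbody tag. The following is a minimal sketch, assuming Python 3; the helper name `has_tbody` and the two sample strings are hypothetical and stand in for the text of a downloaded page.

```python
import re


def has_tbody(html):
    """Return True if the raw HTML text contains an explicit <tbody> tag."""
    return re.search(r"<tbody[\s>]", html, flags=re.IGNORECASE) is not None


# Hypothetical snippets standing in for a downloaded page source.
raw_without = "<table><tr><th>1930</th><td>Uruguay</td></tr></table>"
raw_with = "<table><tbody><tr><th>1930</th><td>Uruguay</td></tr></tbody></table>"

print(has_tbody(raw_without))  # False
print(has_tbody(raw_with))     # True
```

If the check returns False for the page source, the tbody seen in the inspector was probably added by the browser.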

Either way, you do not need to traverse to the tbody. Simply use:

soup.findAll('table')[0].findAll('tr') should work.
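
To illustrate, here is a small sketch of the same idea, assuming Python 3 and the bs4 package (where `findAll` is spelled `find_all`). The inline markup is hypothetical and only imitates a table written without tbody; it is not the actual Wikipedia source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a table written without <tbody>.
html = """
<table>
  <tr><th>Year</th><th>Winners</th><th>Score</th></tr>
  <tr><th>1930</th><td>Uruguay</td><td>4-2</td></tr>
  <tr><th>1934</th><td>Italy</td><td>2-1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

# html.parser does not invent a <tbody>, so this lookup yields None.
# Calling .find_all on it is what raises the AttributeError.
print(table.tbody)

rows = []
for tr in table.find_all("tr"):   # go straight from the table to its rows
    cells = tr.find_all("td")
    if not cells:                 # the header row has only <th> cells
        continue
    year = tr.find("th").get_text(strip=True)
    winner = cells[0].get_text(strip=True)
    rows.append((year, winner))

print(rows)
```

Searching for tr from the table itself works whether or not a tbody sits in between, because `find_all` looks through all descendants.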

Answered by GMPrazzoli

url = "http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals"
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Iterate over the rows directly, without going through tbody
for tr in soup.findAll('table')[2].findAll('tr'):
    pass  # get data from tr here

And then search for what you need in the table :)

Answered by Rohit Yadav

Run the code below directly.

tr_elements = soup.find_all('table')[2].find_all('tr')

By doing this, you can access all the <tr> elements. You will have to use a for loop to go over them (there are other possible ways to iterate too). Don't try to find the tbody; it gets added by default.
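
As a rough illustration of the iteration, the sketch below walks `tr_elements` with a plain for loop and then with a list comprehension. It assumes Python 3 and the bs4 package; the three-table markup is hypothetical and exists only so that index 2 is valid.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three tables; the one of interest is the third.
html = """
<table><tr><td>a</td></tr></table>
<table><tr><td>b</td></tr></table>
<table>
  <tr><th>1930</th><td>Uruguay</td></tr>
  <tr><th>1934</th><td>Italy</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tr_elements = soup.find_all("table")[2].find_all("tr")

# Plain for loop over the rows
for tr in tr_elements:
    print(tr.get_text(" ", strip=True))

# The same traversal as a list comprehension
texts = [tr.get_text(" ", strip=True) for tr in tr_elements]
print(texts)
```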

Note:

If you have a problem getting to the desired tag, remove the tags that come before it with the .decompose() method.
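
A minimal sketch of that approach, assuming Python 3 and the bs4 package; the markup and the `id` values are hypothetical. `decompose()` removes a tag and its contents from the tree, so the next search no longer sees it.

```python
from bs4 import BeautifulSoup

# Hypothetical page: an unwanted table precedes the one we are after.
html = """
<table id="infobox"><tr><td>sidebar</td></tr></table>
<table id="finals"><tr><th>1930</th><td>Uruguay</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove the first (unwanted) table from the tree entirely.
soup.find("table").decompose()

# The table we wanted is now the first one found.
first_table = soup.find("table")
print(first_table["id"])
```

Note that `decompose()` modifies the soup in place, so parse the page again if the removed tags are needed later.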
