BeautifulSoup in Python - getting the n-th tag of a type

Question

Asked by nasonfish

I have some HTML that contains many <table> elements.

I'm trying to get the information in the second table. Is there a way to do this without using soup.findAll('table')?

When I do use soup.findAll('table'), I get an error:

ValueError: too many values to unpack

Is there a way to get the n-th tag, or some other approach that does not require going through all the tables? Or should I see if I can add titles to the tables (like <table title="things">)?

There are also headers (<h4>title</h4>) above each table, if that helps.

Thanks.

EDIT

Here's what I was thinking when I asked the question:

I was unpacking the objects into two values, when there were many more. I thought this would just give me the first two things from the list, but of course, it kept giving me the error mentioned above. I was unaware the return value was a list and thought it was a special object or something and I was basing my code off of my friends'.

I was thinking this error meant there were too many tables on the page and that it couldn't handle all of them, so I was asking for a way to do it without the method I was using. I probably should have stopped assuming things.

Now I know it returns a list and I can use this in a for loop or get a value from it with soup.findAll('table')[someNumber]. I learned what unpacking was and how to use it, as well. Thanks everyone who helped.

Hopefully that clears things up. Now that I know what I'm doing, my question makes less sense than it did when I asked it, so I thought I'd put a note here on what I was thinking.

EDIT 2:

This question is now pretty old, but I still see that I was never really clear about what I was doing.

If it helps anyone, I was attempting to unpack the findAll(...) results without knowing how many of them there would be.

useless_table, table_i_want, another_useless_table = soup.findAll("table");

Since the page didn't always contain the number of tables I had guessed, and every value in the returned list has to be unpacked into a name, I was receiving the ValueError:

ValueError: too many values to unpack

So, I was looking for a way to grab the second table (or the table at whichever index) from the returned list without running into errors that depend on how many tables the page has.

Answer 1

Accepted answer by Martijn Pieters

soup.findAll('table') returns a list, so to get the second table, just index the result:

secondtable = soup.findAll('table')[1]
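
Indexing raises IndexError when the page has fewer tables than expected. A minimal sketch of a guarded lookup follows; the sample HTML and the helper name nth_table are assumptions made for illustration, and find_all is the bs4 spelling of findAll. The limit argument stops the search early, which speaks to the wish not to go through all the tables:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with three sibling tables.
html = """
<body>
<table><tr><td>first</td></tr></table>
<table><tr><td>second</td></tr></table>
<table><tr><td>third</td></tr></table>
</body>
"""
soup = BeautifulSoup(html, "html.parser")


def nth_table(soup, n):
    """Return the n-th <table> (0-based), or None if there are fewer."""
    # limit=n + 1 stops searching once enough tables have been found.
    found = soup.find_all("table", limit=n + 1)
    return found[n] if len(found) > n else None


second_table = nth_table(soup, 1)
missing_table = nth_table(soup, 10)  # None: the page has only three tables
print(second_table.get_text(strip=True))
```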

Answer 2

Answer by B.Mr.W.

Martijn Pieters' answer will indeed work. However, I have run into nested table tags that broke my code when I simply took the second table in the list without paying attention.

When you call find_all and take the n-th element, there is a chance you will get the wrong one. It is safer to locate the first element you want and make sure the n-th element is actually a sibling of that element rather than one of its children.

You can use find_next_sibling() to make your code safer, or you can find the parent first and then use find_all(recursive=False) to limit the search to its direct children.
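
As a rough illustration of the sibling approach, the sketch below uses made-up HTML with id attributes added only so the result is easy to check. find_next_sibling looks only at later siblings of the starting tag, so a table nested inside the first one is skipped:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the first table contains a nested table.
html = """
<body>
<table id="t1"><tr><td>Table1
  <table id="inner"><tr><td>Extra Table</td></tr></table>
</td></tr></table>
<table id="t2"><tr><td>Table2</td></tr></table>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("table")

# Plain indexing picks up the nested table as the "second" one.
by_index = soup.find_all("table")[1]

# The sibling search ignores descendants of `first`.
by_sibling = first.find_next_sibling("table")

print(by_index["id"], by_sibling["id"])
```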

In case you need it, my code is listed below (using recursive=False).

from bs4 import BeautifulSoup

text = """
<html>
    <head>
    </head>
    <body>
        <table>
            <p>Table1</p>
            <table>
                <p>Extra Table</p>
            </table>
        </table>
        <table>
            <p>Table2</p>
        </table>
    </body>
</html>
"""

# Name the parser explicitly; html.parser leaves the nesting as written.
soup = BeautifulSoup(text, "html.parser")

# The default search is recursive, so the nested table is counted too.
tables = soup.find('body').find_all('table')
print(len(tables))
print(tables[1].text.strip())
# 3
# Extra Table   <- not the table you want, and there is no warning

# recursive=False only looks at direct children of <body>.
tables = soup.find('body').find_all('table', recursive=False)
print(len(tables))
print(tables[1].text.strip())
# 2
# Table2   <- your desired output

Answer 3

Answer by Sergei

Here's my version

# Import bs4
from bs4 import BeautifulSoup

# Read your HTML
#html_doc = your html

# Get BS4 object
soup = BeautifulSoup(html_doc, "lxml")

# Find the next sibling table after the h3 header with text "THE GOOD STUFF"
# (string= is the current name of the older text= argument)
the_good_table = soup.find(name='h3', string='THE GOOD STUFF').find_next_sibling(name='table')

# Find Second tr in your table
your_tr = the_good_table.findAll(name='tr')[1]

# Find Text Value of First td in your tr
your_string = your_tr.td.text

print(your_string)

Output:

I WANT THIS STRING
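
The snippet above leaves html_doc undefined. A self-contained sketch of the same idea is shown below, with hypothetical sample HTML and with guards for a missing heading, table, or row. It uses "html.parser" so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; only the second heading marks the table we want.
html_doc = """
<h3>OTHER STUFF</h3>
<table><tr><td>a</td></tr><tr><td>b</td></tr></table>
<h3>THE GOOD STUFF</h3>
<table>
  <tr><td>header</td></tr>
  <tr><td>I WANT THIS STRING</td><td>not this</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Locate the heading by its exact text, then the first table that follows it.
heading = soup.find("h3", string="THE GOOD STUFF")
table = heading.find_next_sibling("table") if heading else None

# Take the first cell of the second row, if the table has that many rows.
rows = table.find_all("tr") if table else []
result = rows[1].td.get_text(strip=True) if len(rows) > 1 else None

print(result)
```

Anchoring on the heading text keeps the lookup stable when tables are added or removed elsewhere on the page.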
