BeautifulSoup in Python - getting the n-th tag of a type
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/14095511/
Asked by nasonfish
I have some HTML that contains many <table> elements.
I'm trying to get the information in the second table. Is there a way to do this without using soup.findAll('table')?
When I do use soup.findAll('table'), I get an error:
ValueError: too many values to unpack
Is there a way to get the n-th tag, or some other approach that does not require going through all the tables? Or should I see whether I can add titles to the tables (like <table title="things">)?
There are also headers (<h4>title</h4>) above each table, if that helps.
Thanks.
EDIT
Here's what I was thinking when I asked the question:
I was unpacking the result into two variables when it contained many more values. I thought this would just give me the first two items from the list, but of course it kept giving me the error mentioned above. I was unaware that the return value was a list; I thought it was some special object, and I was basing my code on my friends' code.
I thought this error meant there were too many tables on the page for it to handle, so I was asking for a way to do it without the method I was using. I probably should have stopped assuming things.
Now I know it returns a list, and I can use it in a for loop or get a value from it with soup.findAll('table')[someNumber]. I also learned what unpacking is and how to use it. Thanks to everyone who helped.
Hopefully that clears things up. Now that I know what I'm doing, my question makes less sense than it did when I asked it, so I thought I'd put a note here on what I was thinking.
EDIT 2:
This question is now pretty old, but I can see that I was never really clear about what I was doing.
If it helps anyone, I was attempting to unpack the findAll(...) results without knowing how many there were.
useless_table, table_i_want, another_useless_table = soup.findAll("table");
Since the page didn't always contain the number of tables I had guessed, and every value in the returned list has to be unpacked, I was receiving the ValueError:
ValueError: too many values to unpack
So I was looking for a way to grab the second table (or the table at any given index) from the returned list without running into errors about how many tables there were.
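The unpacking problem described above can be reproduced without any HTML at all. The following is a minimal sketch in which a plain list of placeholder strings stands in for the result of soup.findAll("table"); the variable names are made up for illustration.

```python
# Stand-in for the list that soup.findAll("table") returns; the strings
# are placeholders for Tag objects (an assumption for illustration).
tables = ["table0", "table1", "table2", "table3"]

# Plain unpacking needs exactly as many names as there are items,
# so two names against four items raises ValueError.
try:
    first, second = tables
except ValueError as exc:
    error_message = str(exc)

# Indexing picks one item regardless of how many there are.
second_by_index = tables[1]

# Star-unpacking (Python 3) collects any leftover items into a list.
first, second, *rest = tables

print(error_message)
print(second_by_index, second, rest)
```

Indexing is usually the simpler choice when only one item is needed; star-unpacking helps when the first few items are wanted by name and the rest can be ignored.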
Accepted answer by Martijn Pieters
To get the second table from the call soup.findAll('table'), treat the result as a list and index it:
secondtable = soup.findAll('table')[1]
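Indexing raises IndexError when the page has fewer tables than expected, so a small guard can help. The sketch below is one possible way to do it, not part of the original answer: nth_table is a hypothetical helper and the sample HTML is made up. It also uses the limit argument of find_all so the search stops once enough tables have been found.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this sketch.
html = "<table id='a'></table><table id='b'></table><table id='c'></table>"
soup = BeautifulSoup(html, "html.parser")


def nth_table(soup, n):
    """Return the table at zero-based index n, or None if there are too few."""
    # limit stops the search once n + 1 matches have been collected.
    tables = soup.find_all("table", limit=n + 1)
    return tables[n] if len(tables) > n else None


second_table = nth_table(soup, 1)   # the table with id "b"
missing_table = nth_table(soup, 5)  # None, because only three tables exist

print(second_table["id"])
print(missing_table)
```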
Answer by B.Mr.W.
Martijn Pieters's answer will indeed work. However, I once had nested table tags break my code when I simply took the second table in the list without paying attention.
When you call find_all and take the n-th element, there is a risk of getting the wrong one. It is safer to locate the first element you want and make sure the n-th element is actually a sibling of that element rather than a child.
- You can use find_next_sibling() to make your code more robust.
- You can find the parent first and then use find_all(recursive=False) to restrict the search to its direct children.
In case you need it, my code is below (it uses recursive=False).
from bs4 import BeautifulSoup

text = """
<html>
<head>
</head>
<body>
<table>
<p>Table1</p>
<table>
<p>Extra Table</p>
</table>
</table>
<table>
<p>Table2</p>
</table>
</body>
</html>
"""
# html.parser keeps the markup as written; other parsers may restructure it
soup = BeautifulSoup(text, "html.parser")

# Recursive search (the default) also finds the nested table
tables = soup.find('body').find_all('table')
print(len(tables))
print(tables[1].text.strip())
# 3
# Extra Table  # not the table you want, and there is no warning

# Only search the direct children of <body>
tables = soup.find('body').find_all('table', recursive=False)
print(len(tables))
print(tables[1].text.strip())
# 2
# Table2  # your desired output
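The answer mentions find_next_sibling() but only demonstrates recursive=False. The sibling-based variant might look like the sketch below; the sample HTML is made up and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample: the first table contains a nested table.
html = """
<body>
<table>
<tr><td>Table1
  <table><tr><td>Extra Table</td></tr></table>
</td></tr>
</table>
<table>
<tr><td>Table2</td></tr>
</table>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Plain indexing counts the nested table as the second match.
by_index = soup.find_all("table")[1]

# Start from the first table and move to the next table at the same level.
first_table = soup.find("table")
second_table = first_table.find_next_sibling("table")

print(by_index.get_text(strip=True))      # Extra Table
print(second_table.get_text(strip=True))  # Table2
```

find_next_sibling() returns None when there is no later sibling that matches, so the result should be checked before use on real pages.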
Answer by Sergei
Here's my version
# Import bs4
from bs4 import BeautifulSoup
# Read your HTML
#html_doc = your html
# Get BS4 object
soup = BeautifulSoup(html_doc, "lxml")
# Find next Sibling Table to H3 Header with text "THE GOOD STUFF"
the_good_table = soup.find(name='h3', text='THE GOOD STUFF').find_next_sibling(name='table')
# Find Second tr in your table
your_tr = the_good_table.findAll(name='tr')[1]
# Find Text Value of First td in your tr
your_string = your_tr.td.text
print(your_string)
Output:
'I WANT THIS STRING'
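Since html_doc is not defined in the snippet above, here is a self-contained sketch of the same header-based lookup. The sample HTML is invented for illustration, and it uses the built-in "html.parser" instead of "lxml" so that no extra parser needs to be installed; the None checks are an addition, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Invented sample document; the real page structure is an assumption.
html_doc = """
<h3>OTHER STUFF</h3>
<table><tr><td>skip me</td></tr></table>
<h3>THE GOOD STUFF</h3>
<table>
<tr><td>header row</td></tr>
<tr><td>I WANT THIS STRING</td><td>other cell</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Find the header by its exact text, then the next table at the same level.
header = soup.find("h3", string="THE GOOD STUFF")
the_good_table = header.find_next_sibling("table") if header else None

your_string = None
if the_good_table is not None:
    rows = the_good_table.find_all("tr")
    if len(rows) > 1:
        # First cell of the second row.
        your_string = rows[1].td.get_text(strip=True)

print(your_string)
```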

