Python BeautifulSoup: parsing HTML – getting part of an href

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/41720896/

Beautifulsoup: parsing html – get part of href

python web-scraping beautifulsoup request

Asked by

I'm trying to parse

<td height="16" class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td>

for the 76561198134729239, and I can't figure out how to do it. What I tried:

import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("td", 
{
    "class":"listtable_1",
    "target":"_blank"
})
print(element.text)

Accepted answer by Martin Evans

There are many such entries in that HTML. To get all of them you could use the following:

import requests
from lxml import html
from bs4 import BeautifulSoup

r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")

for td in soup.findAll("td", class_="listtable_1"):
    for a in td.findAll("a", href=True, target="_blank"):
        print(a.text)

This would then return:

76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044
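
Since the question title asks for part of the href rather than the link text, the same loop can also read the href attribute and keep only its last path segment. A minimal sketch of that variant (assuming the profile links keep the .../profiles/<id> shape):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")

for td in soup.findAll("td", class_="listtable_1"):
    for a in td.findAll("a", href=True, target="_blank"):
        # The ID is the last path segment of the profile URL,
        # e.g. http://steamcommunity.com/profiles/76561198134729239
        print(a["href"].rstrip("/").split("/")[-1])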

Answer by MYGz

"target":"_blank"is a class of anchor tag awithin the tdtag. It's not a class of tdtag.

"target":"_blank"是标签a内的一类锚td标签。它不是一类td标签。

You can get it like so:

from bs4 import BeautifulSoup

html="""
<td height="16" class="listtable_1">
    <a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">
        76561198134729239
    </a>
</td>"""

soup = BeautifulSoup(html, 'html.parser')

print(soup.find('td', {'class': "listtable_1"}).find('a', {"target":"_blank"}).text)

Output:

76561198134729239
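
When the same chained lookup runs against a full page, either find() call can return None if nothing matches, which makes the trailing .text raise an AttributeError. A hedged sketch that guards against that (variable names are illustrative):

td = soup.find("td", {"class": "listtable_1"})
if td is not None:
    a = td.find("a", {"target": "_blank"})
    if a is not None:
        # strip() removes the surrounding whitespace from the link text
        print(a.text.strip())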

Answer by u6856342

"class":"listtable_1"belong to tdtag and target="_blank"belong to atag, you should not use them together.

"class":"listtable_1"属于td标签和target="_blank"属于a标签,你不应该一起使用它们。

You can use the td cells containing Steam Community as anchors and read the number from the cell that follows each of them.

Or use the URL: the href contains the information you need and is easy to find, so you can take the URL and split it on /:

import re  # required for the href pattern match below

for a in soup.find_all('a', href=re.compile(r'steamcommunity')):
    num = a['href'].split('/')[-1]
    print(num)

Code:

import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
for td in soup.find_all('td', string="Steam Community"):
    num = td.find_next_sibling('td').text
    print(num)

Output:

76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044
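
For completeness, a runnable end-to-end version of the href-splitting alternative shown above (a sketch, assuming the page still serves the same markup):

import re

import requests
from bs4 import BeautifulSoup

r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")

# Keep only anchors whose href points at steamcommunity.com,
# then take the last path segment of each href as the ID.
for a in soup.find_all("a", href=re.compile(r"steamcommunity")):
    print(a["href"].split("/")[-1])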

Answer by alecxe

As others mentioned, you are trying to check attributes of different elements in a single find(). Instead, you can chain find() calls as MYGz suggested, or use a single CSS selector:

soup.select_one("td.listtable_1 a[target=_blank]").get_text()

If you need to locate multiple elements this way, use select():

for elm in soup.select("td.listtable_1 a[target=_blank]"):
    print(elm.get_text())
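
Put together with the original request, the selector approach might look like this (a sketch; select() simply yields an empty list if the markup changes):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")

# One CSS selector covers both constraints: the td class and the anchor's target attribute.
for elm in soup.select("td.listtable_1 a[target=_blank]"):
    print(elm.get_text(strip=True))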