Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36768068/

Get meta tag content property with BeautifulSoup and Python

Tags: python, html, web-scraping, beautifulsoup

Asked by the_t_test_1

I am trying to use Python and Beautiful Soup to extract the content attribute of the tags below:

<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />

BeautifulSoup loads the page just fine and finds other things (the code below also grabs the article id from the id attribute hidden in the source), but I don't know the correct way to search the HTML for these bits. I've tried variations of find and findAll to no avail. At present the code iterates over a list of URLs...

#!/usr/bin/env python
# -*- coding: utf-8 -*-

#importing the libraries
from urllib import urlopen
from bs4 import BeautifulSoup

def get_data(page_no):
    webpage = urlopen('http://superfunevents.com/?p=' + str(page_no)).read()
    soup = BeautifulSoup(webpage, "lxml")
    for tag in soup.find_all("article") :
        id = tag.get('id')
        print id
# the hard part that doesn't work - I know this example is well off the mark!        
    title = soup.find("og:title", "content")
    print (title.get_text())
    url = soup.find("og:url", "content")
    print (url.get_text())
# end of problem

for i in range (1,100):
    get_data(i)

If anyone can help me sort out the bit that finds the og:title and og:url content, that'd be fantastic!

Answered by alecxe

Provide the meta tag name as the first argument to find(). Then use keyword arguments to match specific attributes:

title = soup.find("meta",  property="og:title")
url = soup.find("meta",  property="og:url")

print(title["content"] if title else "No meta title given")
print(url["content"] if url else "No meta url given")

The if/else checks here are optional if you know that the title and url meta properties will always be present.
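
To make the lookup concrete, here is a minimal, self-contained sketch that runs the same find() call against the two meta tags from the question, supplied as an inline string rather than a downloaded page. The helper name get_meta_content, the HTML constant, and the choice of the built-in "html.parser" are assumptions made for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real page would be downloaded first.
HTML = """
<html><head>
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
</head><body></body></html>
"""


def get_meta_content(soup, prop, default=None):
    """Return the content attribute of <meta property=prop>, or default.

    Hypothetical helper: it wraps the find() call and the missing-tag check.
    """
    tag = soup.find("meta", property=prop)
    # find() returns None when no tag matches; .get() also covers a tag
    # that exists but has no content attribute.
    return tag.get("content", default) if tag else default


soup = BeautifulSoup(HTML, "html.parser")
print(get_meta_content(soup, "og:title"))
print(get_meta_content(soup, "og:url"))
print(get_meta_content(soup, "og:image", "No meta image given"))
```

Folding the None check into one helper keeps the calling code short when several properties are read from the same page.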

Answered by Hackaholic

Try this:

soup = BeautifulSoup(webpage, "lxml")  # name the parser explicitly, as in the question
for tag in soup.find_all("meta"):
    # tag.get() returns None when the attribute is missing
    if tag.get("property") == "og:title":
        print(tag.get("content"))
    elif tag.get("property") == "og:url":
        print(tag.get("content"))