Python: extracting content from specific unclosed meta tags using BeautifulSoup

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18134318/


Extracting contents from specific meta tags that are not closed using BeautifulSoup

python, beautifulsoup

Asked by tcash21

I'm trying to parse out content from specific meta tags. Here's the structure of the meta tags. The first two are closed with a slash, but the rest don't have any closing tags. As soon as I try to get the 3rd meta tag, the entire contents between the <head> tags are returned. I've also tried soup.findAll(text=re.compile('keyword')), but that does not return anything since keyword is an attribute of the meta tag.


<meta name="csrf-param" content="authenticity_token"/>
<meta name="csrf-token" content="OrpXIt/y9zdAFHWzJXY2EccDi1zNSucxcCOu8+6Mc9c="/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
<meta content='en_US' http-equiv='Content-Language'>
<meta content='c2y_K2CiLmGeet7GUQc9e3RVGp_gCOxUC4IdJg_RBVo' name='google-site-verification'>
<meta content='initial-scale=1.0,maximum-scale=1.0,width=device-width' name='viewport'>
<meta content='' name='google'>
<meta content="Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included." name='description'>

Here's the code:


import csv
import re
import sys
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req3 = Request("https://angel.co/uber", headers={'User-Agent': 'Mozilla/5.0'})
page3 = urlopen(req3).read()
soup3 = BeautifulSoup(page3)

## This returns the entire web page since the META tags are not closed
desc = soup3.findAll(attrs={"name":"description"}) 

Accepted answer by sihrc

Edited: added a case-insensitive regex as suggested by @Albert Chen.


Python 3 Edit:


from bs4 import BeautifulSoup
import re
import urllib.request

page3 = urllib.request.urlopen("https://angel.co/uber").read()
soup3 = BeautifulSoup(page3)

desc = soup3.findAll(attrs={"name": re.compile(r"description", re.I)}) 
print(desc[0]['content'])
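If the target site rejects urllib's default user agent (the question passes a browser-like User-Agent header for exactly this reason), a hedged variant of the same lookup that reuses that header might look like the sketch below; whether angel.co actually requires it is an assumption:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

# send a browser-like User-Agent, as in the question's own code
req = Request("https://angel.co/uber", headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read())
desc = soup.findAll(attrs={"name": re.compile(r"description", re.I)})
print(desc[0]['content'])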

Although I'm not sure it will work for every page:


from bs4 import BeautifulSoup
import re
import urllib

page3 = urllib.urlopen("https://angel.co/uber").read()
soup3 = BeautifulSoup(page3)

desc = soup3.findAll(attrs={"name": re.compile(r"description", re.I)}) 
print(desc[0]['content'].encode('utf-8'))

Yields:


Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included.

Answer by shobhit_mittal

Matching on the name attribute is case-sensitive, so we need to look for both 'Description' and 'description'.


Case 1: 'Description' on Flipkart.com


Case 2: 'description' on Snapdeal.com


from bs4 import BeautifulSoup
import requests

url = 'https://www.flipkart.com'
page3 = requests.get(url)
soup3 = BeautifulSoup(page3.text)
# look for the capitalized variant first, then fall back to lowercase
desc = soup3.find(attrs={'name': 'Description'})
if desc is None:
    desc = soup3.find(attrs={'name': 'description'})
try:
    print(desc['content'])
except Exception as e:
    print('%s (%s)' % (e, type(e)))
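As a side note, a single case-insensitive lookup (the same idea as the regex in the accepted answer) avoids the double lookup. A minimal sketch, assuming the soup3 object from the snippet above:

import re

# one lookup that matches 'description', 'Description', 'DESCRIPTION', ...
desc = soup3.find(attrs={'name': re.compile(r'^description$', re.I)})
print(desc['content'] if desc else 'no description meta tag found')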

Answer by ingo

soup3 = BeautifulSoup(page3, 'html5lib')

XHTML requires meta tags to be closed properly; HTML5 does not. The html5lib parser is more "permissive".

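For illustration, a minimal, self-contained sketch of this approach applied to a shortened version of the unclosed meta tags from the question (it assumes the html5lib package is installed, e.g. via pip install html5lib):

from bs4 import BeautifulSoup

# a fragment like the one in the question: the last two meta tags are never closed
html = """
<head>
<meta name="csrf-param" content="authenticity_token"/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
<meta content='A short description of the page.' name='description'>
</head>
"""

# html5lib tolerates the unclosed tags the way a browser would
soup = BeautifulSoup(html, 'html5lib')
desc = soup.find('meta', attrs={'name': 'description'})
print(desc['content'])  # -> A short description of the page.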

Answer by user799188

Try (based on this blog post)


from bs4 import BeautifulSoup
...
desc = ""
for meta in soup.findAll("meta"):
    # missing attributes default to '' so .lower() is always safe
    metaname = meta.get('name', '').lower()
    metaprop = meta.get('property', '').lower()
    # accept name="description" (any case) or a property containing "description", e.g. og:description
    if 'description' == metaname or metaprop.find("description") > 0:
        desc = meta['content'].strip()

Tested against the following variants:


  • <meta name="description" content="blah blah" /> (Example)
  • <meta id="MetaDescription" name="DESCRIPTION" content="blah blah" /> (Example)
  • <meta property="og:description" content="blah blah" /> (Example)

Used BeautifulSoup version 4.4.1

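To see the loop in action, here is a small usage sketch against one of the variants listed above; the HTML fragment is invented for the example:

from bs4 import BeautifulSoup

# invented fragment using the og:description variant from the list above
html = '<head><meta property="og:description" content="blah blah" /></head>'
soup = BeautifulSoup(html, 'html.parser')

desc = ""
for meta in soup.findAll("meta"):
    metaname = meta.get('name', '').lower()
    metaprop = meta.get('property', '').lower()
    if 'description' == metaname or metaprop.find("description") > 0:
        desc = meta['content'].strip()

print(desc)  # -> blah blah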

Answer by Paolinux

As suggested by ingo, you could use a less strict parser such as html5lib.


soup3 = BeautifulSoup(page3, 'html5lib')

but be sure to have the python-html5lib parser available on the system (for example, installed with pip install html5lib).


Answer by Albert Chen

I think using a regexp here would be better. Example:


import re
import requests
from bs4 import BeautifulSoup

resp = requests.get('url')
soup = BeautifulSoup(resp.text)
desc = soup.find_all(attrs={"name": re.compile(r'Description', re.I)})
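A brief usage note, assuming the request above succeeds: find_all returns a list of matching tags, so the content attribute can be read from each one (with an empty-string fallback if it is missing):

for tag in desc:
    print(tag.get('content', ''))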