Python: extracting content from specific unclosed meta tags using BeautifulSoup

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18134318/


Extracting contents from specific meta tags that are not closed using BeautifulSoup

python, beautifulsoup

Asked by tcash21

I'm trying to parse out content from specific meta tags. Here's the structure of the meta tags. The first two are closed with a slash, but the rest don't have any closing tags. As soon as I try to get the 3rd meta tag, the entire contents between the <head> tags are returned. I've also tried soup.findAll(text=re.compile('keyword')), but that does not return anything since keyword is an attribute of the meta tag.


<meta name="csrf-param" content="authenticity_token"/>
<meta name="csrf-token" content="OrpXIt/y9zdAFHWzJXY2EccDi1zNSucxcCOu8+6Mc9c="/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
<meta content='en_US' http-equiv='Content-Language'>
<meta content='c2y_K2CiLmGeet7GUQc9e3RVGp_gCOxUC4IdJg_RBVo' name='google-site-verification'>
<meta content='initial-scale=1.0,maximum-scale=1.0,width=device-width' name='viewport'>
<meta content='' name='google'>
<meta content="Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included." name='description'>

Here's the code:


import csv
import re
import sys
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req3 = Request("https://angel.co/uber", headers={'User-Agent': 'Mozilla/5.0'})
page3 = urlopen(req3).read()
soup3 = BeautifulSoup(page3)

## This returns the entire web page since the META tags are not closed
desc = soup3.findAll(attrs={"name":"description"}) 

Accepted answer by sihrc

Edited: added a case-insensitive regex as suggested by @Albert Chen.


Python 3 Edit:


from bs4 import BeautifulSoup
import re
import urllib.request

page3 = urllib.request.urlopen("https://angel.co/uber").read()
soup3 = BeautifulSoup(page3)

desc = soup3.findAll(attrs={"name": re.compile(r"description", re.I)}) 
print(desc[0]['content'])
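If the target site rejects urllib's default user agent (the question passes a browser-like User-Agent header for exactly this reason), a hedged variant of the same lookup that reuses that header might look like the sketch below; whether angel.co actually requires it is an assumption:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

# send a browser-like User-Agent, as in the question's own code
req = Request("https://angel.co/uber", headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read())
desc = soup.findAll(attrs={"name": re.compile(r"description", re.I)})
print(desc[0]['content'])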

Although I'm not sure it will work for every page:


from bs4 import BeautifulSoup
import re
import urllib

page3 = urllib.urlopen("https://angel.co/uber").read()
soup3 = BeautifulSoup(page3)

desc = soup3.findAll(attrs={"name": re.compile(r"description", re.I)}) 
print(desc[0]['content'].encode('utf-8'))

Yields:


Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included.

Answer by shobhit_mittal

Matching on the name attribute is case-sensitive, so we need to look for both 'Description' and 'description'.


Case 1: 'Description' on Flipkart.com


Case 2: 'description' on Snapdeal.com


from bs4 import BeautifulSoup
import requests

url = 'https://www.flipkart.com'
page3 = requests.get(url)
soup3 = BeautifulSoup(page3.text)
# look for the capitalized variant first, then fall back to lowercase
desc = soup3.find(attrs={'name': 'Description'})
if desc is None:
    desc = soup3.find(attrs={'name': 'description'})
try:
    print(desc['content'])
except Exception as e:
    print('%s (%s)' % (e, type(e)))
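As a side note, a single case-insensitive lookup (the same idea as the regex in the accepted answer) avoids the double lookup. A minimal sketch, assuming the soup3 object from the snippet above:

import re

# one lookup that matches 'description', 'Description', 'DESCRIPTION', ...
desc = soup3.find(attrs={'name': re.compile(r'^description$', re.I)})
print(desc['content'] if desc else 'no description meta tag found')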

Answer by ingo

soup3 = BeautifulSoup(page3, 'html5lib')

XHTML requires meta tags to be closed properly; HTML5 does not. The html5lib parser is more "permissive".

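For illustration, a minimal, self-contained sketch of this approach applied to a shortened version of the unclosed meta tags from the question (it assumes the html5lib package is installed, e.g. via pip install html5lib):

from bs4 import BeautifulSoup

# a fragment like the one in the question: the last two meta tags are never closed
html = """
<head>
<meta name="csrf-param" content="authenticity_token"/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
<meta content='A short description of the page.' name='description'>
</head>
"""

# html5lib tolerates the unclosed tags the way a browser would
soup = BeautifulSoup(html, 'html5lib')
desc = soup.find('meta', attrs={'name': 'description'})
print(desc['content'])  # -> A short description of the page.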

Answer by user799188

Try (based on this blog post)


from bs4 import BeautifulSoup
...
desc = ""
for meta in soup.findAll("meta"):
    # missing attributes default to '' so .lower() is always safe
    metaname = meta.get('name', '').lower()
    metaprop = meta.get('property', '').lower()
    # accept name="description" (any case) or a property containing "description", e.g. og:description
    if 'description' == metaname or metaprop.find("description") > 0:
        desc = meta['content'].strip()

Tested against the following variants:


  • <meta name="description" content="blah blah" /> (Example)
  • <meta id="MetaDescription" name="DESCRIPTION" content="blah blah" /> (Example)
  • <meta property="og:description" content="blah blah" /> (Example)

Used BeautifulSoup version 4.4.1

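To see the loop in action, here is a small usage sketch against one of the variants listed above; the HTML fragment is invented for the example:

from bs4 import BeautifulSoup

# invented fragment using the og:description variant from the list above
html = '<head><meta property="og:description" content="blah blah" /></head>'
soup = BeautifulSoup(html, 'html.parser')

desc = ""
for meta in soup.findAll("meta"):
    metaname = meta.get('name', '').lower()
    metaprop = meta.get('property', '').lower()
    if 'description' == metaname or metaprop.find("description") > 0:
        desc = meta['content'].strip()

print(desc)  # -> blah blah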

Answer by Paolinux

As suggested by ingo, you could use a less strict parser such as html5lib.


soup3 = BeautifulSoup(page3, 'html5lib')

but be sure to have the python-html5lib parser available on the system (for example, installed with pip install html5lib).


Answer by Albert Chen

I think using a regexp here would be better. Example:


import re
import requests
from bs4 import BeautifulSoup

resp = requests.get('url')
soup = BeautifulSoup(resp.text)
desc = soup.find_all(attrs={"name": re.compile(r'Description', re.I)})
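A brief usage note, assuming the request above succeeds: find_all returns a list of matching tags, so the content attribute can be read from each one (with an empty-string fallback if it is missing):

for tag in desc:
    print(tag.get('content', ''))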