Python: how to get the page title in Requests

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/26812470/

Time: 2020-08-19 01:02:47  Source: igfitidea

How to get page title in requests

python html html-parsing

Asked by David542

What would be the simplest way to get the title of a page in Requests?

import requests
r = requests.get('http://www.imdb.com/title/tt0108778/')
# ? r.title
# Desired result: Friends (TV Series 1994–2004) - IMDb

Accepted answer by alecxe

You need an HTML parser to parse the HTML response and get the text of the title tag:

Example using lxml.html:

>>> import requests
>>> from lxml.html import fromstring
>>> r = requests.get('http://www.imdb.com/title/tt0108778/')
>>> tree = fromstring(r.content)
>>> tree.findtext('.//title')
u'Friends (TV Series 1994\u20132004) - IMDb'
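
If installing a third-party parser is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration, not part of the original answer: the names TitleParser and get_title are made up here, and the sketch assumes Python 3 and that only the first title element matters.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so gather all of them.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        text = "".join(self._chunks).strip()
        return text or None


def get_title(html_text):
    """Return the page title as a string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

With Requests, this would be called as get_title(r.text) on a response r.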

There are other options as well, for example the mechanize library:

>>> import mechanize
>>> br = mechanize.Browser()
>>> br.open('http://www.imdb.com/title/tt0108778/')
>>> br.title()
'Friends (TV Series 1994\xe2\x80\x932004) - IMDb'

Which option to choose depends on what you are going to do next: parse the page to get more data, or perhaps interact with it by clicking buttons, submitting forms, following links, etc.

Alternatively, you may want to use an API provided by IMDb instead of resorting to HTML parsing.

Example usage of the IMDbPY package:

>>> from imdb import IMDb
>>> ia = IMDb()
>>> movie = ia.get_movie('0108778')
>>> movie['title']
u'Friends'
>>> movie['series years']
u'1994-2004'

Answer by Greg

You could use BeautifulSoup to parse the HTML.

Install it with pip install beautifulsoup4:

>>> import requests
>>> r = requests.get('http://www.imdb.com/title/tt0108778/')
>>> import bs4
>>> html = bs4.BeautifulSoup(r.text, 'html.parser')
>>> html.title
<title>Friends (TV Series 1994–2004) - IMDb</title>
>>> html.title.text
u'Friends (TV Series 1994\u20132004) - IMDb'

Answer by Rahul Chawla

No need to import other libraries. The title can be pulled out of the response text returned by Requests with plain string operations.

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'

Update after ZN13's comment:

>>> import re
>>> import requests
>>> n = requests.get('https://www.libsdl.org/release/SDL-1.2.15/docs/html/guideinputkeyboard.html')
>>> al = n.text
>>> d = re.search(r'<\W*title\W*(.*)</title', al, re.IGNORECASE)
>>> d.group(1)
u'Handling the Keyboard'

This works whether or not extra non-alphabetic characters (such as whitespace) surround the title tag name.
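
As a rough sketch of a more tolerant variant of the regex approach, the helper below also accepts attributes on the title tag, ignores case, lets the title span several lines, and decodes HTML entities. The name extract_title and the exact pattern are assumptions made for illustration; they are not part of Requests or of the original answer, and a regex is still no substitute for a real HTML parser on messy pages.

```python
import html
import re

# Hypothetical pattern: opening tag with optional attributes, lazily captured
# body, and a closing tag that may contain trailing whitespace.
_TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(page_text):
    """Return the first page title found in page_text, or None."""
    match = _TITLE_RE.search(page_text)
    if match is None:
        return None
    # Collapse runs of whitespace, then decode entities such as &amp;.
    return html.unescape(" ".join(match.group(1).split()))
```

With a Requests response n, this would be called as extract_title(n.text).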

Answer by Vitaly Zdanevich

Regex with lookbehind and lookahead:

re.search('(?<=<title>).+?(?=</title>)', mytext, re.DOTALL).group().strip()

re.DOTALL is needed because the title can contain a newline character (\n), which "." does not match by default.
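
To illustrate the effect of the flag, here is a small sketch that runs the same pattern with and without re.DOTALL. The sample string mytext is made up for the demonstration and does not come from a real page.

```python
import re

# Made-up sample page whose title spans several lines.
mytext = "<head><title>\n  Handling the Keyboard\n</title></head>"
pattern = r"(?<=<title>).+?(?=</title>)"

without_flag = re.search(pattern, mytext)
with_flag = re.search(pattern, mytext, re.DOTALL)

print(without_flag)               # no match: "." stops at the first newline
print(with_flag.group().strip())  # the title text with surrounding whitespace removed
```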

Answer by u5602117

Use requests-html ("Pythonic HTML Parsing for Humans"):

from requests_html import HTMLSession

print(HTMLSession().get('http://www.imdb.com/title/tt0108778/').html.find('title', first=True).text)