Python: how to get the page title in Requests

Question

Asked by David542

What would be the simplest way to get the title of a page in Requests?

r = requests.get('http://www.imdb.com/title/tt0108778/')
# ? r.title
Friends (TV Series 1994–2004) - IMDb

Answer 1

Accepted answer by alecxe

You need an HTML parser to parse the HTML response and get the title tag's text.

Example using lxml.html:

>>> import requests
>>> from lxml.html import fromstring
>>> r = requests.get('http://www.imdb.com/title/tt0108778/')
>>> tree = fromstring(r.content)
>>> tree.findtext('.//title')
u'Friends (TV Series 1994\u20132004) - IMDb'

There are certainly other options, for example the mechanize library:

>>> import mechanize
>>> br = mechanize.Browser()
>>> br.open('http://www.imdb.com/title/tt0108778/')
>>> br.title()
'Friends (TV Series 1994\xe2\x80\x932004) - IMDb'

Which option to choose depends on what you are going to do next: parse the page to get more data, or interact with it (click buttons, submit forms, follow links, etc.).

Besides, you may want to use an API provided by IMDb instead of resorting to HTML parsing, see:

Example usage of the IMDbPY package:

>>> from imdb import IMDb
>>> ia = IMDb()
>>> movie = ia.get_movie('0108778')
>>> movie['title']
u'Friends'
>>> movie['series years']
u'1994-2004'

Answer 2

Answer by Greg

You could use BeautifulSoup to parse the HTML.

Install it with pip install beautifulsoup4:

>>> import requests
>>> r = requests.get('http://www.imdb.com/title/tt0108778/')
>>> import bs4
>>> html = bs4.BeautifulSoup(r.text, 'html.parser')
>>> html.title
<title>Friends (TV Series 1994–2004) - IMDb</title>
>>> html.title.text
u'Friends (TV Series 1994\u20132004) - IMDb'

Answer 3

Answer by Rahul Chawla

No need to import other libraries: the title can be sliced out of the response text that Requests returns.

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
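
The slicing above assumes that both tags are present and that the opening tag is written exactly as <title>. If either tag is missing, str.find returns -1 and the slice silently yields the wrong text. A minimal sketch of a guarded version follows; the function name extract_title_by_slicing and the sample strings are hypothetical and only for illustration.

```python
def extract_title_by_slicing(html):
    """Return the text between <title> and </title>, or None if a tag is missing."""
    open_tag = '<title>'
    start = html.find(open_tag)
    if start == -1:
        return None  # no opening tag found
    start += len(open_tag)
    # Search for the closing tag only after the opening tag.
    end = html.find('</title>', start)
    if end == -1:
        return None  # no closing tag found
    return html[start:end].strip()


sample = '<html><head><title> Handling the Keyboard </title></head></html>'
print(extract_title_by_slicing(sample))
print(extract_title_by_slicing('<html><body>no title here</body></html>'))
```

With a real response, the same function could be called as extract_title_by_slicing(n.text).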

Update after ZN13's comment:

>>> import re
>>> import requests
>>> n = requests.get('https://www.libsdl.org/release/SDL-1.2.15/docs/html/guideinputkeyboard.html')
>>> al = n.text
>>> d = re.search(r'<\W*title\W*(.*)</title', al, re.IGNORECASE)
>>> d.group(1)
u'Handling the Keyboard'

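This works whether or not extra non-alphabetic characters (such as whitespace) appear around the title tag name, and regardless of the tag's letter case. It does not match a title whose text spans several lines, because . does not match a newline unless re.DOTALL is set.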
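
If staying within the standard library is the goal, a regular expression is not the only choice. The sketch below uses the html.parser module that ships with Python, which copes with upper-case tags, attributes on the tag, and titles that span several lines. The names TitleParser and get_title are hypothetical, and the code targets Python 3.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = ''.join(self._parts).strip()
        return text or None


def get_title(html):
    """Return the page title from an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


page = '<html><head><TITLE lang="en">\n Handling the Keyboard\n</TITLE></head></html>'
print(get_title(page))
```

With a real response this would be called as get_title(n.text).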

Answer 4

Answer by Vitaly Zdanevich

Regex with lookbehind and lookahead:

re.search('(?<=<title>).+?(?=</title>)', mytext, re.DOTALL).group().strip()

re.DOTALL is needed because the title can contain a newline character (\n), which . does not match by default.
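
The effect of the flag can be checked on a small string without any network access. The following is a sketch only; the sample HTML and the variable names are made up for illustration.

```python
import re

TITLE_RE = r'(?<=<title>).+?(?=</title>)'

# A title that starts on a new line and spans two lines.
html = '<head><title>\nHandling\nthe Keyboard\n</title></head>'

# Without re.DOTALL, "." cannot match the newline right after <title>,
# so the search finds nothing.
without_dotall = re.search(TITLE_RE, html)

# With re.DOTALL, "." also matches newlines and the whole title is captured.
with_dotall = re.search(TITLE_RE, html, re.DOTALL)

print(without_dotall)                     # None
print(repr(with_dotall.group().strip()))  # 'Handling\nthe Keyboard'
```

Note that the lookbehind only matches a literal lower-case <title> with no attributes, so this pattern is best suited to pages whose markup is known in advance.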

Answer 5

Answer by u5602117

Use requests-html ("Pythonic HTML Parsing for Humans"):

from requests_html import HTMLSession

print(HTMLSession().get('http://www.imdb.com/title/tt0108778/').html.find('title', first=True).text)
