How to get the content of an HTML page in Python

Question

Asked by Yin Zhu

I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.

To be clear:

Input:

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

Putting it together:

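from bs4 import BeautifulSoup  # Beautiful Soup 4 (package name: bs4)
import re

def removeHtmlTags(page):
    # Match a whole tag, allowing '>' inside quoted attribute values
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parse the page and join every text node
    soup = BeautifulSoup(page, 'html.parser')
    return ''.join(soup.find_all(string=True))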

Related

Python HTML removal
Extracting text from HTML file using Python
What is a light python library that can eliminate HTML tags? (and only text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby's Sanitize)
RegEx match open tags except XHTML self-contained tags (the famous "don't use regex to parse HTML" rant)

Answer 1

Answered by Oddthinking

Parse the HTML with Beautiful Soup.

To get all the text, without the tags, try:

''.join(soup.findAll(text=True))
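
For a fuller picture, here is a minimal sketch of that one-liner applied to the sample page from the question. It assumes Beautiful Soup 4 (the bs4 package) and the built-in 'html.parser' backend; the names SAMPLE and visible_text are made up for this illustration.

```python
from bs4 import BeautifulSoup

# The sample input from the question (SAMPLE is a hypothetical name).
SAMPLE = """<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>"""


def visible_text(page):
    """Join every text node in the page and collapse runs of whitespace."""
    soup = BeautifulSoup(page, "html.parser")
    # find_all(string=True) is the bs4 spelling of findAll(text=True).
    text = "".join(soup.find_all(string=True))
    return " ".join(text.split())
```

Note that this collects every text node, so the contents of script and style elements and of comments would be included too; a real page may need those filtered out.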

Answer 2

Answered by the Tin Man

Personally, I use lxml because it's a swiss-army knife...

from lxml import html

print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())

This tells lxml to retrieve the page, locate the <body> tag, then extract and print all the text.
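
Since the question starts from a page that is already downloaded, the same idea can be sketched against a string instead of a URL. This assumes lxml is installed; SAMPLE and body_text are hypothetical names used only here.

```python
from lxml import html

# The sample input from the question (SAMPLE is a hypothetical name).
SAMPLE = """<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>"""


def body_text(page):
    """Return the text inside <body>, with whitespace collapsed."""
    root = html.fromstring(page)
    # Same XPath as in the answer: take the first <body> element.
    body = root.xpath("//body")[0]
    return " ".join(body.text_content().split())
```

Because only the <body> element is selected, the page title is not part of the result. For a file on disk, html.parse('page.html') should work the same way as it does with a URL.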

I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.

The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.

Answer 3

Answered by Pratik Deoghare

You want to look at "Extracting data from HTML documents" in Dive Into Python, because it does (almost) exactly what you want.
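
The chapter itself is not reproduced here. As a rough stand-in, and on the assumption that its approach is to subclass a parser and collect data as it streams by, the sketch below does that with the standard library's html.parser. TextExtractor and extract_text are hypothetical names, not code from the book.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside script/style elements.
        if not self._skip_depth:
            self.chunks.append(data)


def extract_text(page):
    """Feed the page to the parser and return whitespace-collapsed text."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

This needs no third-party packages, at the cost of writing the tag handling yourself.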

Answer 4

Answered by Christian Hausknecht

The best modules for this task are lxml or html5lib; Beautiful Soup is, IMHO, not worth using anymore. And for recursive structures, regular expressions are definitely the wrong method.

Answer 5

Answered by Ankit

If I understand your question correctly, this can be done simply with the urlopen function of urllib. Have a look at this function to open a URL and read the response, which will be the HTML code of that page.
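
A minimal sketch of that step, assuming Python 3, where urlopen lives in urllib.request. The function name fetch_html, the 10-second timeout, and the UTF-8 fallback are choices made for this illustration.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Open the URL and return the response body decoded to a string."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header if the server sent
        # one; falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html('http://someurl.at.domain') would return the raw HTML source, tags included, so the result still has to go through a parser to get the text a browser would display.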

Answer 6

Answered by Alexander Gessler

The quickest way to get a usable sample of what a browser would display is to remove any tags from the HTML and print the rest. This can, for example, be done using Python's re module.
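
As a quick-and-dirty sketch of that approach: the pattern below is a deliberately simple one chosen for this illustration (it breaks on a '>' inside an attribute value, and it keeps script and style contents), and strip_tags is a hypothetical name.

```python
import html
import re

# Anything from '<' up to the next '>' is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(page):
    """Remove tags, decode entities such as &amp;, collapse whitespace."""
    without_tags = TAG_RE.sub("", page)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())
```

On the sample page from the question this gives the requested output, but on arbitrary real-world HTML a parser is the safer choice.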
