Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2416823/

Time: 2020-11-04 00:36:44  Source: igfitidea

How to get the content of an HTML page in Python

python, html, parsing

Asked by Yin Zhu

I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.

To be clear:

Input:

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

Putting it together:

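import re

# Beautiful Soup 4 package; the original post used the legacy BeautifulSoup 3 import.
from bs4 import BeautifulSoup


def removeHtmlTags(page):
    # Regex approach: delete anything that looks like a tag, allowing quoted attribute values.
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)


def removeHtmlTags2(page):
    # Parser approach: collect every text node in the parsed document.
    soup = BeautifulSoup(page, 'html.parser')
    return ''.join(soup.find_all(string=True))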

Answered by Oddthinking

Parse the HTML with Beautiful Soup.

To get all the text, without the tags, try:

''.join(soup.findAll(text=True))

Answered by the Tin Man

Personally, I use lxml because it's a Swiss Army knife...

from lxml import html

print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())

This tells lxml to retrieve the page, locate the <body> tag, then extract and print all the text.

I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.
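
To make that risk concrete, here is a minimal sketch (not part of the original answer) that compares a deliberately naive tag-stripping regex with the standard library's html.parser. The sample markup and the names strip_tags_naive, strip_tags_parser and TextCollector are illustrative assumptions; the regex is simpler than the one in the question, which does account for quoted attributes.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: the attribute value contains a '>' character.
SAMPLE = '<p title="a > b">Hello <b>world</b></p>'


def strip_tags_naive(markup):
    # Naive regex: treats everything from '<' to the next '>' as a tag.
    return re.sub(r'<[^>]*>', '', markup)


class TextCollector(HTMLParser):
    """Collects the text nodes reported by the parser."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_parser(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return ''.join(collector.parts)


print(repr(strip_tags_naive(SAMPLE)))   # part of the attribute value leaks into the text
print(repr(strip_tags_parser(SAMPLE)))  # 'Hello world'
```

The regex cuts the opening tag short at the '>' inside the quoted attribute, while the parser knows where the tag really ends. A more careful regex can handle this particular case, but each new quirk in the markup needs another patch.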

The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.

Answered by Pratik Deoghare

You want to look at "Extracting data from HTML documents" in Dive into Python, because here it does (almost) exactly what you want.

Answered by Christian Hausknecht

The best modules for this task are lxml or html5lib; Beautiful Soup is, IMHO, not worth using anymore. And for recursive models, regular expressions are definitely the wrong method.

Answered by Ankit

If I understand your question correctly, this can be done simply by using the urlopen function of urllib. Have a look at this function to open a URL and read the response, which will be the HTML code of that page.
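
A minimal sketch of that approach, assuming Python 3, where urlopen lives in urllib.request. The helper name fetch_html and the UTF-8 fallback are assumptions made for illustration, not something the answer specifies. Note that this only returns the raw HTML; the tags still have to be removed with one of the other techniques on this page.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding='utf-8'):
    """Open a URL and return the response body decoded to text."""
    with urlopen(url) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header, if there is one.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset, errors='replace')


# Usage (needs network access; substitute a real address):
# page = fetch_html('http://someurl.at.domain')
```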

Answered by Alexander Gessler

The quickest way to get a usable sample of what a browser would display is to remove any tags from the html and print the rest. This can, for example, be done using python's re.
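
As a rough sketch of what that could look like (the pattern and the strip_tags helper are assumptions, not code from the answer), the snippet below removes anything that looks like a tag, decodes entities, and collapses whitespace, using the input from the question:

```python
import html
import re

# Anything between '<' and the next '>' is treated as a tag.
TAG_RE = re.compile(r'<[^>]+>')


def strip_tags(page):
    """Remove tag-like substrings, then tidy the leftover text."""
    text = TAG_RE.sub('', page)
    text = html.unescape(text)     # turn entities such as &amp; into characters
    return ' '.join(text.split())  # collapse runs of whitespace and newlines


page = """<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>"""

print(strip_tags(page))
# Page title This is paragraph one. This is paragraph two.
```

This is only a quick approximation: script and style contents, comments, and attribute values containing '>' are not handled, which is where a real parser does better.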
