Original question: http://stackoverflow.com/questions/2416823/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to get the content of an HTML page in Python
Asked by Yin Zhu
I have downloaded a web page into an HTML file. What is the simplest way to get the content of that page? By content, I mean the strings that a browser would display.
To be clear:
Input:
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:
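# Beautiful Soup 4 import; the original post used the older "from BeautifulSoup import BeautifulSoup"
from bs4 import BeautifulSoup
import re
def removeHtmlTags(page):
    # Regex approach: delete anything that looks like a tag, allowing quoted attribute values
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)
def removeHtmlTags2(page):
    # Parser approach: join every text node found in the document
    soup = BeautifulSoup(page, 'html.parser')
    return ''.join(soup.findAll(text=True))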
Related
- Python HTML removal
- Extracting text from HTML file using Python
- What is a light python library that can eliminate HTML tags? (and only text)
- Remove HTML tags in AppEngine Python Env (equivalent to Ruby's Sanitize)
- RegEx match open tags except XHTML self-contained tags (the famous "don't use regex to parse HTML" rant)
Answered by Oddthinking
Parse the HTML with Beautiful Soup.
To get all the text, without the tags, try:
''.join(soup.findAll(text=True))
Answered by the Tin Man
Personally, I use lxml because it's a swiss-army knife...
from lxml import html
print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())
This tells lxml to retrieve the page, locate the <body> tag, then extract and print all the text.
I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.
The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.
Answered by Pratik Deoghare
You want to look at "Extracting data from HTML documents" in Dive Into Python, because it does (almost) exactly what you want.
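As a rough illustration of the event-driven style of extraction, here is a minimal sketch using the standard library's html.parser module (Python 3). This is not the code from Dive Into Python; the names TextExtractor and html_to_text are made up for this example, and skipping <script> and <style> content is an assumption about what "text a browser would display" should exclude.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(page):
    parser = TextExtractor()
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace so the result reads as a single line.
    return " ".join("".join(parser.chunks).split())


# The sample input from the question.
page = """<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>"""

print(html_to_text(page))
```

On the question's sample input this should print the requested output line. It makes no attempt to honour CSS, hidden elements, or block-level line breaks.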
Answered by Christian Hausknecht
The best modules for this task are lxml or html5lib; Beautiful Soup is, IMHO, no longer worth using. And for recursive structures, regular expressions are definitely the wrong method.
Answered by Ankit
If I understand your question correctly, this can be done with the urlopen function of urllib. Have a look at that function: it opens a URL and lets you read the response, which will be the HTML code of that page.
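A minimal sketch of that download step, assuming Python 3, where the function lives in urllib.request. The helper name fetch_html and the UTF-8 fallback are assumptions for illustration. Note that this returns the raw HTML with its tags, so it only covers fetching the page, not extracting the displayed text.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Hypothetical helper: return the body at `url` decoded to text."""
    with urlopen(url) as response:
        # Prefer the charset declared in the response headers, if any.
        charset = response.headers.get_content_charset() or default_encoding
        return response.read().decode(charset, errors="replace")


# A data: URL stands in for a real address (for example "http://example.com/")
# so that this sketch runs without network access.
demo_url = "data:text/html,<html><body><p>Hello</p></body></html>"
print(fetch_html(demo_url))
```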
Answered by Alexander Gessler
The quickest way to get a usable sample of what a browser would display is to remove any tags from the HTML and print the rest. This can, for example, be done using Python's re module.
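A quick sketch of that tag-stripping idea, assuming Python 3. The pattern and the helper name strip_tags are illustrative rather than taken from the answer, and decoding entities with html.unescape is an added assumption.

```python
import re
from html import unescape

# Naive pattern: anything between "<" and the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(page):
    text = TAG_RE.sub("", page)   # drop the tags, keep what is between them
    text = unescape(text)         # turn entities such as &amp; into characters
    return " ".join(text.split())  # collapse whitespace and newlines


# The sample input from the question.
page = """<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>"""

print(strip_tags(page))
```

This is fine for a rough preview, but it keeps the contents of script and style elements and can be confused by a ">" inside an attribute value or a comment, which is why the other answers recommend a parser.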