How to get an HTML file using Python?
Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/4489550/
Asked by nakiya
I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html.
How do I retrieve the page? My two main concerns are which functions to use and how to filter out useless links from the page.
Accepted answer by Vince Spicer
Example using urllib and lxml.html:
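# Python 2 code: urllib.urlopen() and the print statement do not exist in Python 3.
import urllib
from lxml import html

url = "http://www.infolanka.com/miyuru_gee/art/art.html"
# Download the page and parse it into an element tree.
page = html.fromstring(urllib.urlopen(url).read())
# Print the text and target of every <a> element.
for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")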
output >>
[('Aathma Liyanage', 'athma.html'),
('Abewardhana Balasuriya', 'abewardhana.html'),
('Aelian Thilakeratne', 'aelian_thi.html'),
('Ahamed Mohideen', 'ahamed.html'),
]
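For readers on Python 3 without lxml, the same idea can be sketched with the standard library's html.parser. This is only an illustration, not the answer author's method: the names LinkCollector, artist_links and SAMPLE are made up here, and the filter rule (keep relative links ending in .html) is an assumption based on the output shown above. The sample markup stands in for the real page so the sketch runs offline.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def artist_links(html_text):
    """Keep only relative links to .html pages (assumed to be artist pages)."""
    parser = LinkCollector()
    parser.feed(html_text)
    return [
        (text, href)
        for text, href in parser.links
        if href.endswith(".html") and "://" not in href and not href.startswith("/")
    ]


# Hypothetical sample markup, loosely modelled on the output above.
SAMPLE = """
<DL>
<DT><a href="athma.html">Aathma Liyanage</a>
<DT><a href="abewardhana.html">Abewardhana Balasuriya</a>
</DL>
<a href="http://www.example.com/">Home</a>
<a href="mailto:someone@example.com">Contact</a>
"""

print(artist_links(SAMPLE))
```

On the real page the filter would likely need adjusting after looking at which links are actually unwanted.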
Answer by user225312
Use urllib2 to get the page.
Use BeautifulSoup to parse the HTML (the page) and get what you want!
Answer by Tim Barrass
And respect robots.txt and throttle your requests :)
(Apparently urllib2 does already according to this helpful SO post).
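If you would rather check robots.txt explicitly than rely on the HTTP library, a minimal Python 3 sketch with the standard urllib.robotparser module might look like this. The robots.txt content, the "example-bot" user agent and the helper names allowed, polite_delay and crawl are all assumptions for illustration; against a real site you would call set_url() and read() instead of parse().

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content, used so the sketch runs without network access.
# For a real site: robots.set_url("http://host/robots.txt"); robots.read()
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(SAMPLE_ROBOTS.splitlines())


def allowed(url, user_agent="example-bot"):
    """Return True if robots.txt permits user_agent to fetch url."""
    return robots.can_fetch(user_agent, url)


def polite_delay(user_agent="example-bot", default=1.0):
    """Seconds to wait between requests: the Crawl-delay value, else a default."""
    delay = robots.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(urls, fetch, user_agent="example-bot"):
    """Call fetch(url) for each permitted URL, pausing between requests."""
    pages = {}
    for url in urls:
        if not allowed(url, user_agent):
            continue
        pages[url] = fetch(url)
        time.sleep(polite_delay(user_agent))
    return pages


print(allowed("http://www.infolanka.com/miyuru_gee/art/art.html"))
print(allowed("http://www.infolanka.com/private/page.html"))
print(polite_delay())
```

The fetch argument is left as a callable so any of the download approaches in this thread can be plugged in.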
Answer by eyquem
Or take the straightforward route:
import urllib
import re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.urlopen(url)
li = pat.findall(sock.read())
sock.close()
print li
Answer by Miere
I think eyquem's approach would be my choice too, but I prefer httplib2 to urllib. urllib2 is too low-level a library for this job.
import httplib2, re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
http = httplib2.Http()
headers, body = http.request("http://www.infolanka.com/miyuru_gee/art/art.html")
li = pat.findall(body)
print li
Answer by pulsedia
Check this, my friend:
import urllib.request
import re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.request.urlopen(url).read().decode("utf-8")
li = pat.findall(sock)
print(li)
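The snippet above hardcodes UTF-8 when decoding. A slightly more defensive variant, sketched here under a few assumptions, reads the charset from the response headers and falls back to UTF-8. The helper name fetch_html, the IGNORECASE flag and the sample markup are additions for illustration; a data: URL stands in for the real page so the example runs offline.

```python
import re
import urllib.parse
import urllib.request

# Same pattern as above; IGNORECASE is an added assumption so <dt> and <DT> both match.
PAT = re.compile(r'<DT><a href="[^"]+">(.+?)</a>', re.IGNORECASE)


def fetch_html(url, timeout=10):
    """Download url and decode it with the declared charset, defaulting to UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical sample markup wrapped in a data: URL instead of the real page.
SAMPLE_PAGE = (
    '<DL><DT><a href="athma.html">Aathma Liyanage</a>\n'
    '<dt><a href="ahamed.html">Ahamed Mohideen</a></DL>'
)
sample_url = "data:text/html;charset=utf-8," + urllib.parse.quote(SAMPLE_PAGE)

names = PAT.findall(fetch_html(sample_url))
print(names)
```

To run it against the real site, pass the page URL from the question to fetch_html instead of sample_url.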
Answer by SysMurff
Basically, there's a function call: render_template(). You can easily return a single page or a list of pages with it, and it reads all files automatically from the your_workspace\templates directory.
Example:
/root_dir
    /templates
        /index1.html, /index2.html
    /other_dir
    /routes.py

routes.py
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def root_dir():
    return render_template('index1.html')

@app.route('/<username>')
def root_dir_with_params(username):
    return render_template('index2.html', user=username)

index1.html - without params
<html>
<body>
<h1>Hello guest!</h1>
<button id="getData">Get Data!</button>
</body>
</html>

index2.html - with params
<html>
<body>
<!-- Built-in conditional statements in Flask's template engine -->
{% if user %}
<h1 style="color: red;">Hello {{ user }}!</h1>
{% else %}
<h1>Hello guest.</h1>
{% endif %}
<button id="getData">Get Data!</button>
</body>
</html>