How do I get an HTML file using Python?

Question

Asked by nakiya

I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html.

How do I retrieve the page? My two main concerns are: which functions to use, and how to filter out useless links from the page.

Answer 1

Accepted answer by Vince Spicer

Example using urllib and lxml.html (Python 2 syntax):

import urllib
from lxml import html

url = "http://www.infolanka.com/miyuru_gee/art/art.html"
page = html.fromstring(urllib.urlopen(url).read())

for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")

output >>
    [('Aathma Liyanage', 'athma.html'),
     ('Abewardhana Balasuriya', 'abewardhana.html'),
     ('Aelian Thilakeratne', 'aelian_thi.html'),
     ('Ahamed Mohideen', 'ahamed.html'),
    ]
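
The loop above visits every link on the page, including navigation links. As a rough sketch of one way to keep only the artist entries, the snippet below collects (name, href) tuples for anchors that directly follow a <dt> tag. It swaps in the standard library's html.parser in place of lxml so it runs without extra packages. The <dt> assumption is taken from the regex answers further down this thread, and SAMPLE_PAGE and DtLinkParser are hypothetical names made up for illustration, not the real page or an existing API.

```python
from html.parser import HTMLParser


class DtLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags that directly follow a <dt>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._after_dt = False  # True right after a <dt> start tag
        self._href = None       # href of the anchor being captured, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <DT> arrives as "dt".
        if tag == "dt":
            self._after_dt = True
            return
        if tag == "a" and self._after_dt:
            self._href = dict(attrs).get("href")
            self._text = []
        self._after_dt = False

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_artist_links(markup):
    parser = DtLinkParser()
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical stand-in for the real page, so the sketch runs offline.
SAMPLE_PAGE = """
<html><body>
<a href="../index.html">Home</a>
<dl>
<DT><a href="athma.html">Aathma Liyanage</a>
<DT><a href="abewardhana.html">Abewardhana Balasuriya</a>
</dl>
<a href="mailto:someone@example.com">Contact</a>
</body></html>
"""

links = extract_artist_links(SAMPLE_PAGE)
print(links)
```

To run it against the live page, pass the downloaded and decoded HTML to extract_artist_links() instead of SAMPLE_PAGE. If the real markup does not wrap artist links in <dt> tags, the filter would need adjusting.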

Answer 2

Answered by user225312

Use urllib2 to get the page.
Use BeautifulSoup to parse the HTML (the page) and get what you want!
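
A minimal sketch of those two steps, assuming Python 3 (where urllib2's functionality lives in urllib.request) and the bs4 package for BeautifulSoup. The filter rule, which keeps only relative links ending in ".html", is a guess at what counts as a useful link on this page. SAMPLE_PAGE, fetch and artist_links are hypothetical names used for illustration.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and decode it, falling back to UTF-8."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def artist_links(markup):
    """Return (text, href) for relative links that end in .html."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Assumed filter: drop absolute URLs, mailto links, and other paths.
        if urlparse(href).scheme or "/" in href or not href.endswith(".html"):
            continue
        links.append((a.get_text(strip=True), href))
    return links


# Hypothetical stand-in for the real page, so the sketch runs offline.
SAMPLE_PAGE = """
<a href="http://www.infolanka.com/">Home</a>
<a href="athma.html">Aathma Liyanage</a>
<a href="ahamed.html">Ahamed Mohideen</a>
<a href="mailto:webmaster@example.com">Contact</a>
"""

result = artist_links(SAMPLE_PAGE)
print(result)
```

For the live page, something like artist_links(fetch("http://www.infolanka.com/miyuru_gee/art/art.html")) would be the equivalent call, though the filter may need tuning once you see the real markup.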

Answer 3

Answered by Tim Barrass

And respect robots.txt and throttle your requests :)

(Apparently urllib2 already does this, according to this helpful SO post.)
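
If you would rather check explicitly, here is a rough sketch using the standard library's urllib.robotparser together with a fixed pause between requests. The robots.txt text, the example.com URLs, the one-second default delay, and the helper names build_robot_rules and polite_urls are all assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_urls(rules, urls, user_agent="*", delay_seconds=1.0):
    """Yield only the URLs robots.txt allows, pausing between them."""
    first = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, skip it
        if not first:
            time.sleep(delay_seconds)  # throttle: wait before the next one
        first = False
        yield url


# Hypothetical robots.txt content, so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(SAMPLE_ROBOTS)
candidates = [
    "http://www.example.com/art/art.html",
    "http://www.example.com/private/notes.html",
]
allowed = list(polite_urls(rules, candidates, delay_seconds=0))
print(allowed)
```

Against a real site you would call rules.set_url(".../robots.txt") followed by rules.read() instead of parsing a string, and fetch each URL inside the loop that consumes polite_urls().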

Answer 4

Answered by eyquem

Or take the straightforward route with a regex:

import urllib

import re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.urlopen(url)
li = pat.findall(sock.read())
sock.close()

print li

Answer 5

Answered by Miere

I think eyquem's approach would be my choice too, but I like to use httplib2 instead of urllib. urllib2 is too low-level a library for this job.

import httplib2, re

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
http = httplib2.Http()
headers, body = http.request("http://www.infolanka.com/miyuru_gee/art/art.html")

li = pat.findall(body)
print li

Answer 6

Answered by pulsedia

Check this out, my friend (Python 3):

import re
import urllib.request

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.request.urlopen(url).read().decode("utf-8")

li = pat.findall(sock)
print(li)

Answer 7

Answered by SysMurff

Basically, there's a function call:
render_template()
You can easily return a single page or a list of pages with it, and it automatically reads all files from your_workspace\templates.
Example:

/root_dir
    /templates
        /index1.html
        /index2.html
    /other_dir

routes.py

# render_template comes from Flask.
from flask import Flask, render_template

app = Flask(__name__)

# Serve the template that takes no parameters.
@app.route('/')
def root_dir():
    return render_template('index1.html')

# Pass the URL segment to the template as "user".
@app.route('/<username>')
def root_dir_with_params(username):
    return render_template('index2.html', user=username)

index1.html - without params

<html>
<body>
  <h1>Hello guest!</h1>
  <button id="getData">Get Data!</button>
</body>
</html>

index2.html - with params

<html>
<body>
  {% if user %}
  <h1 style="color: red;">Hello {{ user }}!</h1>
  {% else %}
  <h1>Hello guest.</h1>
  <button id="getData">Get Data!</button>
  {% endif %}
</body>
</html>
