Using Python and BeautifulSoup (webpage source code saved to a local file)
Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original source and authors, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/21570780/
Asked by Mark K
I am using Python 2.7 + BeautifulSoup 4.3.2.
I am trying to use Python and BeautifulSoup to extract information from a webpage. Because the page is on a company website that requires login and redirection, I copied the target page's source code into a file and saved it as "example.html" in C:\ so that I can practice on it.
This is part of the page's source code:
<tr class="ghj">
<td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&u=12563">port_new_cape</a></td>
<td class="position"><a href="./search.php?id=12563&sr=positions" title="Search positions">452</a></td>
<td class="details"><div>South</div></td>
<td>May 09, 1997</td>
<td>Jan 23, 2009 12:05 pm </td>
</tr>
The code I worked out so far is:
from bs4 import BeautifulSoup
import re
import urllib2

url = "C:\example.html"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

cities = soup.find_all('span', {'class' : 'city-sh'})
for city in cities:
    print city
This is just the first stage of testing, so it's somewhat incomplete.
However, when I run it, I get an error. It seems that urllib2.urlopen is not the right way to open a local file.
Traceback (most recent call last):
  File "C:\Python27\Testing.py", line 8, in <module>
    page = urllib2.urlopen(url)
  File "C:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 404, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 427, in _open
    'unknown_open', req)
  File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1247, in unknown_open
    raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: c>
How can I practice using a local file?
Accepted answer by Mark K
With Chandan's help, the problem has been solved. All credit goes to him. :)
urllib2.urlopen is not needed here: it expects a URL, so the drive letter in "C:\example.html" is read as a URL scheme. The built-in open() reads the local file directly.
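from bs4 import BeautifulSoup

# Raw string so the backslash in the Windows path is taken literally.
path = r"C:\example.html"

# Open the saved page with the built-in open(); urllib2 is not needed.
with open(path) as page:
    soup = BeautifulSoup(page.read(), "html.parser")

# Collect every <span class="city-sh"> element.
cities = soup.find_all('span', {'class': 'city-sh'})
for city in cities:
    print(city)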
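For context on the original error, here is a minimal sketch, assuming Python 3 (where urllib2.urlopen has moved to urllib.request and URL parsing to urllib.parse; the question itself uses Python 2.7). It shows that the drive letter is parsed as a URL scheme, and that urlopen can still read a local file if the path is first converted to a file:// URL. The helper name read_local_page and the throwaway temporary file are illustrative assumptions, not part of the original answers.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

# The drive letter before the colon is parsed as the URL scheme,
# which is where "unknown url type: c" comes from.
windows_path = r"C:\example.html"
scheme = urlparse(windows_path).scheme
print(scheme)


def read_local_page(path):
    """Hypothetical helper: read a local file through urlopen via a file:// URL."""
    file_url = Path(path).resolve().as_uri()  # e.g. file:///.../example.html
    with urlopen(file_url) as response:
        return response.read()


# Demonstrate with a throwaway file so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example.html"
    sample.write_text('<span class="city-sh"></span>', encoding="utf-8")
    html_bytes = read_local_page(sample)
    print(html_bytes)
```

For a file that is already on disk, the plain open() call from the answer above is simpler; the file:// route is mainly useful when the same code must accept both URLs and local paths.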
Answer by Tanveer Alam
You can also try the lxml parser. Here is an example for your HTML data.
import lxml.html as PARSER

# Read the saved page and parse it into an element tree.
data = open('example.html').read()
root = PARSER.fromstring(data)

# Print the text content of every <td> cell.
for ele in root.iter("td"):
    print(ele.text_content())
Output: port_new_cape 452 South May 09, 1997 Jan 23, 2009 12:05 pm
Answer by CasualDemon
The best way to open a local file with BeautifulSoup is to pass it an open file handle directly. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
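from bs4 import BeautifulSoup

# Raw string keeps the backslash in the Windows path literal.
soup = BeautifulSoup(open(r"C:\example.html"), "html.parser")
# Print every <span class="city-sh"> element.
for city in soup.find_all('span', {'class' : 'city-sh'}):
    print(city)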
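The file-handle approach can be extended to pull the cell text out of each row. The following is a minimal sketch, assuming Python 3 with the bs4 package installed. The sample row is copied from the question and wrapped in a table element; the function name read_rows, the utf-8 encoding, and the temporary file standing in for C:\example.html are assumptions made for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Sample row from the question, wrapped in <table> (the wrapper is an assumption).
SAMPLE_HTML = """<table>
<tr class="ghj">
<td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&u=12563">port_new_cape</a></td>
<td class="position"><a href="./search.php?id=12563&sr=positions" title="Search positions">452</a></td>
<td class="details"><div>South</div></td>
<td>May 09, 1997</td>
<td>Jan 23, 2009 12:05 pm </td>
</tr>
</table>"""


def read_rows(path):
    """Hypothetical helper: return the stripped text of each <td>, grouped by row."""
    # Pass the open file handle straight to BeautifulSoup; the with block
    # closes the file once parsing is done.
    with open(path, encoding="utf-8") as handle:
        soup = BeautifulSoup(handle, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_="ghj"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


# Write the sample to a temporary file that stands in for C:\example.html.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "example.html")
    with open(sample_path, "w", encoding="utf-8") as out:
        out.write(SAMPLE_HTML)
    rows = read_rows(sample_path)

for row in rows:
    print(row)
```

Filtering on the row class "ghj" keeps header or layout rows out of the result; if the real page uses other row classes, the filter would need to change accordingly.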

