Using Python and BeautifulSoup (web page source saved to a local file)

Question

Asked by Mark K

I am using Python 2.7 + BeautifulSoup 4.3.2.

I am trying to use Python and BeautifulSoup to extract information from a web page. Because the page is on the company website and requires login and redirection, I copied the target page's source into a file and saved it as "example.html" in C:\ so I can practice on it.

This is part of the original page source:

<tr class="ghj">
    <td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&amp;u=12563">port_new_cape</a></td>
    <td class="position"><a href="./search.php?id=12563&amp;sr=positions" title="Search positions">452</a></td>
    <td class="details"><div>South</div></td>
    <td>May 09, 1997</td>
    <td>Jan 23, 2009 12:05 pm&nbsp;</td>
</tr>

The code I have worked out so far is:

from bs4 import BeautifulSoup
import re
import urllib2

url = "C:\example.html"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

cities = soup.find_all('span', {'class' : 'city-sh'})

for city in cities:
    print city

This is just the first stage of testing, so it's somewhat incomplete.

However, when I run it, it gives an error message. It seems that urllib2.urlopen is not the right way to open a local file.

 Traceback (most recent call last):
   File "C:\Python27\Testing.py", line 8, in <module>
     page = urllib2.urlopen(url)
   File "C:\Python27\lib\urllib2.py", line 127, in urlopen
     return _opener.open(url, data, timeout)
   File "C:\Python27\lib\urllib2.py", line 404, in open
     response = self._open(req, data)
   File "C:\Python27\lib\urllib2.py", line 427, in _open
     'unknown_open', req)
   File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
     result = func(*args)
   File "C:\Python27\lib\urllib2.py", line 1247, in unknown_open
     raise URLError('unknown url type: %s' % type)
 URLError: <urlopen error unknown url type: c>

How can I practice using a local file?

Answer 1

Accepted answer by Mark K

With Chandan's help, the problem has been solved. All credit goes to him. :)

urllib2.urlopen is not needed here. The built-in open() reads the local file directly.

from bs4 import BeautifulSoup

# urllib2 is not needed: the built-in open() reads a local file.
path = r"C:\example.html"  # raw string, so the backslash is taken literally
page = open(path)
soup = BeautifulSoup(page.read())
page.close()

cities = soup.find_all('span', {'class': 'city-sh'})

for city in cities:
    print city
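
To see why the original urlopen call failed, here is a small sketch. It assumes Python 3 and uses only the standard library (urllib.parse instead of Python 2's urlparse module). The names raw_path and file_url are made up for illustration. URL parsing treats everything before the first colon as the scheme, so the drive letter "C" is read as a URL type, which matches the "unknown url type: c" in the traceback. A file:// URL is one form that does carry a scheme urlopen recognizes, although plain open() remains the simpler choice.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

raw_path = r"C:\example.html"

# Everything before the first colon is parsed as the URL scheme,
# so the drive letter ends up being treated as the "url type".
scheme = urlparse(raw_path).scheme
print(scheme)

# A file:// URL names the same file with an explicit "file" scheme.
# PureWindowsPath only manipulates the string, so this runs on any OS.
file_url = PureWindowsPath(raw_path).as_uri()
print(file_url)
print(urlparse(file_url).scheme)
```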

Answer 2

Answered by Tanveer Alam

You can also try the lxml parser. Here is an example for your HTML data.

import lxml.html

# Read the saved page and parse it into an element tree.
data = open('example.html').read()
root = lxml.html.fromstring(data)

# Print the text of every <td> cell.
for ele in root.iter("td"):
    print ele.text_content()

Output:

port_new_cape
452
South
May 09, 1997
Jan 23, 2009 12:05 pm

Answer 3

Answered by CasualDemon

The best way to open a local file with BeautifulSoup is to pass it an open file handle directly. http://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

from bs4 import BeautifulSoup

soup = BeautifulSoup(open(r"C:\example.html"), "html.parser")

for city in soup.find_all('span', {'class' : 'city-sh'}):
    print(city)
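
Building on the file-handle approach, here is a minimal sketch that goes one step further and pulls the city name and position out of the row from the question. It assumes Python 3 with the bs4 package installed. The helper read_rows and the temporary file standing in for C:\example.html are hypothetical, added only so the example runs anywhere. The sample row is wrapped in a table element for tidiness.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Sample row copied from the question; the temporary file below
# stands in for C:\example.html.
SAMPLE_HTML = """<table>
<tr class="ghj">
    <td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&amp;u=12563">port_new_cape</a></td>
    <td class="position"><a href="./search.php?id=12563&amp;sr=positions" title="Search positions">452</a></td>
    <td class="details"><div>South</div></td>
    <td>May 09, 1997</td>
    <td>Jan 23, 2009 12:05 pm&nbsp;</td>
</tr>
</table>"""


def read_rows(path):
    """Return (city, position) pairs found in the saved page at `path`."""
    # The with-block closes the file even if parsing raises.
    with open(path, encoding="utf-8") as handle:
        soup = BeautifulSoup(handle, "html.parser")

    rows = []
    for span in soup.find_all("span", class_="city-sh"):
        link = span.find_next_sibling("a")  # the city name is in the <a> after the span
        row = span.find_parent("tr")        # the enclosing table row
        position = row.find("td", class_="position")
        rows.append((link.get_text(strip=True), position.get_text(strip=True)))
    return rows


# Write the sample to a temporary file, then parse it back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "example.html")
    with open(sample_path, "w", encoding="utf-8") as out:
        out.write(SAMPLE_HTML)
    rows = read_rows(sample_path)

print(rows)
```

To use it on the real saved page, call read_rows(r"C:\example.html"). If the file was saved in a different encoding, the encoding argument would need to change.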
