Python: how to extract text from an HTML page?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/33566843/

Date: 2020-08-19 13:34:15  Source: igfitidea

How to extract text from html page?

Tags: python, html, python-3.x, text

Asked by Nique

For example the web page is the link:

https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50

I need the names of the firms along with their addresses and websites. I have tried the following to convert the HTML to text:

import nltk
from urllib import urlopen

url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)

But it returns the error:

ImportError: cannot import name 'urlopen'

Accepted answer by JRodDynamite

Peter Wood has answered your problem (link).

import urllib.request

uf = urllib.request.urlopen(url)
html = uf.read()
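For background: urlopen moved from the top-level urllib module in Python 2 to urllib.request in Python 3, which is exactly why the import in the question fails. If a script must run under both versions, a common guard (a sketch, not needed for Python 3-only code) is:

```python
try:
    # Python 3: urlopen lives in urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2 fallback: urlopen was a top-level name in urllib
    from urllib import urlopen
```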

But if you want to extract data (such as the name of each firm, its address, and its website), then you will need to fetch the HTML source and parse it with an HTML parser.

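As an aside, since the question title asks about extracting text generally: nltk.clean_html is no longer available in recent NLTK versions, but the standard library's html.parser can strip tags on its own. A minimal sketch, using no third-party packages:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data receives only text nodes, never markup
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```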
I'd suggest using requests to fetch the HTML source and BeautifulSoup to parse it and extract the text you require.

Here is a small snippet which will give you a head start.

import requests
from bs4 import BeautifulSoup

link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"

html = requests.get(link).text

"""If you do not want to use requests then you can use the following code below 
   with urllib (the snippet above). It should not cause any issue."""
soup = BeautifulSoup(html, "lxml")
res = soup.findAll("article", {"class": "listingItem"})
for r in res:
    print("Company Name: " + r.find('a').text)
    print("Address: " + r.find("div", {'class': 'address'}).text)
    print("Website: " + r.find_all("div", {'class': 'pageMeta-item'})[3].text)