Python: how to extract text from an HTML page?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/33566843/

Date: 2020-08-19 13:34:15  Source: igfitidea

How to extract text from html page?

Tags: python, html, python-3.x, text

Asked by Nique

For example the web page is the link:

https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50

I need the names of the firms along with their addresses and websites. I have tried the following to convert the HTML to text:

import nltk
from urllib import urlopen

url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)

But it returns the error:

ImportError: cannot import name 'urlopen'

Accepted answer by JRodDynamite

Peter Wood has answered your problem (link).

import urllib.request

uf = urllib.request.urlopen(url)
html = uf.read()
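For background: urlopen moved from the top-level urllib module in Python 2 to urllib.request in Python 3, which is exactly why the import in the question fails. If a script must run under both versions, a common guard (a sketch, not needed for Python 3-only code) is:

```python
try:
    # Python 3: urlopen lives in urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2 fallback: urlopen was a top-level name in urllib
    from urllib import urlopen
```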

But if you want to extract data (such as the name of each firm, its address, and its website), then you will need to fetch the HTML source and parse it with an HTML parser.

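As an aside, since the question title asks about extracting text generally: nltk.clean_html is no longer available in recent NLTK versions, but the standard library's html.parser can strip tags on its own. A minimal sketch, using no third-party packages:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data receives only text nodes, never markup
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```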
I'd suggest using requests to fetch the HTML source and BeautifulSoup to parse it and extract the text you require.

Here is a small snippet which will give you a head start.

import requests
from bs4 import BeautifulSoup

link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"

html = requests.get(link).text

"""If you do not want to use requests then you can use the following code below 
   with urllib (the snippet above). It should not cause any issue."""
soup = BeautifulSoup(html, "lxml")
res = soup.findAll("article", {"class": "listingItem"})
for r in res:
    print("Company Name: " + r.find('a').text)
    print("Address: " + r.find("div", {'class': 'address'}).text)
    print("Website: " + r.find_all("div", {'class': 'pageMeta-item'})[3].text)