Python: How to extract text from an HTML page?
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/33566843/
How to extract text from an HTML page?
Asked by Nique
For example, the web page is this link:
https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50
I need the names of the firms along with their addresses and websites. I have tried the following to convert the HTML to text:
import nltk
from urllib import urlopen
url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx display=50"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)
But it returns the error:
ImportError: cannot import name 'urlopen'
Accepted answer by JRodDynamite
Peter Wood has answered your problem (link).
import urllib.request

# In Python 3, urlopen lives in urllib.request (in Python 2 it was urllib.urlopen)
url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
uf = urllib.request.urlopen(url)
html = uf.read()
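As a side note, uf.read() returns bytes in Python 3, so if you need a plain string (for example to print it or pass it to a text-only API) you would decode it first. A minimal sketch, assuming the page is UTF-8 encoded:

text = html.decode("utf-8")  # bytes -> str; "utf-8" is an assumption about the page encoding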
But if you want to extract data (such as the firm's name, address, and website), then you will need to fetch the HTML source and parse it with an HTML parser.
I'd suggest using requests to fetch the HTML source and BeautifulSoup to parse the HTML and extract the text you require.
Here is a small snippet which will give you a head start.
import requests
from bs4 import BeautifulSoup

link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"

# Fetch the page. If you do not want to use requests, you can fetch the HTML
# with urllib instead (the snippet above); it should not cause any issue.
html = requests.get(link).text

soup = BeautifulSoup(html, "lxml")

# Each firm is listed in an <article class="listingItem"> element
res = soup.findAll("article", {"class": "listingItem"})

for r in res:
    print("Company Name: " + r.find('a').text)
    print("Address: " + r.find("div", {'class': 'address'}).text)
    print("Website: " + r.find_all("div", {'class': 'pageMeta-item'})[3].text)