Python: How do I open an .html file?
Original question: http://stackoverflow.com/questions/27243129/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to open html file?
Asked by david
I have an HTML file called test.html. It contains one word: ?????.
I open test.html and print its contents using this block of code:
file = open("test.html", "r")
print file.read()
But it prints ??????. Why does this happen, and how can I fix it?
BTW, when I open a plain text file, it works fine.
Edit: I tried this:
>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????
Accepted answer by vks
import codecs
f=codecs.open("test.html", 'r')
print f.read()
Try something like this.
Answer by Benjamin
You can read an HTML page using 'urllib'.
#python 2.x
import urllib
page = urllib.urlopen("your path ").read()
print page
Answer by wenzul
Use codecs.open with the encoding parameter.
import codecs
f = codecs.open("test.html", 'r', 'utf-8')
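A minimal sketch of the same idea, assuming Python 3 and that the file is saved as UTF-8 (the question does not say which encoding test.html actually uses). The helper name read_html and the demo file are hypothetical, made up for illustration:

```python
import codecs
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Return the file's contents decoded with the given encoding."""
    # codecs.open decodes the bytes to text while reading
    with codecs.open(path, "r", encoding) as f:
        return f.read()


# Demo: write a small UTF-8 file containing a non-ASCII word, then read it back
demo_path = os.path.join(tempfile.mkdtemp(), "test.html")
with codecs.open(demo_path, "w", "utf-8") as f:
    f.write(u"<p>caf\u00e9</p>")

content = read_html(demo_path)
```

If the file was saved in a different encoding, pass that encoding name instead of "utf-8".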
Answer by Dibin Joseph
You can use the following code:
from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup
f=codecs.open("test.html", 'r', 'utf-8')
document= BeautifulSoup(f.read()).get_text()
print document
If you want to remove all the blank lines in between and get all the words as a single string (skipping special characters and numbers), also include:
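import re
import nltk
from nltk.tokenize import word_tokenize

# Assumption: `stop` is a collection of stop words to skip; the answer does not define it
stop = set()
st = ""  # accumulates the kept words
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    if line:
        # keep purely alphabetic tokens only
        if re.match("^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print st
*Define st as a string initially, e.g. st = "", before the loop.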
Answer by Suresh2692
You can use 'urllib' in Python 3, the same way as in https://stackoverflow.com/a/27243244/4815313, with a few changes.
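# Python 3
import urllib.request  # "import urllib" alone does not load the request submodule
# urlopen expects a URL; for a local file, use a file:// URL
page = urllib.request.urlopen("/path/").read()  # returns bytes
print(page)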
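As a sketch of how this might look for a local file in Python 3: urlopen takes a URL rather than a bare path, so the path is converted to a file:// URL first, and the returned bytes are decoded explicitly. The temporary demo file and the UTF-8 encoding are assumptions for illustration.

```python
import tempfile
import urllib.request
from pathlib import Path

# Demo file (hypothetical): a small UTF-8 HTML snippet in a temp directory
demo_path = Path(tempfile.mkdtemp()) / "test.html"
demo_path.write_text("<p>caf\u00e9</p>", encoding="utf-8")

# Convert the absolute local path to a file:// URL
url = demo_path.as_uri()

with urllib.request.urlopen(url) as response:
    raw = response.read()  # bytes, not text

# Decode explicitly; "utf-8" is assumed to match how the file was saved
page = raw.decode("utf-8")
```

For a file that is already on disk, plain open() with an encoding argument is usually simpler; urllib is mainly useful when the page comes from a web address.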
Answer by Chen Mier
I ran into this problem today as well. I am using Windows, and the default system language is Chinese, so others may hit a similar Unicode error. Simply add encoding='utf-8':
with open("test.html", "r", encoding='utf-8') as f:
    text = f.read()
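To illustrate why the explicit argument can matter, here is a small sketch, assuming Python 3. Without an encoding argument, open() typically falls back to a locale-dependent default, which on a Chinese-language Windows system is usually not UTF-8. The demo file and its contents are made up for illustration.

```python
import locale
import os
import tempfile

# The locale's preferred encoding, which open() commonly uses by default
default_encoding = locale.getpreferredencoding(False)

# Demo file (hypothetical): UTF-8 text containing non-ASCII characters
demo_path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("<p>\u4f60\u597d</p>")

# The raw bytes as stored on disk
with open(demo_path, "rb") as f:
    raw = f.read()

# Decoding those bytes with a codec that does not match fails
try:
    wrong = raw.decode("ascii")
except UnicodeDecodeError:
    wrong = None

# Reading with the matching encoding recovers the original text
with open(demo_path, "r", encoding="utf-8") as f:
    text = f.read()
```

Printing default_encoding on your own machine shows which encoding would be used when the argument is omitted.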
Answer by SHUBHAM SINGH
Code:
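import codecs
# Raw string so that backslashes in the Windows path are not treated as escapes
path = r"D:\Users\html\abc.html"
file = codecs.open(path, "rb")
file1 = file.read()
file.close()
file1 = str(file1)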
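One caveat, sketched here under the assumption of Python 3: calling str() on bytes produces their printable representation (starting with b'), not the decoded text. Decoding with the file's actual encoding, assumed to be UTF-8 here, gives the text itself. The file name and contents are illustrative only.

```python
import os
import tempfile

# Demo file (hypothetical) written as UTF-8 bytes
demo_path = os.path.join(tempfile.mkdtemp(), "abc.html")
with open(demo_path, "wb") as f:
    f.write(u"<p>caf\u00e9</p>".encode("utf-8"))

# Read the file in binary mode, as in the answer above
with open(demo_path, "rb") as f:
    raw = f.read()

# str() on bytes gives the repr-style text in Python 3
as_repr = str(raw)

# decode() turns the bytes into the actual text
as_text = raw.decode("utf-8")
```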