Python: how to open an .html file?

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/27243129/

Date: 2020-08-19 01:33:50  Source: igfitidea

How to open html file?

Tags: python, python-2.7, character-encoding

Asked by david

I have an HTML file called test.html; it contains one word: ?????.

I open test.html and print its content using this block of code:

file = open("test.html", "r")
print file.read()

But it prints ??????. Why does this happen, and how can I fix it?

BTW, when I open a plain text file it works fine.

Edit: I tried this:

>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????

Accepted answer by vks

import codecs
f=codecs.open("test.html", 'r')
print f.read()

Try something like this.


Answer by Benjamin

You can read an HTML page using 'urllib'.

# Python 2.x

import urllib

# "your path " is a placeholder for the page's URL
page = urllib.urlopen("your path ").read()
print page

Answer by wenzul

Use codecs.open with the encoding parameter.

import codecs
f = codecs.open("test.html", 'r', 'utf-8')
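
As a minimal sketch of this approach, assuming Python 3 and a file that really is UTF-8 encoded: the snippet below first writes a small sample file so that it runs on its own. The file name `sample_utf8.html`, the helper `read_html`, and the sample word are made up for illustration and are not part of the original answer.

```python
import codecs
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Return the file's contents decoded with the given encoding."""
    # codecs.open decodes while reading, so read() returns text, not raw bytes.
    with codecs.open(path, "r", encoding) as f:
        return f.read()


# Build a throwaway sample file (hypothetical name) so the sketch is self-contained.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "sample_utf8.html")
with codecs.open(sample_path, "w", "utf-8") as f:
    f.write(u"<p>\u05e9\u05dc\u05d5\u05dd</p>")

content = read_html(sample_path)
# ascii() escapes non-ASCII characters, so this prints safely on any console.
print(ascii(content))
```

If the text is decoded correctly but still shows up as question marks when printed directly, the console's own encoding is the likely suspect rather than the file read.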

Answer by Dibin Joseph

You can use the following code:

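from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup

# Read the file as UTF-8 and extract its visible text (Python 2 syntax).
f = codecs.open("test.html", 'r', 'utf-8')
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
document = BeautifulSoup(f.read(), "html.parser").get_text()
print document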

If you want to delete all the blank lines in between and get all the words as a single string (skipping special characters and numbers), also include:

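import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumption: `stop` (undefined in the original) is NLTK's English stop-word list.
stop = set(stopwords.words('english'))
st = ""  # accumulates the kept words
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    # Keep purely alphabetic tokens longer than one character that are not stop words.
    if line and re.match("^[A-Za-z]*$", line):
        if line not in stop and len(line) > 1:
            st = st + " " + line
print st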

* Define st as a string initially, e.g. st = ""
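
The filtering idea can also be sketched with the standard library alone, assuming Python 3. Note that this swaps NLTK's `word_tokenize` for a plain whitespace split, and the stop-word set `STOP_WORDS` and the helper `keep_words` are hypothetical stand-ins, not part of the original answer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "is", "and", "of", "a"}


def keep_words(text, stop=STOP_WORDS):
    """Join alphabetic, non-stop-word tokens longer than one character."""
    tokens = text.split()  # whitespace split instead of NLTK's word_tokenize
    kept = [
        tok for tok in tokens
        if re.match(r"^[A-Za-z]+$", tok) and tok not in stop and len(tok) > 1
    ]
    return " ".join(kept)


result = keep_words("the page has 3 headings and 12 links")
print(result)
```

Building a list and joining it once avoids the leading space that repeated `st = st + " " + line` concatenation leaves at the start of the string.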

Answer by Suresh2692

You can use 'urllib' in Python 3 in the same way as https://stackoverflow.com/a/27243244/4815313 with a few changes.

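# Python 3

import urllib.request  # a plain "import urllib" does not load the request submodule

# "/path/" is a placeholder; urlopen needs a full URL (e.g. http://... or file:///...)
page = urllib.request.urlopen("/path/").read()
print(page)  # bytes; call .decode("utf-8") to get text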
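
For a local file, one possible way to apply this in Python 3 is to hand `urlopen` a `file://` URL. The sketch below assumes the file is UTF-8 encoded; the file name `page.html` and its contents are made up so the example runs on its own.

```python
import pathlib
import tempfile
import urllib.request

# Throwaway local file (hypothetical name) standing in for a real page.
sample_path = pathlib.Path(tempfile.mkdtemp()) / "page.html"
sample_path.write_text("<h1>caf\u00e9</h1>", encoding="utf-8")

# urlopen needs a URL with a scheme; as_uri() turns an absolute path into file:///...
url = sample_path.as_uri()
with urllib.request.urlopen(url) as response:
    raw = response.read()       # bytes

page = raw.decode("utf-8")      # text, assuming the file is UTF-8
```

`read()` always returns bytes here, so the decode step is where the file's encoding has to be stated.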

Answer by Chen Mier

I encountered this problem today as well. I am using Windows with Chinese as the default system language, so others with a similar setup may hit the same Unicode error. Simply add encoding='utf-8':

with open("test.html", "r", encoding='utf-8') as f:
    text= f.read()
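
A small sketch of why the explicit argument matters, assuming Python 3: without `encoding`, `open` falls back to a locale-dependent default, which on a Chinese Windows system is typically not UTF-8. The example simulates that mismatch by forcing the `gbk` codec; the file name `sample.html` and its contents are made up for illustration.

```python
import os
import tempfile

# Write a throwaway UTF-8 file (hypothetical name) so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("<p>\u4f60\u597d</p>")

# Explicit encoding: the result does not depend on the system default.
with open(sample_path, "r", encoding="utf-8") as f:
    text = f.read()

# Simulate a mismatched default by forcing a different codec (GBK here).
try:
    with open(sample_path, "r", encoding="gbk") as f:
        mismatched = f.read()   # decodes, but to the wrong characters
except UnicodeDecodeError:
    mismatched = None           # the bytes are not valid in that codec
```

Either outcome of the mismatched read (garbled text or a `UnicodeDecodeError`) differs from the correctly decoded `text`.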

Answer by SHUBHAM SINGH

CODE:


import codecs

# Raw string so the backslashes (e.g. "\a", "\U") are not treated as escapes.
path = r"D:\Users\html\abc.html"
file = codecs.open(path, "rb")
file1 = file.read()
file1 = str(file1)
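
One caveat when running this under Python 3: calling `str()` on bytes yields their printable representation, not decoded text. A minimal sketch of the difference, using a made-up temporary file in place of the Windows path above and assuming UTF-8 content:

```python
import os
import tempfile

# Throwaway file (hypothetical location) standing in for abc.html.
sample_path = os.path.join(tempfile.mkdtemp(), "abc.html")
with open(sample_path, "wb") as f:
    f.write(b"<p>ok</p>")

with open(sample_path, "rb") as f:
    raw = f.read()

as_repr = str(raw)              # Python 3: the repr text, including the b'' wrapper
as_text = raw.decode("utf-8")   # the actual contents as text
```

Decoding with the file's real encoding is what turns the bytes into usable text.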