Python: How do I open an .html file?

Question

Asked by david

I have an HTML file called test.html; it contains a single word, ?????.

I open test.html and print its contents using this block of code:

file = open("test.html", "r")
print file.read()

but it prints ??????. Why does this happen, and how can I fix it?

BTW, when I open a plain text file it works fine.

Edit: I tried this:

>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????

Answer 1

Accepted answer by vks

import codecs
f=codecs.open("test.html", 'r')
print f.read()

Try something like this.


Answer 2

Answered by Benjamin

You can read an HTML page using 'urllib'.

# Python 2.x
import urllib

page = urllib.urlopen("your path ").read()
print page

Answer 3

Answered by wenzul

Use codecs.open with the encoding parameter.

import codecs
f = codecs.open("test.html", 'r', 'utf-8')
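As a rough illustration of the answer above, the following sketch writes a small UTF-8 file and reads it back through codecs.open with an explicit encoding. The helper name read_html, the sample markup, and the choice of UTF-8 are assumptions made for the example, not part of the original answer.

```python
import codecs
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Hypothetical helper: return the decoded text of the file at `path`."""
    # codecs.open decodes the file's bytes to text using the given encoding
    with codecs.open(path, "r", encoding) as f:
        return f.read()


# Demo on a throwaway file; the sample content is arbitrary non-ASCII text
sample = "<p>你好</p>"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.html")
    with open(path, "wb") as out:
        out.write(sample.encode("utf-8"))  # store the text as UTF-8 bytes
    content = read_html(path)

print(content == sample)  # True when the encoding matches the file
```

If the file was saved in a different encoding, pass that encoding instead of "utf-8".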

Answer 4

Answered by Dibin Joseph

You can use the following code:

from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup

f = codecs.open("test.html", 'r', 'utf-8')
# Name the parser explicitly; "html.parser" ships with Python
document = BeautifulSoup(f.read(), "html.parser").get_text()
print(document)

If you want to remove the blank lines in between and collect all the words into a single string (skipping special characters and numbers), also include:

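import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumption: `stop` is NLTK's English stop-word list; the answer never defines it.
# NLTK's tokenizer and stop-word data must be fetched once with nltk.download().
stop = set(stopwords.words("english"))
st = ""
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    if line:
        # keep purely alphabetic tokens that are not stop words or single letters
        if re.match("^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print(st)

* st must be defined as a string before the loop, e.g. st = ""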

Answer 5

Answered by Suresh2692

You can use 'urllib' in Python 3 in the same way as https://stackoverflow.com/a/27243244/4815313, with a few changes.

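# Python 3
import urllib.request  # "import urllib" alone does not load the request submodule

# "/path/" is a placeholder; urlopen needs a full URL (http://..., file://...)
page = urllib.request.urlopen("/path/").read()
print(page)  # read() returns bytes, not str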
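For a file on disk, one possible way to apply this in Python 3 is to turn the path into a file:// URL first and decode the returned bytes yourself. The sketch below assumes the file is UTF-8 encoded; read_local_html is a hypothetical helper name, not something from the answer.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def read_local_html(path, encoding="utf-8"):
    """Hypothetical helper: read a local file through urllib and decode it."""
    # urlopen expects a URL, so convert the filesystem path to file://...
    uri = Path(path).resolve().as_uri()
    with urllib.request.urlopen(uri) as response:
        raw = response.read()  # bytes
    return raw.decode(encoding)  # assumption: the file is UTF-8


# Demo on a throwaway file
sample = "<p>你好</p>"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.html")
    with open(path, "wb") as out:
        out.write(sample.encode("utf-8"))
    page = read_local_html(path)

print(page == sample)  # True
```

For plain local files, the built-in open with an encoding argument is usually simpler; urllib is mainly useful when the same code also has to fetch pages over HTTP.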

Answer 6

Answered by Chen Mier

I ran into this problem today as well. I am using Windows and the default system language is Chinese, so others with a similar setup may hit the same Unicode error. Simply add encoding='utf-8':

with open("test.html", "r", encoding='utf-8') as f:
    text= f.read()
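To see why the explicit encoding matters, here is a small sketch that reads the same UTF-8 file twice, once with the matching encoding and once with a mismatched one. The sample word and the choice of latin-1 as the "wrong" codec are assumptions for the demonstration; on a real system the mismatched codec would be whatever the platform default happens to be.

```python
import os
import tempfile

text = "café"  # arbitrary sample containing one non-ASCII character

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Matching encoding: the original text comes back unchanged
    with open(path, "r", encoding="utf-8") as f:
        right = f.read()

    # Mismatched encoding: each UTF-8 byte is decoded as a separate character
    with open(path, "r", encoding="latin-1") as f:
        wrong = f.read()

print(ascii(right))  # 'caf\xe9'
print(ascii(wrong))  # 'caf\xc3\xa9'
```

The second result is the kind of garbled output the question describes: the bytes are intact, but they were interpreted with the wrong codec.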

Answer 7

Answered by SHUBHAM SINGH

CODE:


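import codecs

# Raw string so the backslashes in the Windows path are not treated as escapes
path = r"D:\Users\html\abc.html"
with codecs.open(path, "rb") as file:
    file1 = file.read()
# Decode the bytes explicitly (UTF-8 is an assumption); in Python 3, str() on
# bytes would return the "b'...'" representation rather than the text
file1 = file1.decode("utf-8")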
