pandas: Reading a downloaded HTML file with pandas

Question

Asked by lokheart

As the title says, I tried using read_html, but it gives me the following error:

In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6

What have I done wrong?


Update 01

The HTML contains some JavaScript at the top, followed by an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I use something else, like BeautifulSoup, before handing it to pandas?

Answer 1

Accepted answer by ZJS

I think you are on the right track by using an HTML parser like Beautiful Soup. pandas.read_html() reads HTML tables, not a whole HTML page, so it can help to isolate the table first.

You would want to do something like this:

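from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd

# Parse the whole page, then pull out the first <table> element
with open('C:/age0.html', 'r') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
table = soup.find('table')

# read_html does not accept a BeautifulSoup object, so pass str(table);
# it returns a list of DataFrames, so take the first one
df = pd.read_html(StringIO(str(table)))[0]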

Answer 2

Answer by srana

First, install the packages below, which are used for parsing:
- pip install BeautifulSoup4
- pip install lxml
- pip install html5lib

Then use read_html to read the HTML tables on any HTML page.

import pandas as pds
# read_html returns a list of DataFrames, one per table found
pds_df = pds.read_html('C:/age0.html')
pds_df[0]  # the first table on the page
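
If the strict lxml parser still rejects the page (as in the question's XMLSyntaxError), one option is to fall back to the more lenient bs4/html5lib flavor. The snippet below is only a minimal sketch: the HTML string is a made-up stand-in for the real file (a script block followed by a table), and read_tables is a hypothetical helper name, not part of pandas.

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for the downloaded page: a script block, then a table.
html = """
<html>
<head><script>var x = 1;</script></head>
<body>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
  <tr><td>1</td><td>15</td></tr>
</table>
</body>
</html>
"""

def read_tables(html_text):
    """Try the strict lxml flavor first, then fall back to bs4/html5lib."""
    try:
        return pd.read_html(StringIO(html_text), flavor="lxml")
    except Exception:
        # The bs4 flavor is slower but tolerates malformed markup
        return pd.read_html(StringIO(html_text), flavor="bs4")

tables = read_tables(html)   # list of DataFrames, one per <table>
df = tables[0]
print(df)
```

For the real file, you could read its contents with open('C:/age0.html').read() and pass that string to read_tables.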


I hope this helps.

Good luck!
