Using pandas to read a downloaded HTML file

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow. Original question: http://stackoverflow.com/questions/25056120/

Retrieved: 2020-09-13 22:18:41. Source: igfitidea

python, html, import, pandas

Asked by lokheart

As the title says, I tried using read_html, but it gives me the following error:

In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6

What have I done wrong?

Update 01

The HTML file contains some JavaScript at the top, followed by an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I use something else, such as BeautifulSoup, before handing it to pandas?

Accepted answer by ZJS

I think you are on the right track in using an HTML parser like Beautiful Soup. pandas.read_html() is designed to extract HTML tables, and with flavor='lxml' it uses a strict parser that can fail when the rest of the page is malformed, as the misplaced <html> tag error suggests.

You would want to do something like this:

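from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Parse the downloaded page and keep only its first <table> element
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')

# read_html does not accept a BeautifulSoup object, so pass the table's HTML
# as text; it returns a list of DataFrames, so take the first one
df = pd.read_html(StringIO(str(table)))[0]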
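
To try this approach without the original file, here is a minimal sketch that runs on an in-memory page. The `SAMPLE_HTML` string and the helper name `first_table_to_frame` are made up for illustration, and the sketch assumes that pandas, beautifulsoup4, and at least one parser that read_html supports (lxml, or html5lib) are installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded file: a script first, then a table.
SAMPLE_HTML = """
<html>
<head><script>var x = 1;</script></head>
<body>
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Ann</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
</body>
</html>
"""


def first_table_to_frame(html_text):
    """Return the first <table> found in html_text as a DataFrame."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    # read_html returns a list of DataFrames, one per table it finds.
    return pd.read_html(StringIO(str(table)))[0]


df = first_table_to_frame(SAMPLE_HTML)
print(df)
```

Wrapping the markup in `StringIO` is a deliberate choice here: it hands read_html a file-like object, which avoids the deprecation warning that recent pandas versions raise for literal HTML strings.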

Answer by srana

  1. First, install the packages below, which are used for parsing:

    • pip install BeautifulSoup4
    • pip install lxml
    • pip install html5lib
  2. Then use read_html to read the HTML tables on any HTML page. It returns a list of DataFrames, one per table, so index into the list to get the one you want.


    import pandas as pds
    pds_df = pds.read_html('C:/age0.html')
    pds_df[0]
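
When a page holds more than one table, it can help to list what read_html found before picking an index. The following is a rough sketch on a made-up two-table page (the `PAGE` string and its contents are assumptions, not the original file). It leaves `flavor` unset, which, per the pandas documentation, tries lxml first and falls back to bs4 with html5lib, and it uses the `match` parameter to keep only tables whose text matches a pattern.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables, standing in for a real downloaded file.
PAGE = """<html><body>
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Oslo</td><td>700000</td></tr>
</table>
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Ann</td><td>30</td></tr>
</table>
</body></html>"""

# Without a flavor argument, pandas chooses the parser itself.
all_tables = pd.read_html(StringIO(PAGE))
for index, frame in enumerate(all_tables):
    print(index, frame.shape, list(frame.columns))

# match keeps only the tables that contain text matching the given pattern.
age_tables = pd.read_html(StringIO(PAGE), match="age")
age_df = age_tables[0]
print(age_df)
```

Selecting by `match` is usually less brittle than a hard-coded index such as `[0]`, because the position of a table can change if the page layout does.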
    


I hope this helps.

Good luck!