Note: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38486477/

Date: 2020-09-14 01:37:41  Source: igfitidea

Get HTML table into pandas Dataframe, not list of dataframe objects

Tags: python, pandas, dataframe, html-parsing

Asked by schaefferda

I apologize if this question has been answered elsewhere but I have been unsuccessful in finding a satisfactory answer here or elsewhere.

I am somewhat new to Python and pandas and am having some difficulty getting HTML data into a pandas DataFrame. The pandas documentation says .read_html() returns a list of DataFrame objects, so when I try to do some data manipulation to get rid of some of the samples, I get an error.

Here is my code to read the HTML:

import pandas as pd
df = pd.read_html('http://espn.go.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', header=1)

Then I try to clean it up:

df = df.dropna(axis=0, thresh=4)

And I received the following error:

Traceback (most recent call last):
  File "module4.py", line 25, in <module>
    df = df.dropna(axis=0, thresh=4)
AttributeError: 'list' object has no attribute 'dropna'

How do I get this data into an actual dataframe, similar to what .read_csv() does?

Answered by Laurent S

From http://pandas.pydata.org/pandas-docs/version/0.17.1/io.html#io-read-html, "read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content".

So df = df[0].dropna(axis=0, thresh=4) should do what you want.
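
A minimal, self-contained sketch of that fix. It assumes a small inline HTML snippet in place of the ESPN page (the table contents and the names `html` and `tables` are made up for illustration) and that an HTML parser such as lxml is installed for `pd.read_html` to use:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the ESPN page: one table with a mostly empty row.
html = """
<table>
  <thead>
    <tr><th>PLAYER</th><th>TEAM</th><th>GP</th><th>G</th><th>A</th></tr>
  </thead>
  <tbody>
    <tr><td>Player A</td><td>AAA</td><td>82</td><td>30</td><td>50</td></tr>
    <tr><td>Player B</td><td></td><td></td><td></td><td></td></tr>
    <tr><td>Player C</td><td>CCC</td><td>80</td><td>25</td><td>40</td></tr>
  </tbody>
</table>
"""

# read_html returns a list, even though the snippet holds a single table.
tables = pd.read_html(StringIO(html))

# Index into the list to get the actual DataFrame.
df = tables[0]

# Keep only rows that have at least 4 non-missing values.
df = df.dropna(axis=0, thresh=4)
```

Calling `dropna` on `tables` itself would raise the same AttributeError as in the question, because a Python list has no such method.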

Answered by MiLe

pd.read_html returns a list of DataFrames, one per table it finds, so you need to index into the list to get the pandas DataFrame itself, i.e.

df = pd.read_html(url) ###<-- List

df[0] ###<-- Pandas DataFrame
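
When a page holds several tables, the list has several elements, so `[0]` may not be the table you want. A rough sketch of two ways to end up with a single DataFrame, using hand-built DataFrames as a stand-in for the list `pd.read_html` would return (the data and the names `tables`, `largest`, and `combined` are illustrative assumptions, not from the original question):

```python
import pandas as pd

# Stand-in for the result of pd.read_html(url): a list of DataFrames.
tables = [
    pd.DataFrame({"PLAYER": ["Player A"], "PTS": [80]}),
    pd.DataFrame({"PLAYER": ["Player B", "Player C"], "PTS": [75, 65]}),
]

# Option 1: pick a single table, here the one with the most rows.
largest = max(tables, key=len)

# Option 2: stack tables that share the same columns into one DataFrame.
combined = pd.concat(tables, ignore_index=True)
```

`pd.read_html` also accepts a `match` argument that keeps only tables containing the given text, which can narrow the list before you index into it.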