Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/40855030/

AssertionError: 22 columns passed, passed data had 21 columns

python, pandas

Asked by Aditya Gade

This is my code:

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.basketball-reference.com/draft/NBA_2014.html"
html = urlopen(url)
soup = BeautifulSoup(html)
column_headers = [th.getText() for th in soup.findAll('tr',limit=2)[1].findAll('th')]
data_rows = soup.findAll('tr')[2:]
player_data = [[td.getText() for td in data_rows[i].findAll('td')] for i in range(len(data_rows))] #PLAYER DATA 

type(soup)
type(data_rows)

df = pd.DataFrame(player_data,columns=column_headers)

The error seems to occur in the last line.

Answered by marts

First of all, the error is pretty straightforward: your column_headers list has 22 columns, but the player_data entries only have 21. So you need to find out which column is missing and why. Just by visually comparing the entries from the data with the headers list, it appears that one of the first two columns is missing. player_data[0] returns

1, CLE, Andrew Wiggins, University of Kansas,...

but it should be

1, 1, CLE, Andrew Wiggins, University of Kansas,...
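
If eyeballing the rows is not convincing, a quick length check can confirm the mismatch before pandas raises. The following is only a minimal sketch: the short lists are made-up stand-ins for the scraped column_headers and player_data (the header names "Pk", "Tm" and "Player" and the second row are assumptions for illustration), not the real 22-column data.

```python
from collections import Counter

# Hypothetical stand-ins for the scraped values: 4 headers, but 3 cells per row.
column_headers = ["Rk", "Pk", "Tm", "Player"]
player_data = [
    ["1", "CLE", "Andrew Wiggins"],
    ["2", "MIL", "Jabari Parker"],
]

# Count how many rows have each length, then compare with the header count.
row_lengths = Counter(len(row) for row in player_data)
print("headers:", len(column_headers))
print("row lengths:", dict(row_lengths))

# Rows whose length differs from the header count are the ones pandas rejects.
mismatched = [
    i for i, row in enumerate(player_data) if len(row) != len(column_headers)
]
print("mismatched row indices:", mismatched)
```

If every row is short by the same amount, as here, the cause is most likely systematic (a cell type that the scraper never collects) rather than a few malformed rows.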

The problem is the table itself. Navigate to the website, hover over the table and right-click: inspect.

The first row of data (underneath the 'Rk' header) consists of 21 td elements and 1 th element. The "Rk" entry is actually a th, not a td:

Screenshot of table of provided data
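
The same thing can be checked from Python instead of the browser inspector. This is a rough sketch on a made-up, three-column table that only mimics the structure described above (a th as the first cell of the data row); it is not the real page's HTML, and it assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Made-up table that mimics the structure described above:
# the first cell of the data row is a <th>, the rest are <td>.
html = """
<table>
  <tr><th>Rk</th><th>Pk</th><th>Tm</th></tr>
  <tr><th>1</th><td>1</td><td>CLE</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
row = soup.find_all("tr")[1]

# Tag name of every th/td cell in the row, in document order.
cell_tags = [cell.name for cell in row.find_all(["th", "td"])]
print(cell_tags)

# Collecting only <td> cells silently drops the leading <th>.
td_only = [td.get_text() for td in row.find_all("td")]
print(td_only)
```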

That is why your

player_data = [[td.getText() for td in data_rows[i].findAll('td')] for i in range(len(data_rows))] 

skips the first column: it only iterates over td elements, hence the different length. I don't know how important the first column is; a quick fix would be to drop the Rk column from your headers list.
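
As a sketch of that quick fix, assuming Rk is the first entry in the headers list: slice it off so the header count matches the row length. The short lists below are hypothetical stand-ins for the scraped data.

```python
import pandas as pd

# Hypothetical stand-ins: 4 headers, rows that are missing the leading "Rk" cell.
column_headers = ["Rk", "Pk", "Tm", "Player"]
player_data = [
    ["1", "CLE", "Andrew Wiggins"],
    ["2", "MIL", "Jabari Parker"],
]

# Drop the first header ("Rk") so headers and rows have the same length.
headers_without_rk = column_headers[1:]
df = pd.DataFrame(player_data, columns=headers_without_rk)
print(df)
```

This loses the Rk values entirely, so it only makes sense if that column is not needed.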

Alternatively, search for both td and th elements:

player_data = [[td.getText() for td in data_rows[i].findAll(['td','th'])] for i in range(len(data_rows))]
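
To see the effect end to end without hitting the network, here is a small sketch on a made-up table with the same th-first layout. The HTML string, the four header names and the "html.parser" choice are assumptions for illustration; find_all and get_text are the current spellings of findAll and getText.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table with the same layout as described above: each data row
# starts with a <th> cell followed by <td> cells.
html = """
<table>
  <tr><th>Rk</th><th>Pk</th><th>Tm</th><th>Player</th></tr>
  <tr><th>1</th><td>1</td><td>CLE</td><td>Andrew Wiggins</td></tr>
  <tr><th>2</th><td>2</td><td>MIL</td><td>Jabari Parker</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

column_headers = [th.get_text() for th in rows[0].find_all("th")]

# Passing a list of tag names matches either tag, in document order,
# so the leading <th> cell is kept in each row.
player_data = [
    [cell.get_text() for cell in row.find_all(["td", "th"])]
    for row in rows[1:]
]

df = pd.DataFrame(player_data, columns=column_headers)
print(df)
```

With both tag names collected, every row has as many cells as there are headers, so the DataFrame constructor no longer raises.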