pandas: What should I do when <tr> has rowspan

Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/28763891/

Time: 2020-09-13 22:59:35  Source: igfitidea

What should I do when <tr> has rowspan

python html pandas beautifulsoup

Asked by Divya Jose

If a row contains a cell with a rowspan attribute, how do I make the parsed rows line up with the table as it appears on the Wikipedia page?

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
req = Request(wiki, headers=header)
page = urlopen(req)
soup = BeautifulSoup(page, 'html.parser')

# Indexing an empty or too-short result list raises IndexError
try:
    table = soup.find_all('table')[6]
except IndexError:
    raise SystemExit('No tables found, exiting')

try:
    first = table.find_all('tr')[0]
except IndexError:
    raise SystemExit('No table row found, exiting')

# Slicing never raises, so no try/except is needed here
allRows = table.find_all('tr')[1:-1]


headers = [cell.get_text() for cell in first.find_all(['th', 'td'])]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]


df = pd.DataFrame(data=results, columns=headers)
df


This gives me the table as output. But for tables where a row contains a cell with rowspan, I get a misaligned table like the following: (image from the original post, not reproduced here)

Accepted answer by Vivek Sable

The problem comes from the following case:

HTML content:

<tr>
     <td rowspan="2">2=</td>
     <td>West Indies</td>
     <td>4</td>
     <td>Lord's</td>
     <td>2009</td>
</tr>
<tr>
     <td style="text-align:left;">India</td>
     <td>4</td>
     <td>Mumbai</td>
      <td>2012</td>
</tr>

So when a td has a rowspan attribute, treat the same td value as repeated in the following tr tags at the same level; the rowspan value tells how many tr tags the cell covers, counting the one it appears in.

  1. Collect all rowspan information and save it in a variable. For each such cell, save the sequence number of the tr tag, the sequence number of the td tag within it, the rowspan value (i.e. how many tr tags share the same td), and the text value of the td.
  2. Update the result rows for every affected tr using that information.
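
The two steps above can be sketched without any network access. This is only a minimal illustration, not the answer's own code: it assumes the cells have already been pulled out of the HTML as (text, rowspan) tuples, and the name fill_rowspans is hypothetical.

```python
def fill_rowspans(rows):
    """rows: list of rows; each row is a list of (text, rowspan) tuples."""
    # Plain text grid, before any rowspan values are copied down
    results = [[text for text, _ in row] for row in rows]

    # Step 1: record (tr index, td index, rowspan, text) for every spanning cell
    spans = [(r, c, span, text)
             for r, row in enumerate(rows)
             for c, (text, span) in enumerate(row)
             if span > 1]

    # Step 2: insert the text into each following row covered by the span
    for r, c, span, text in spans:
        for offset in range(1, span):
            if r + offset < len(results):  # ignore spans that run past the table
                results[r + offset].insert(c, text)
    return results


# The two rows from the HTML snippet above, with rowspan defaulting to 1
rows = [
    [("2=", 2), ("West Indies", 1), ("4", 1), ("Lord's", 1), ("2009", 1)],
    [("India", 1), ("4", 1), ("Mumbai", 1), ("2012", 1)],
]
filled = fill_rowspans(rows)
print(filled[1])  # ['2=', 'India', '4', 'Mumbai', '2012']
```

Note that the stored td index is the cell's position inside its own tr, not its column in the rendered table, so this approach may misplace values when several spans overlap.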

Note: this was checked only against the given test case. More test cases still need to be checked.

code:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd


wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
req = Request(wiki, headers=header)
page = urlopen(req)

soup = BeautifulSoup(page, 'html.parser')

table = soup.find_all('table')[6]

tmp = table.find_all('tr')

first = tmp[0]
allRows = tmp[1:-1]


headers = [th.get_text() for th in first.find_all('th')]

results = [[data.get_text() for data in row.find_all('td')] for row in allRows]

# <td rowspan="2">2=</td>
# list of tuples (index of tr, index of td, rowspan count, text value)
# e.g.
# [(1, 0, 2, '2=')]
# (<tr> is 1, td position in tr is 0, repeated 2 times, value is 2=)
rowspan = []

for no, tr in enumerate(allRows):
    for td_no, data in enumerate(tr.find_all('td')):
        if data.has_attr("rowspan"):
            rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))


for i in rowspan:
    # the tr that declares the rowspan already holds the value in results
    for j in range(1, i[2]):
        # add the value to each following tr covered by the span
        results[i[0]+j].insert(i[1], i[3])


df = pd.DataFrame(data=results, columns=headers)
print(df)

output:

  Rank       Opponent No. wins Most recent venue Season
0    1   South Africa        6            Lord's   1951
1   2=    West Indies        4            Lord's   2009
2   2=          India        4            Mumbai   2012
3    4      Australia        3            Sydney   1932
4    5       Pakistan        2      Trent Bridge   1967
5    6      Sri Lanka        1      Old Trafford   2002


It also works for table 10:

  Rank Hundreds            Player Matches Innings Average
0    1       25     Alastair Cook     107     191   45.61
1    2       23   Kevin Pietersen     104     181   47.28
2    3       22     Colin Cowdrey     114     188   44.07
3    3       22     Wally Hammond      85     140   58.46
4    3       22  Geoffrey Boycott     108     193   47.72
5    6       21    Andrew Strauss     100     178   40.91
6    6       21          Ian Bell     103     178   45.30
7   8=       20    Ken Barrington      82     131   58.67
8   8=       20      Graham Gooch     118     215   42.58
9   10       19        Len Hutton      79     138   56.67

Answer by Gene Burinsky

None of the parsers found across stackoverflow or across the web worked for me - they all parsed my tables from Wikipedia incorrectly. So here you go, a parser that actually works and is simple. Cheers.

Define the parser functions:

import numpy as np
import pandas as pd


def pre_process_table(table):
    """
    INPUT:
        1. table - a bs4 element that contains the desired table: ie <table> ... </table>
    OUTPUT:
        a tuple of: 
            1. rows - a list of table rows ie: list of <tr>...</tr> elements
            2. num_rows - number of rows in the table
            3. num_cols - number of columns in the table
    Options:
        include_td_head_count - whether to use only th or th and td to count number of columns (default: False)
    """
    rows = [x for x in table.find_all('tr')]

    num_rows = len(rows)

    # get an initial column count. Most often, this will be accurate
    num_cols = max([len(x.find_all(['th','td'])) for x in rows])

    # sometimes, the tables also contain multi-colspan headers. This accounts for that:
    header_rows_set = [x.find_all(['th', 'td']) for x in rows if len(x.find_all(['th', 'td']))>num_cols/2]

    num_cols_set = []

    for header_rows in header_rows_set:
        num_cols = 0
        for cell in header_rows:
            row_span, col_span = get_spans(cell)
            num_cols+=len([cell.getText()]*col_span)

        num_cols_set.append(num_cols)

    num_cols = max(num_cols_set)

    return (rows, num_rows, num_cols)


def get_spans(cell):
    """
    INPUT:
        1. cell - a <td>...</td> or <th>...</th> element that contains a table cell entry
    OUTPUT:
        1. a tuple with the cell's row and col spans
    """
    if cell.has_attr('rowspan'):
        rep_row = int(cell.attrs['rowspan'])
    else:  # no rowspan attribute: the cell covers one row
        rep_row = 1
    if cell.has_attr('colspan'):
        rep_col = int(cell.attrs['colspan'])
    else:  # no colspan attribute: the cell covers one column
        rep_col = 1

    return (rep_row, rep_col)

def process_rows(rows, num_rows, num_cols):
    """
    INPUT:
        1. rows - a list of table rows ie <tr>...</tr> elements
    OUTPUT:
        1. data - a Pandas dataframe with the html data in it
    """
    # object dtype so that cell text can be written into the NaN-filled frame
    data = pd.DataFrame(np.full((num_rows, num_cols), np.nan)).astype(object)
    for i, row in enumerate(rows):
        try:
            col_stat = data.iloc[i,:][data.iloc[i,:].isnull()].index[0]
        except IndexError:
            print(i, row)

        for j, cell in enumerate(row.find_all(['td', 'th'])):
            rep_row, rep_col = get_spans(cell)

            #print("cols {0} to {1} with rep_col={2}".format(col_stat, col_stat+rep_col, rep_col))
            #print("\trows {0} to {1} with rep_row={2}".format(i, i+rep_row, rep_row))

            #find first non-na col and fill that one
            while any(data.iloc[i,col_stat:col_stat+rep_col].notnull()):
                col_stat+=1

            data.iloc[i:i+rep_row,col_stat:col_stat+rep_col] = cell.getText()
            if col_stat<data.shape[1]-1:
                col_stat+=rep_col

    return data

def main(table):
    rows, num_rows, num_cols = pre_process_table(table)
    df = process_rows(rows, num_rows, num_cols)
    return(df)

Here's an example of how one would use the above code on this Wisconsin data. Suppose it's already in a bs4 soup; then:

## Find tables on the page and locate the desired one:
tables = soup.findAll("table", class_='wikitable')

## I want table 3 or the one that contains years 2000-2018
table = tables[3]

## run the above functions to extract the data
rows, num_rows, num_cols = pre_process_table(table)
df = process_rows(rows, num_rows, num_cols)

The parser above accurately parses tables such as the ones here, while the other parsers I tried failed to recreate the tables at numerous points.

For simple cases, a simpler solution

There may be a simpler solution to the above issue if the table is reasonably well formatted and uses rowspan attributes. Pandas has a fairly robust read_html function that can parse the provided HTML tables and seems to handle rowspan fairly well (although it couldn't parse the Wisconsin tables). fillna(method='ffill') can then populate the unpopulated rows. Note that this does not necessarily work across column spans. Also note that cleanup will be necessary afterwards.

Consider the html code:

    s = """<table width="100%" border="1">
    <tr>
        <td rowspan="1">one</td>
        <td rowspan="2">two</td>
        <td rowspan="3">three</td>
    </tr>
    <tr><td>"4"</td></tr>
    <tr>
        <td>"55"</td>
        <td>"99"</td>
    </tr>
    </table>
    """

To process it into the requested output, just do:

In [16]: df = pd.read_html(s)[0]

In [29]: df
Out[29]:
      0     1      2
0   one   two  three
1   "4"   NaN    NaN
2  "55"  "99"    NaN

Then, to fill the NAs:

In [30]: df.fillna(method='ffill')
Out[30]:
      0     1      2
0   one   two  three
1   "4"   two  three
2  "55"  "99"  three
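
In newer pandas releases, fillna(method='ffill') is deprecated in favour of DataFrame.ffill(). The following is a small sketch of the same forward fill, assuming a frame built by hand to match the read_html result shown above (the literal frame is illustrative, not parsed from HTML).

```python
import pandas as pd

# Hand-built frame with the same shape as the read_html result above;
# None marks the cells that the rowspan left empty.
df = pd.DataFrame([
    ["one", "two", "three"],
    ['"4"', None, None],
    ['"55"', '"99"', None],
])

# Forward-fill each column with the last value seen in the rows above
filled = df.ffill()
print(filled)
```

Keep in mind that a forward fill cannot tell a rowspan gap from a cell that was genuinely empty in the source table, so it may fill more than intended.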

Answer by joelostblom

pandas >= 0.24.0 understands colspan and rowspan attributes, as documented in the release notes. To extract the wikipage table that was giving you issues previously, the following works:

import pandas as pd


# Extract all tables from the wikipage
dfs = pd.read_html("http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records")
# The table referenced above is the 7th on the wikipage
df = dfs[6]
# The last row is just the date of the last update
df = df.iloc[:-1]

Out:

   Rank  Victories    Opposition                                 Most recent venue              Date
0     1          6  South Africa                           Lord's, London, England      21 June 1951
1    =2          4         India                   Wankhede Stadium, Mumbai, India  23 November 2012
2    =2          4   West Indies                           Lord's, London, England        6 May 2009
3     4          3     Australia          Sydney Cricket Ground, Sydney, Australia   2 December 1932
4     5          2      Pakistan                 Trent Bridge, Nottingham, England    10 August 1967
5     6          1     Sri Lanka  Old Trafford Cricket Ground, Manchester, England      13 June 2002

Answer by u6805334

input:

<html>
<body>

<table width="100%" border="1">
    <tr>
        <td rowspan="2">one</td>
        <td>two</td>
        <td>three</td>
    </tr>
    <tr>
        <td colspan="2">February</td>
    </tr>
</table>

</body>
</html>

output:

one  two         three
one  February    February

python code:

#!/usr/bin/env python3
# coding: utf-8
from bs4 import BeautifulSoup


class Element(object):
    def __init__(self, row, col, text, rowspan=1, colspan=1):
        self.row = row
        self.col = col
        self.text = text
        self.rowspan = rowspan
        self.colspan = colspan

    def __repr__(self):
        return f'''{{"row": {self.row}, "col":  {self.col}, "text": {self.text}, "rowspan": {self.rowspan}, "colspan": {self.colspan}}}'''

    def isRowspan(self):
        return self.rowspan > 1

    def isColspan(self):
        return self.colspan > 1


def parse(h) -> [[]]:
    doc = BeautifulSoup(h, 'html.parser')

    trs = doc.select('tr')

    m = []

    for row, tr in enumerate(trs):  # collect Node, rowspan node, colspan node
        it = []
        ts = tr.find_all(['th', 'td'])
        for col, tx in enumerate(ts):
            element = Element(row, col, tx.text.strip())
            if tx.has_attr('rowspan'):
                element.rowspan = int(tx['rowspan'])
            if tx.has_attr('colspan'):
                element.colspan = int(tx['colspan'])
            it.append(element)
        m.append(it)

    def solveColspan(ele):
        row, col, text, rowspan, colspan = ele.row, ele.col, ele.text, ele.rowspan, ele.colspan
        m[row].insert(col + 1, Element(row, col, text, rowspan, colspan - 1))
        for column in range(col + 1, len(m[row])):
            m[row][column].col += 1

    def solveRowspan(ele):
        row, col, text, rowspan, colspan = ele.row, ele.col, ele.text, ele.rowspan, ele.colspan
        offset = row + 1
        m[offset].insert(col, Element(offset, col, text, rowspan - 1, 1))
        for column in range(col + 1, len(m[offset])):
            m[offset][column].col += 1

    for row in m:
        for ele in row:
            if ele.isColspan():
                solveColspan(ele)
            if ele.isRowspan():
                solveRowspan(ele)
    return m


def prettyPrint(m):
    for i in m:
        it = [f'{len(i)}']
        for index, j in enumerate(i):
            if j.text != '':
                it.append(f'{index:2} {j.text[:4]:4}')
        print(' --- '.join(it))


with open('./index.html', 'rb') as f:
    index = f.read()
html = index.decode('utf-8')
matrix = parse(html)
prettyPrint(matrix)
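
As a compact cross-check of the input/output pair at the start of this answer, the same expansion can be sketched as a grid fill. This is a different technique from the insert-and-shift code above, shown only for comparison: it assumes the cells were already extracted as (text, rowspan, colspan) tuples, and the name expand_spans is hypothetical.

```python
def expand_spans(rows):
    """rows: list of rows; each row is a list of (text, rowspan, colspan)."""
    grid = {}  # maps (row, col) -> text for every slot a cell covers
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row
            while (r, c) in grid:
                c += 1
            # Copy the text into every slot covered by this cell
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Slots that no cell covers are left as empty strings
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# The two-row table from the input above
table = [
    [("one", 2, 1), ("two", 1, 1), ("three", 1, 1)],
    [("February", 1, 2)],
]
matrix = expand_spans(table)
for line in matrix:
    print(line)
# ['one', 'two', 'three']
# ['one', 'February', 'February']
```

Tracking occupied slots by grid position, rather than by a cell's index within its tr, is what lets a later row start in the correct column.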