python BeautifulSoup parsing table

Disclaimer: this page is a Chinese-English mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/23377533/
Asked by Cmag
I'm learning python requests and BeautifulSoup. For an exercise, I've chosen to write a quick NYC parking ticket parser. I am able to get an html response, which is quite ugly. I need to grab the lineItemsTable and parse all the tickets.
You can reproduce the page by going here: https://paydirect.link2gov.com/NYCParking-Plate/ItemSearch and entering a NY plate T630134C.
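For reference, a minimal sketch of how the plateRequest response used below might be obtained with requests; the form field names are placeholders, not taken from the actual ItemSearch form:

import requests
from bs4 import BeautifulSoup

# Hypothetical form submission: "plateNumber" and "state" are assumed field names.
plateRequest = requests.post(
    "https://paydirect.link2gov.com/NYCParking-Plate/ItemSearch",
    data={"plateNumber": "T630134C", "state": "NY"},
)
soup = BeautifulSoup(plateRequest.text, "html.parser")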
soup = BeautifulSoup(plateRequest.text)
#print(soup.prettify())
#print soup.find_all('tr')

table = soup.find("table", { "class" : "lineItemsTable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    print cells
Can someone please help me out? Simply looking for all tr does not get me anywhere.
Accepted answer by shaktimaan
Here you go:
data = []
table = soup.find('table', attrs={'class':'lineItemsTable'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
This gives you:
[ [u'1359711259', u'SRF', u'08/05/2013', u'5310 4 AVE', u'K', u'19', u'125.00', u'$'],
  [u'7086775850', u'PAS', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'125.00', u'$'],
  [u'7355010165', u'OMT', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'145.00', u'$'],
  [u'4002488755', u'OMT', u'02/12/2014', u'NB 1ST AVE @ E 23RD ST', u'5', u'115.00', u'$'],
  [u'7913806837', u'OMT', u'03/03/2014', u'5015 4th Ave', u'K', u'46', u'115.00', u'$'],
  [u'5080015366', u'OMT', u'03/10/2014', u'EB 65TH ST @ 16TH AV E', u'7', u'50.00', u'$'],
  [u'7208770670', u'OMT', u'04/08/2014', u'333 15th St', u'K', u'70', u'65.00', u'$'],
  [u'.00\n\n\nPayment Amount:']
]
Couple of things to note:
- The last row in the output above, the Payment Amount, is not part of the ticket data; that is just how the table is laid out. You can filter it out by checking whether the length of the list is less than 7.
- The last column of every row has to be handled separately, since it is an input text box (both points are sketched right after this list).
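A minimal sketch of both points, reusing the data and table_body from the snippet above; treating the last column as an <input> whose value attribute holds the amount is an assumption about the page markup, not something stated in the answer:

# Drop the trailing "Payment Amount" row, which ends up with fewer than 7 cells.
tickets = [row for row in data if len(row) >= 7]

# Assumed markup: the payment-amount cell contains an <input> tag whose
# "value" attribute holds the amount typed into the box.
for tr in table_body.find_all('tr'):
    box = tr.find('input')
    if box is not None:
        print(box.get('value'))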
Answered by Cmag

Solved, this is how you parse their html results:

table = soup.find("table", { "class" : "lineItemsTable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 9:
        summons = cells[1].find(text=True)
        plateType = cells[2].find(text=True)
        vDate = cells[3].find(text=True)
        location = cells[4].find(text=True)
        borough = cells[5].find(text=True)
        vCode = cells[6].find(text=True)
        amount = cells[7].find(text=True)
        print amount
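A small extension of the loop above (my sketch, not part of the original answer) that collects the same fields into a list of dicts instead of printing them:

tickets = []
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 9:
        tickets.append({
            "summons": cells[1].find(text=True),
            "plate_type": cells[2].find(text=True),
            "violation_date": cells[3].find(text=True),
            "location": cells[4].find(text=True),
            "borough": cells[5].find(text=True),
            "violation_code": cells[6].find(text=True),
            "amount": cells[7].find(text=True),
        })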
Answered by eusoubrasileiro

Here is a working example for a generic <table>. (question links broken)

Extracting the table from here: countries by GDP (Gross Domestic Product).

htmltable = soup.find('table', { 'class' : 'table table-striped' })
# where the dictionary specifies unique attributes for the 'table' tag
def tableDataText(table):
    rows = []
    trs = table.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')]  # header row
    if headerow:  # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs:  # for every table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')])  # data row
    return rows
The tableDataText function parses an html segment starting with the tag <table>, followed by multiple <tr> (table row) and inner <td> (table data) tags. It returns a list of rows with their inner columns. It accepts only one <th> (table header) row, in the first row.
list_table = tableDataText(htmltable)
list_table[:2]
[['Rank',
'Name',
"GDP (IMF '19)",
"GDP (UN '16)",
'GDP Per Capita',
'2019 Population'],
['1',
'United States',
'21.41 trillion',
'18.62 trillion',
',064',
'329,064,917']]
Using it we get the first two rows shown above. That can easily be transformed into a pandas.DataFrame for more advanced tools.

import pandas as pd
dftable = pd.DataFrame(list_table[1:], columns=list_table[0])
dftable.head(4)
Answered by Poudel
Update: 2020

If a programmer is interested only in parsing a table from a webpage, they can use the pandas method pandas.read_html.

Let's say we want to extract the GDP data table from this website: https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries

Then the following code does the job perfectly (no need for BeautifulSoup or fancy html parsing):
import pandas as pd
import requests

url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"
r = requests.get(url)
df_list = pd.read_html(r.text)  # this parses all the tables in webpages to a list
df = df_list[0]
df.head()
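As an aside (not from the original answer), read_html can also select tables by matching text, which avoids guessing the right index in df_list; the "GDP" pattern below is an assumption about the target table's contents:

import pandas as pd

url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"
# Keep only tables whose text matches the given pattern (assumed to identify the GDP table).
tables = pd.read_html(url, match="GDP")
df = tables[0]
df.head()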