Notice: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/50633050/

Date: 2020-09-08 15:50:43  Source: igfitidea

Scrape tables into dataframe with BeautifulSoup

Tags: pandas, dataframe, web-scraping, beautifulsoup

Asked by Alex

I'm trying to scrape the data from the coins catalog.

Here is one of the pages. I need to scrape this data into a DataFrame.

So far I have this code:

import bs4 as bs
import urllib.request
import pandas as pd

source = urllib.request.urlopen('http://www.gcoins.net/en/catalog/view/45518').read()
soup = bs.BeautifulSoup(source,'lxml')

table = soup.find('table', attrs={'class':'subs noBorders evenRows'})
table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    print(row)                    # I need to save this data instead of printing it 

It produces the following output:

[]
['', '', '1882', '', '108,000', 'UNC', '—']
[' ', '', '1883', '', '786,000', 'UNC', '~ .99']
[' ', " \n\n\n\n\t\t\t\t\t\t\t$('subGraph55337').on('click', function(event) {\n\t\t\t\t\t\t\t\tLightview.show({\n\t\t\t\t\t\t\t\t\thref : '/en/catalog/ajax/subgraph?id=55337',\n\t\t\t\t\t\t\t\t\trel : 'ajax',\n\t\t\t\t\t\t\t\t\toptions : {\n\t\t\t\t\t\t\t\t\t\tautosize : true,\n\t\t\t\t\t\t\t\t\t\ttopclose : true,\n\t\t\t\t\t\t\t\t\t\tajax : {\n\t\t\t\t\t\t\t\t\t\t\tevalScripts : true\n\t\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\t} \n\t\t\t\t\t\t\t\t});\n\t\t\t\t\t\t\t\tevent.stop();\n\t\t\t\t\t\t\t\treturn false;\n\t\t\t\t\t\t\t});\n\t\t\t\t\t\t", '1884', '', '4,604,000', 'UNC', '~ .08–.47']
[' ', '', '1885', '', '1,314,000', 'UNC', '~ .20']
['', '', '1886', '', '444,000', 'UNC', '—']
[' ', '', '1888', '', '413,000', 'UNC', '~ .88']
[' ', '', '1889', '', '568,000', 'UNC', '~ .56']
[' ', " \n\n\n\n\t\t\t\t\t\t\t$('subGraph55342').on('click', function(event) {\n\t\t\t\t\t\t\t\tLightview.show({\n\t\t\t\t\t\t\t\t\thref : '/en/catalog/ajax/subgraph?id=55342',\n\t\t\t\t\t\t\t\t\trel : 'ajax',\n\t\t\t\t\t\t\t\t\toptions : {\n\t\t\t\t\t\t\t\t\t\tautosize : true,\n\t\t\t\t\t\t\t\t\t\ttopclose : true,\n\t\t\t\t\t\t\t\t\t\tajax : {\n\t\t\t\t\t\t\t\t\t\t\tevalScripts : true\n\t\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\t} \n\t\t\t\t\t\t\t\t});\n\t\t\t\t\t\t\t\tevent.stop();\n\t\t\t\t\t\t\t\treturn false;\n\t\t\t\t\t\t\t});\n\t\t\t\t\t\t", '1890', '', '2,137,000', 'UNC', '~ .28–.79']
['', '', '1891', '', '605,000', 'UNC', '—']
[' ', '', '1892', '', '205,000', 'UNC', '~ .47']
[' ', '', '1893', '', '754,000', 'UNC', '~ .79']
[' ', '', '1894', '', '532,000', 'UNC', '~ .20']
[' ', '', '1895', '', '423,000', 'UNC', '~ .40']
['', '', '1896', '', '174,000', 'UNC', '—']

But when I try to save it to a DataFrame and export it to Excel, the result contains only the last row:

         0
0         
1         
2     1896
3         
4  174,000
5      UNC
6        —

Answered by phi

Try this:

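rows = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text for cell in td]
    rows.append(row)  # collect every row instead of overwriting the previous one
# build the DataFrame once, after the loop
df = pd.DataFrame(rows, columns=["A", "B", ...])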
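
The question does not show how the DataFrame was built, but the single-column output is what you would get from passing only the final row to pd.DataFrame after the loop. Here is a minimal, self-contained sketch of the accumulate-then-build pattern. The HTML snippet and the column names are made up for illustration and only mimic the catalog's layout.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the catalog page; the real markup may differ.
html = """
<table class="subs noBorders evenRows">
  <tr><th>Year</th><th>Mintage</th><th>Quality</th></tr>
  <tr><td>1882</td><td>108,000</td><td>UNC</td></tr>
  <tr><td>1883</td><td>786,000</td><td>UNC</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "subs noBorders evenRows"})

rows = []
for tr in table.find_all("tr"):
    cells = [td.text for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

# Build the DataFrame once, after the loop, from all collected rows.
df = pd.DataFrame(rows, columns=["Year", "Mintage", "Quality"])
print(df)
```

Every value is still a string at this point (for example '108,000'), so numeric columns would need converting before doing arithmetic on them.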

Answered by ttfreeman

pandas already has a built-in function, pd.read_html, that converts HTML tables into DataFrames:

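from io import StringIO
tables = soup.find_all('table')
# read_html returns a list of DataFrames, one per <table>; take the first
df = pd.read_html(StringIO(str(tables)))[0]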
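
As a minimal, self-contained sketch of this approach: the HTML snippet below is made up for illustration and is not the real catalog markup. Note that pd.read_html needs a parser backend installed (lxml, or bs4 together with html5lib).

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page; not the real catalog markup.
html = """
<table class="subs">
  <tr><th>Year</th><th>Mintage</th><th>Quality</th></tr>
  <tr><td>1882</td><td>108,000</td><td>UNC</td></tr>
  <tr><td>1883</td><td>786,000</td><td>UNC</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

Wrapping the markup in StringIO avoids the deprecation warning that recent pandas versions raise for literal HTML strings. By default read_html treats ',' as a thousands separator (its thousands parameter), so a value like 108,000 is parsed as a number rather than kept as text.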

Answered by Rakesh

Try:

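import pandas as pd
from bs4 import BeautifulSoup

# html is the page source, e.g. the bytes read with urllib in the question
soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'class':'subs noBorders evenRows'})
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    td = tr.find_all('td')
    # keep only the non-empty cell texts, stripped of surrounding whitespace
    row = [tr.text.strip() for tr in td if tr.text.strip()]
    if row:  # skip rows that have no text at all (e.g. the header row)
        res.append(row)

df = pd.DataFrame(res, columns=["Year", "Mintage", "Quality", "Price"])
print(df)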

Output:

   Year  Mintage Quality    Price
0  1882  108,000     UNC        —
1  1883  786,000     UNC  ~ .03

Answered by karina275

Just a heads-up: this part of Rakesh's code means that only HTML rows containing text are included in the DataFrame, because a row is not appended when row is an empty list:

if row:
    res.append(row)

This was a problem in my use case, where I wanted to compare row indices between the HTML table and the DataFrame later on. I just needed to change it to an unconditional append:

res.append(row)

Also, if a cell in the row is empty, it is not included, which shifts the remaining values into the wrong columns. So I changed

row = [tr.text.strip() for tr in td if tr.text.strip()]

to

row = [d.text.strip() for d in td]

But, otherwise, it's working for me. Thanks :)
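
To illustrate the column shift described above, here is a minimal sketch that compares the two list comprehensions on a single row. The row is made up for illustration, with an empty second cell, and is not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical row with an empty second cell; not taken from the real page.
html = "<table><tr><td>1882</td><td> </td><td>108,000</td><td>UNC</td></tr></table>"
td = BeautifulSoup(html, "html.parser").find_all("td")

filtered = [d.text.strip() for d in td if d.text.strip()]  # drops empty cells
aligned = [d.text.strip() for d in td]                     # keeps one entry per cell

print(filtered)  # ['1882', '108,000', 'UNC']
print(aligned)   # ['1882', '', '108,000', 'UNC']
```

The filtered list has three items for a four-cell row, so '108,000' would land in the second column. The aligned list keeps an empty string as a placeholder, so each value stays in its original column.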
