pandas: Scraping a table into a DataFrame using BeautifulSoup

Question

Asked by Alex

I'm trying to scrape the data from the coins catalog.

Here is one of the pages. I need to scrape this data into a DataFrame.

So far I have this code:

import bs4 as bs
import urllib.request
import pandas as pd

source = urllib.request.urlopen('http://www.gcoins.net/en/catalog/view/45518').read()
soup = bs.BeautifulSoup(source,'lxml')

table = soup.find('table', attrs={'class':'subs noBorders evenRows'})
table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    print(row)                    # I need to save this data instead of printing it

It produces the following output:

[]
['', '', '1882', '', '108,000', 'UNC', '—']
[' ', '', '1883', '', '786,000', 'UNC', '~ .99']
[' ', " \n\n\n\n\t\t\t\t\t\t\t$('subGraph55337').on('click', function(event) {\n\t\t\t\t\t\t\t\tLightview.show({\n\t\t\t\t\t\t\t\t\thref : '/en/catalog/ajax/subgraph?id=55337',\n\t\t\t\t\t\t\t\t\trel : 'ajax',\n\t\t\t\t\t\t\t\t\toptions : {\n\t\t\t\t\t\t\t\t\t\tautosize : true,\n\t\t\t\t\t\t\t\t\t\ttopclose : true,\n\t\t\t\t\t\t\t\t\t\tajax : {\n\t\t\t\t\t\t\t\t\t\t\tevalScripts : true\n\t\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\t} \n\t\t\t\t\t\t\t\t});\n\t\t\t\t\t\t\t\tevent.stop();\n\t\t\t\t\t\t\t\treturn false;\n\t\t\t\t\t\t\t});\n\t\t\t\t\t\t", '1884', '', '4,604,000', 'UNC', '~ .08–.47']
[' ', '', '1885', '', '1,314,000', 'UNC', '~ .20']
['', '', '1886', '', '444,000', 'UNC', '—']
[' ', '', '1888', '', '413,000', 'UNC', '~ .88']
[' ', '', '1889', '', '568,000', 'UNC', '~ .56']
[' ', " \n\n\n\n\t\t\t\t\t\t\t$('subGraph55342').on('click', function(event) {\n\t\t\t\t\t\t\t\tLightview.show({\n\t\t\t\t\t\t\t\t\thref : '/en/catalog/ajax/subgraph?id=55342',\n\t\t\t\t\t\t\t\t\trel : 'ajax',\n\t\t\t\t\t\t\t\t\toptions : {\n\t\t\t\t\t\t\t\t\t\tautosize : true,\n\t\t\t\t\t\t\t\t\t\ttopclose : true,\n\t\t\t\t\t\t\t\t\t\tajax : {\n\t\t\t\t\t\t\t\t\t\t\tevalScripts : true\n\t\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\t} \n\t\t\t\t\t\t\t\t});\n\t\t\t\t\t\t\t\tevent.stop();\n\t\t\t\t\t\t\t\treturn false;\n\t\t\t\t\t\t\t});\n\t\t\t\t\t\t", '1890', '', '2,137,000', 'UNC', '~ .28–.79']
['', '', '1891', '', '605,000', 'UNC', '—']
[' ', '', '1892', '', '205,000', 'UNC', '~ .47']
[' ', '', '1893', '', '754,000', 'UNC', '~ .79']
[' ', '', '1894', '', '532,000', 'UNC', '~ .20']
[' ', '', '1895', '', '423,000', 'UNC', '~ .40']
['', '', '1896', '', '174,000', 'UNC', '—']

But when I try to save it to a DataFrame and export it to Excel, the result contains just the last value.

Answer 1

Answered by phi

Try this:

rows = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text for cell in td]   # text of every cell in this row
    rows.append(row)                   # collect the rows instead of printing them
df = pd.DataFrame(rows, columns=["A", "B", ...])
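
A likely reason only the last value survives is that the result gets overwritten on every pass through the loop (the asker's export code is not shown, so this is a guess). Here is a minimal, self-contained sketch of the collect-then-build pattern. The inline `html` string is a hypothetical stand-in for the catalog page, and the column names are assumptions made for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the catalog page; the real markup may differ.
html = """
<table class="subs noBorders evenRows">
  <tr><th>Year</th><th>Mintage</th><th>Quality</th></tr>
  <tr><td>1882</td><td>108,000</td><td>UNC</td></tr>
  <tr><td>1883</td><td>786,000</td><td>UNC</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "subs noBorders evenRows"})

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row holds only <th> cells, so it yields an empty list
        rows.append(cells)

# Build the DataFrame once, after the loop, from all collected rows.
df = pd.DataFrame(rows, columns=["Year", "Mintage", "Quality"])
print(df)
```

Once the DataFrame exists, `df.to_excel(...)` or `df.to_csv(...)` can be called a single time on the complete table.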

Answer 2

Answered by ttfreeman

pandas already has a built-in function, read_html, that converts HTML tables into DataFrames:

table = soup.find_all('table')
df = pd.read_html(str(table))[0]
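
For reference, a small self-contained sketch of the same idea. The inline `html` string is a hypothetical stand-in for the real page. One assumption worth flagging: recent pandas versions deprecate passing a raw HTML string to `read_html`, so the markup is wrapped in `io.StringIO` here. `read_html` also needs a parser backend such as lxml to be installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the catalog page; the real markup may differ.
html = """
<table class="subs noBorders evenRows">
  <tr><th>Year</th><th>Mintage</th><th>Quality</th></tr>
  <tr><td>1882</td><td>108,000</td><td>UNC</td></tr>
  <tr><td>1883</td><td>786,000</td><td>UNC</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# read_html returns a list of DataFrames, one per <table> found in the input.
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```

Note that `read_html` tries to infer numeric types, and by default it treats `,` as a thousands separator, so the Mintage column should come back as integers rather than strings.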

Answer 3

Answered by Rakesh

Try:

import pandas as pd
from bs4 import BeautifulSoup

# `html` is assumed to hold the page source (e.g. `source` from the question)
soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'class':'subs noBorders evenRows'})
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    td = tr.find_all('td')
    # keep only the non-empty cell texts, stripped of surrounding whitespace
    row = [tr.text.strip() for tr in td if tr.text.strip()]
    if row:
        res.append(row)

df = pd.DataFrame(res, columns=["Year", "Mintage", "Quality", "Price"])
print(df)

Output:

   Year  Mintage Quality    Price
0  1882  108,000     UNC        —
1  1883  786,000     UNC  ~ .03

Answer 4

Answered by karina275

Just a heads-up: this part of Rakesh's code means that only HTML rows containing text are included in the DataFrame, because a row is not appended when row is an empty list:

if row:
    res.append(row)

This was problematic in my use case, where I wanted to compare row indexing between the HTML table and the DataFrame later on. I just needed to change it to:

res.append(row)

Also, if a cell in the row is empty, it doesn't get included. This then messes up the columns. So I changed

row = [tr.text.strip() for tr in td if tr.text.strip()]

to

row = [d.text.strip() for d in td]

But, otherwise, it's working for me. Thanks :)
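
Putting both changes together, a rough sketch might look like the following. The inline `html` string and the column names (including "Graph") are hypothetical and only imitate the catalog rows. The script removal and the padding of short rows are extra assumptions on top of the changes described above: the first keeps inline JavaScript out of the cell text, the second keeps the header row from breaking the column count.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup imitating the catalog rows: empty cells plus an inline <script>.
html = """
<table class="subs noBorders evenRows">
  <tr><th></th><th>Year</th><th>Mintage</th><th>Quality</th></tr>
  <tr><td></td><td>1882</td><td>108,000</td><td>UNC</td></tr>
  <tr><td><script>$('subGraph55337').on('click', function(event) {});</script></td><td>1884</td><td>4,604,000</td><td>UNC</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "subs noBorders evenRows"})

# Drop inline scripts so their source never ends up in a cell's text.
for script in table.find_all("script"):
    script.decompose()

columns = ["Graph", "Year", "Mintage", "Quality"]  # assumed names, one per <td>
rows = []
for tr in table.find_all("tr"):
    # Keep empty cells as "" so every value stays in its own column.
    row = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Pad short rows (such as the header row, which has no <td>) so that
    # HTML row indexes and DataFrame row indexes line up.
    row += [""] * (len(columns) - len(row))
    rows.append(row)

df = pd.DataFrame(rows, columns=columns)
print(df)
```

Columns that turn out to be empty everywhere can be dropped afterwards, for example with `df.drop(columns=["Graph"])`.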
