Convert an HTML table to CSV in Python with pandas
Disclaimer: this page is a Chinese–English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/33633416/
Convert HTML table to CSV in Python
Asked by Alexis Eggermont
I'm trying to scrape a table from a dynamic page. With the following code (requires Selenium), I manage to get the contents of the <table> element.
I'd like to convert this table to CSV, and I have tried two things, but both fail:
- pandas.read_html returns an error saying I don't have html5lib installed, but I do, and in fact I can import it without problems.
- soup.find_all('tr') returns the error 'NoneType' object is not callable after I run soup = BeautifulSoup(tablehtml).
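For the record, once the parser dependency is actually visible to pandas (installing lxml, or making sure html5lib lives in the same environment as the interpreter running pandas), read_html alone covers the whole table-to-CSV conversion. A minimal sketch on a toy table, not the asker's actual page:

```python
from io import StringIO

import pandas as pd

html = "<table><tr><td>a</td><td>b</td></tr><tr><td>1</td><td>2</td></tr></table>"

# read_html returns a list of DataFrames, one per <table> found in the input
df = pd.read_html(StringIO(html))[0]
df.to_csv("table.csv", index=False)
```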
Here is my code:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import pandas as pd
main_url = "http://data.stats.gov.cn/english/easyquery.htm?cn=E0101"
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(7)
driver.find_element_by_partial_link_text("Industry").click()
time.sleep(7)
driver.find_element_by_partial_link_text("Main Economic Indicat").click()
time.sleep(6)
driver.find_element_by_id("mySelect_sj").click()
time.sleep(2)
driver.find_element_by_class_name("dtText").send_keys("last72")
time.sleep(3)
driver.find_element_by_class_name("dtTextBtn").click()
time.sleep(2)
table=driver.find_element_by_id("table_main")
tablehtml= table.get_attribute('innerHTML')
Answered by Kruger
Without access to the table you're actually trying to scrape, I used this example:
<table>
<thead>
<tr>
<td>Header1</td>
<td>Header2</td>
<td>Header3</td>
</tr>
</thead>
<tr>
<td>Row 11</td>
<td>Row 12</td>
<td>Row 13</td>
</tr>
<tr>
<td>Row 21</td>
<td>Row 22</td>
<td>Row 23</td>
</tr>
<tr>
<td>Row 31</td>
<td>Row 32</td>
<td>Row 33</td>
</tr>
</table>
and scraped it using:
from bs4 import BeautifulSoup as BS
content = ...  # the HTML contents of that table
soup = BS(content, 'html5lib')
rows = [tr.findAll('td') for tr in soup.findAll('tr')]
This rows object is a list of lists:
[
[<td>Header1</td>, <td>Header2</td>, <td>Header3</td>],
[<td>Row 11</td>, <td>Row 12</td>, <td>Row 13</td>],
[<td>Row 21</td>, <td>Row 22</td>, <td>Row 23</td>],
[<td>Row 31</td>, <td>Row 32</td>, <td>Row 33</td>]
]
...and you can write it to a file:
with open('result.csv', 'a') as f:
    for it in rows:
        f.write(", ".join(str(e).replace('<td>', '').replace('</td>', '') for e in it) + '\n')
which looks like this:
Header1, Header2, Header3
Row 11, Row 12, Row 13
Row 21, Row 22, Row 23
Row 31, Row 32, Row 33
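An aside on the tag stripping above: str(e).replace('<td>', '') breaks as soon as a cell carries attributes (e.g. <td class="x">). BeautifulSoup's get_text() is a safer way to pull cell text. A sketch on a cut-down version of the sample table, using the stdlib html.parser so no html5lib install is needed:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><td>Header1</td><td>Header2</td></tr>
<tr><td>Row 11</td><td>Row 12</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
# extract plain cell text instead of stripping tags by hand
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]
# rows == [['Header1', 'Header2'], ['Row 11', 'Row 12']]
```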
Answered by AXO
Using the csv module and selenium selectors would probably be more convenient here:
import csv
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/")
table = driver.find_element_by_css_selector("#tableid")
with open('eggs.csv', 'w', newline='') as csvfile:
    wr = csv.writer(csvfile)
    for row in table.find_elements_by_css_selector('tr'):
        wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
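One advantage of csv.writer over the hand-rolled ", ".join(...) in the earlier answer is that it quotes cells containing commas or quotes automatically. The writing half can be sketched without a browser, assuming the rows have already been extracted:

```python
import csv
import io

# stand-in for rows already scraped from the table
rows = [["Header1", "Header2"], ["Row 11", "Row 12"]]

buf = io.StringIO()
wr = csv.writer(buf, lineterminator="\n")
wr.writerows(rows)
print(buf.getvalue())
# Header1,Header2
# Row 11,Row 12
```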