Convert an HTML table to CSV in Python with pandas
Disclaimer: this page is a Chinese–English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/33633416/
Convert HTML table to CSV in Python
Asked by Alexis Eggermont
I'm trying to scrape a table from a dynamic page. With the following code (requires Selenium), I manage to get the contents of the <table> element.
I'd like to convert this table to CSV, and I have tried two things, but both fail:
- pandas.read_html returns an error saying I don't have html5lib installed, but I do, and in fact I can import it without problems.
- soup.find_all('tr') returns the error 'NoneType' object is not callable after I run soup = BeautifulSoup(tablehtml).
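For the record, once the parser dependency is actually visible to pandas (installing lxml, or making sure html5lib lives in the same environment as the interpreter running pandas), read_html alone covers the whole table-to-CSV conversion. A minimal sketch on a toy table, not the asker's actual page:

```python
from io import StringIO

import pandas as pd

html = "<table><tr><td>a</td><td>b</td></tr><tr><td>1</td><td>2</td></tr></table>"

# read_html returns a list of DataFrames, one per <table> found in the input
df = pd.read_html(StringIO(html))[0]
df.to_csv("table.csv", index=False)
```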
Here is my code:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import pandas as pd
main_url = "http://data.stats.gov.cn/english/easyquery.htm?cn=E0101"
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(7)
driver.find_element_by_partial_link_text("Industry").click()
time.sleep(7)
driver.find_element_by_partial_link_text("Main Economic Indicat").click()
time.sleep(6)
driver.find_element_by_id("mySelect_sj").click()
time.sleep(2)
driver.find_element_by_class_name("dtText").send_keys("last72")
time.sleep(3)
driver.find_element_by_class_name("dtTextBtn").click()
time.sleep(2)
table=driver.find_element_by_id("table_main")
tablehtml= table.get_attribute('innerHTML')
Answered by Kruger
Without access to the table you're actually trying to scrape, I used this example:
<table>
<thead>
<tr>
<td>Header1</td>
<td>Header2</td>
<td>Header3</td>
</tr>
</thead>
<tr>
<td>Row 11</td>
<td>Row 12</td>
<td>Row 13</td>
</tr>
<tr>
<td>Row 21</td>
<td>Row 22</td>
<td>Row 23</td>
</tr>
<tr>
<td>Row 31</td>
<td>Row 32</td>
<td>Row 33</td>
</tr>
</table>
and scraped it using:
from bs4 import BeautifulSoup as BS
content = ...  # the HTML contents of that table
soup = BS(content, 'html5lib')
rows = [tr.findAll('td') for tr in soup.findAll('tr')]
This rows object is a list of lists:
[
[<td>Header1</td>, <td>Header2</td>, <td>Header3</td>],
[<td>Row 11</td>, <td>Row 12</td>, <td>Row 13</td>],
[<td>Row 21</td>, <td>Row 22</td>, <td>Row 23</td>],
[<td>Row 31</td>, <td>Row 32</td>, <td>Row 33</td>]
]
...and you can write it to a file:
with open('result.csv', 'a') as f:
    for it in rows:
        f.write(", ".join(str(e).replace('<td>', '').replace('</td>', '') for e in it) + '\n')
which looks like this:
Header1, Header2, Header3
Row 11, Row 12, Row 13
Row 21, Row 22, Row 23
Row 31, Row 32, Row 33
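An aside on the tag stripping above: str(e).replace('<td>', '') breaks as soon as a cell carries attributes (e.g. <td class="x">). BeautifulSoup's get_text() is a safer way to pull cell text. A sketch on a cut-down version of the sample table, using the stdlib html.parser so no html5lib install is needed:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><td>Header1</td><td>Header2</td></tr>
<tr><td>Row 11</td><td>Row 12</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
# extract plain cell text instead of stripping tags by hand
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]
# rows == [['Header1', 'Header2'], ['Row 11', 'Row 12']]
```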
Answered by AXO
Using the csv module and selenium selectors would probably be more convenient here:
import csv
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/")
table = driver.find_element_by_css_selector("#tableid")
with open('eggs.csv', 'w', newline='') as csvfile:
    wr = csv.writer(csvfile)
    for row in table.find_elements_by_css_selector('tr'):
        wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
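One advantage of csv.writer over the hand-rolled ", ".join(...) in the earlier answer is that it quotes cells containing commas or quotes automatically. The writing half can be sketched without a browser, assuming the rows have already been extracted:

```python
import csv
import io

# stand-in for rows already scraped from the table
rows = [["Header1", "Header2"], ["Row 11", "Row 12"]]

buf = io.StringIO()
wr = csv.writer(buf, lineterminator="\n")
wr.writerows(rows)
print(buf.getvalue())
# Header1,Header2
# Row 11,Row 12
```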