Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/15250455/

Time: 2020-08-18 19:37:52  Source: igfitidea

How to parse an HTML table with Python and BeautifulSoup and write it to CSV

python beautifulsoup

Asked by user2140323

I am trying to parse an HTML page, fetch the currency values, and write them to CSV. I have the following code:

#!/usr/bin/env python

import urllib2
from BeautifulSoup import BeautifulSoup

contenturl = "http://www.bank.gov.ua/control/en/curmetal/detail/currency?period=daily"
soup = BeautifulSoup(urllib2.urlopen(contenturl).read())

table = soup.find('div', attrs={'class': 'content'})

rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = td.find(text=True) + ';'
        print text,
    print

The problem is that I do not know how to retrieve only the currency values. I tried a regexp such as '^[0-9]{3}' (starts with 3 digits), but it doesn't work.

Accepted answer by Martijn Pieters

You'd be much better off picking out specific cells in the table. The td cells with the cell_c class contain the data you are interested in, and the last one is always the currency exchange rate:

rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    if cols and 'cell_c' in cols[0].get('class', ''):
        # currency row
        digital_code, letter_code, units, name, rate = [c.text for c in cols]
        print digital_code, letter_code, units, name, rate

With the data in separate variables, you can now turn the text to decimal numbers, store them in a database, whatever.
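As a rough sketch of that last step, the snippet below converts the cell text to numbers and writes CSV with the standard library only. It is written for Python 3, unlike the Python 2 code above. The function name rows_to_csv, the column headers, and the sample rows are all assumptions made for illustration; the rates are invented, not real data. The semicolon delimiter simply mirrors the ';' used in the question.

```python
import csv
import io
from decimal import Decimal

# Hypothetical sample rows, shaped like the five cells unpacked above:
# (digital_code, letter_code, units, name, rate). The values are made up.
SAMPLE_ROWS = [
    ("036", "AUD", "100", "Australian Dollar", "825.1234"),
    ("978", "EUR", "100", "Euro", "1043.5500"),
]


def rows_to_csv(rows, out):
    """Write currency rows to `out` as semicolon-separated CSV.

    Returns the parsed records so the numbers can be reused.
    """
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(["digital_code", "letter_code", "units", "name", "rate"])
    records = []
    for digital_code, letter_code, units, name, rate in rows:
        # Strip whitespace that often surrounds text taken from table cells.
        record = (
            digital_code.strip(),   # kept as text to preserve leading zeros
            letter_code.strip(),
            int(units.strip()),
            name.strip(),
            Decimal(rate.strip()),  # exact decimal, no float rounding
        )
        writer.writerow(record)
        records.append(record)
    return records


if __name__ == "__main__":
    buffer = io.StringIO()
    rows_to_csv(SAMPLE_ROWS, buffer)
    print(buffer.getvalue(), end="")
```

To write to a real file instead of an in-memory buffer, pass a file opened with open("rates.csv", "w", newline="") as the out argument (the file name is just an example). Decimal is used rather than float so the rate is stored exactly as it appears on the page.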
