How to use Pandas read_html and requests library to read the table?

Disclaimer: this content is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/19983406/


Tags: python-2.7, pandas, python-requests

Asked by Terence Ng

How can I scrape the prices of a fund in:

http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U

It is wrong, but how do I modify it:

import pandas as pd
import requests
import re

url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
# attrs= is matched against the attributes of the <table> element itself, but the
# fundPriceCellN classes sit on the <td> cells, so no table matches and this fails.
tables = pd.read_html(requests.get(url).text, attrs={"class": re.compile(r"fundPriceCell\d+")})
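
For reference, one way to stay closer to the read_html + requests approach in the title is to let read_html parse every <table> on the page and filter the results afterwards. The sketch below is only an illustration, assuming the page still serves the prices in nested HTML tables and that the dates use the %d/%m/%Y format seen in the answers; the three-column filter is a guess at the layout, not something from the original posts.

import pandas as pd
import requests

url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
html = requests.get(url).text

# read_html returns one DataFrame per <table> it finds (and raises ValueError if there are none).
all_tables = pd.read_html(html)

# Heuristic filter (an assumption about the layout): keep tables with exactly
# three columns whose first column parses as a dd/mm/yyyy date.
price_tables = [
    t for t in all_tables
    if t.shape[1] == 3
    and pd.to_datetime(t.iloc[:, 0].astype(str), format='%d/%m/%Y', errors='coerce').notna().any()
]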

Accepted answer by brechin

I like lxml for parsing and querying HTML. Here's what I came up with:

import requests
from lxml import etree

url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
doc = requests.get(url)
tree = etree.HTML(doc.content)

# Match every <tr> whose first <td> has a class containing "fundPriceCell"
# (the cells carry classes like fundPriceCell1 / fundPriceCell2).
row_xpath = '//tr[contains(td[1]/@class, "fundPriceCell")]'

rows = tree.xpath(row_xpath)

for row in rows:
    # Each matched row holds three cells: the date and the two prices.
    (date_string, v1, v2) = (td.text for td in row)
    print("%s - %s - %s" % (date_string, v1, v2))

Answered by Terence Ng

My solution is similar to yours:

import pandas as pd
import requests
from lxml import etree

url = "http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U"
r = requests.get(url)
html = etree.HTML(r.content)
# Pull the text of every price cell out of the deeply nested tables.
data = html.xpath('//table//table//table//table//td[@class="fundPriceCell1" or @class="fundPriceCell2"]//text()')

# The cells come back as a flat list: date, bid, ask, date, bid, ask, ...
if len(data) % 3 == 0:
    df = pd.DataFrame([data[i:i+3] for i in range(0, len(data), 3)], columns=['date', 'bid', 'ask'])
    df = df.set_index('date')
    df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
    df.sort_index(inplace=True)
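
As a quick sanity check, assuming the request and XPath above still return data so that df was actually built (everything pulled out via text() is a string at this point):

print(df.tail())                 # the most recent dates, since the index was sorted ascending
df = df.astype(float)            # assumes the bid/ask cells are plain numbers with no currency symbols
print(df.loc[df.index.max()])    # the latest bid/ask quote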