HTML table to pandas table: Info inside html tags
Note: this content is taken from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, keep a link to the original, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/31771619/
Asked by iayork
I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>
When I convert this to pandas using pd.read_html(tbl) the output is like this:
0 1 2
0 265 JonesBlue 29
1 266 Smith 34
I need to keep the information in the <A HREF ...> tag, since the unique identifier is stored in the link. That is, the table should look like this:
0 1 2
0 265 jones03 29
1 266 smith01 34
I'm fine with various other outputs (for example, jones03 Jones would be even more helpful), but the unique ID is critical.
Other cells also have html tags in them, and in general I don't want those to be saved, but if that's the only way of getting the uid I'm OK with keeping those tags and cleaning them up later, if I have to.
Is there a simple way of accessing this information?
Accepted answer by unutbu
Since this parsing job requires the extraction of both text and attribute values, it cannot be done entirely "out-of-the-box" by a function such as pd.read_html. Some of it has to be done by hand.
Using lxml, you could extract the attribute values with XPath:
import lxml.html as LH
import pandas as pd
content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''
table = LH.fromstring(content)
for df in pd.read_html(content):
    df['refname'] = table.xpath('//tr/td/a/@href')
    # e.g. '/j/jones03.shtml' -> 'jones03'
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
    print(df)
yields
0 1 2 refname
0 265 JonesBlue 29 jones03
1 266 Smith 34 smith01
The above may be useful since it requires only a few extra lines of code to add the refname column.
But both LH.fromstring and pd.read_html parse the HTML, so efficiency could be improved by removing pd.read_html and parsing the table only once with LH.fromstring:
table = LH.fromstring(content)
# extract the text from `<td>` tags
data = [[elt.text_content() for elt in tr.xpath('td')]
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
print(df)
yields
id name val refname
0 265 JonesBlue 29 jones03
1 266 Smith 34 smith01
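As an aside, newer pandas releases (1.5 and later) add an extract_links argument to pd.read_html, which returns each body cell as a (text, href) tuple, so the link can be recovered without a separate lxml pass. A minimal sketch, assuming pandas >= 1.5 and the same markup as above:
import io
import pandas as pd

# same markup as in the question, repeated so the snippet is self-contained
content = '''<table><tbody>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td> <a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
</tbody></table>'''

# extract_links="body" turns every body cell into a (text, href) tuple;
# href is None for cells without a link
df = pd.read_html(io.StringIO(content), extract_links="body")[0]

hrefs = df[1].str[1]                   # href part of the name column
df = df.apply(lambda col: col.str[0])  # keep only the text part of every cell
df['refname'] = hrefs.str.extract(r'([^./]+)[.]')
print(df)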
Answered by k-nut
You could simply parse the table manually like this:
import BeautifulSoup
import pandas as pd
TABLE = """<table>
<tbody>
<tr>
<td>265</td>
<td <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
table = BeautifulSoup.BeautifulSoup(TABLE)
records = []
for tr in table.findAll("tr"):
    trs = tr.findAll("td")
    record = []
    record.append(trs[0].text)
    record.append(trs[1].a["href"])
    record.append(trs[2].text)
    records.append(record)
df = pd.DataFrame(data=records)
df
which gives you
0 1 2
0 265 /j/jones03.shtml 29
1 266 /s/smith01.shtml 34
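The snippet above uses the BeautifulSoup 3 style import; with the current beautifulsoup4 package, roughly the same idea might look like this sketch:
from bs4 import BeautifulSoup
import pandas as pd

TABLE = """<table><tbody>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td> <a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
</tbody></table>"""

soup = BeautifulSoup(TABLE, "html.parser")
records = []
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    # keep the plain text for the id and value cells, the href for the name cell
    records.append([tds[0].text, tds[1].a["href"], tds[2].text])

df = pd.DataFrame(records)
print(df)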
Answered by freeseek
You could use regular expressions to modify the text first and remove the html tags:
import re, pandas as pd
tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
tbl = re.sub(r'<a.*?href="(.*?)">(.*?)</a>', r'\1 \2', tbl)
pd.read_html(tbl)
which gives you
[ 0 1 2
0 265 /j/jones03.shtml JonesBlue 29
1 266 /s/smith01.shtml Smith 34]
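If the bare uid is wanted rather than the full path, one possible follow-up (a sketch, assuming the tbl string and column layout shown above) is:
# pd.read_html returns a list of DataFrames; take the first one
df = pd.read_html(tbl)[0]

# column 1 now holds strings like "/j/jones03.shtml JonesBlue":
# pull out the uid, then drop the path to leave just the name
df['refname'] = df[1].str.extract(r'/([^./]+)[.]')
df[1] = df[1].str.replace(r'^\S+\s+', '', regex=True)
print(df)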

