Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/31771619/

Time: 2020-09-13 23:43:14  Source: igfitidea

HTML table to pandas table: Info inside html tags

Tags: python, pandas, beautifulsoup

Asked by iayork

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:

<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>

When I convert this to pandas using pd.read_html(tbl), the output looks like this:

    0    1          2
 0  265  JonesBlue  29
 1  266  Smith      34

I need to keep the information in the <a href ...> tag, since the unique identifier is stored in the link. That is, the table should look like this:

    0    1        2
 0  265  jones03  29
 1  266  smith01  34

I'm fine with various other outputs (for example, "jones03 Jones" would be even more helpful), but the unique ID is critical.

Other cells also contain HTML tags, and in general I don't want those saved. But if keeping them is the only way of getting the uid, I'm OK with keeping the tags and cleaning them up later.

Is there a simple way of accessing this information?

Accepted answer by unutbu

Since this parsing job requires extracting both text and attribute values, it cannot be done entirely "out-of-the-box" by a function such as pd.read_html. Some of it has to be done by hand.

Using lxml, you could extract the attribute values with XPath:

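from io import StringIO

import lxml.html as LH
import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''

table = LH.fromstring(content)
# recent pandas versions expect a file-like object rather than a literal HTML string
for df in pd.read_html(StringIO(content)):
    # assumes every row holds exactly one link, so the list length matches the row count
    df['refname'] = table.xpath('//tr/td/a/@href')
    # keep the first run of characters that is followed by a dot, e.g. "jones03"
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
    print(df)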

yields

     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01


The above may be useful since it requires only a few extra lines of code to add the refname column.

But both LH.fromstring and pd.read_html parse the HTML, so efficiency could be improved by removing pd.read_html and parsing the table once with LH.fromstring:

table = LH.fromstring(content)
# extract the text from `<td>` tags
data = [[elt.text_content() for elt in tr.xpath('td')] 
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
print(df)

yields

    id        name  val  refname
0  265   JonesBlue   29  jones03
1  266       Smith   34  smith01
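
The pattern `([^./]+)[.]` used for the refname column can be checked in isolation with the standard library. The sketch below is only an illustration: the helper names `ref_from_regex` and `ref_from_path` and the second sample href are hypothetical and not part of the answer. It shows that the regex returns the first run of characters followed by a dot, which is fine for hrefs like `/j/jones03.shtml` but may pick up an earlier path segment if a directory name contains a dot. Taking the stem of the last path component is one possible alternative.

```python
import re
from pathlib import PurePosixPath


def ref_from_regex(href):
    """Same pattern as in the answer: first run of non-dot, non-slash characters before a dot."""
    match = re.search(r'([^./]+)[.]', href)
    return match.group(1) if match else None


def ref_from_path(href):
    """Alternative (an assumption, not the answer's method): file name without its last extension."""
    return PurePosixPath(href).stem


for href in ("/j/jones03.shtml", "/v1.2/jones03.shtml"):  # second href is a made-up edge case
    print(href, ref_from_regex(href), ref_from_path(href))
```

For the hrefs in the question both helpers agree, so the choice only matters if the real site uses dotted directory names.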

Answer by k-nut

You could simply parse the table manually like this:

from bs4 import BeautifulSoup
import pandas as pd

TABLE = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

table = BeautifulSoup(TABLE, "html.parser")
records = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    # keep the text of the first and third cells and the link target of the second
    record = [tds[0].text, tds[1].a["href"], tds[2].text]
    records.append(record)

df = pd.DataFrame(data=records)
print(df)

which gives you

     0                 1   2
0  265  /j/jones03.shtml  29
1  266  /s/smith01.shtml  34
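
The manual loop above assumes the second cell of every row contains a link; a row without one would make the `.a["href"]` lookup fail. As a rough sketch of a more tolerant variant, the example below uses only the standard library's `html.parser` and records each cell as a `(text, href)` pair, with `None` where no link exists. The class name `LinkTableParser` and the "NoLink" sample row are hypothetical, added only for illustration.

```python
from html.parser import HTMLParser


class LinkTableParser(HTMLParser):
    """Collect each <td> as a (text, href) pair; href is None if the cell has no link."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._text = None   # text fragments of the cell currently being read
        self._href = None   # first href seen inside the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None and self._row is not None:
            self._row.append(("".join(self._text).strip(), self._href))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = """<table><tbody>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td>NoLink</td><td>34</td></tr>
</tbody></table>"""

parser = LinkTableParser()
parser.feed(html)
for row in parser.rows:
    print(row)
```

The resulting list of rows could then be passed to pd.DataFrame, after choosing per cell whether to keep the text, the href, or both.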

Answer by freeseek

You could use regular expressions to modify the text first and remove the html tags:

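import re
from io import StringIO

import pandas as pd

tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
# replace each link with "<href> <link text>" so the URL survives as plain cell text;
# raw strings are needed so that \1 and \2 are treated as group references
tbl = re.sub(r'<a.*?href="(.*?)">(.*?)</a>', r'\1 \2', tbl)
# recent pandas versions expect a file-like object rather than a literal HTML string
print(pd.read_html(StringIO(tbl)))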

which gives you

[     0                           1   2
 0  265  /j/jones03.shtml JonesBlue  29
 1  266      /s/smith01.shtml Smith  34]
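
The question mentions that an output like "jones03 Jones" would be even more helpful than the full link. As a hedged variation on the substitution above, the sketch below captures only the file name without its extension instead of the whole href. The pattern `LINK` and the helper `inline_link_ids` are hypothetical names, and the pattern assumes double-quoted href attributes like those in the sample table; it is not a general HTML parser.

```python
import re

# group 1: last path component of the href up to its first dot; group 2: the link text
LINK = re.compile(r'<a\b[^>]*?href="(?:[^"]*/)?([^"/.]+)[^"]*"[^>]*>(.*?)</a>', re.S)


def inline_link_ids(html):
    """Replace each <a> element with '<id> <link text>' so the id survives as cell text."""
    return LINK.sub(r"\1 \2", html)


cell = '<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>'
print(inline_link_ids(cell))  # <td> jones03 JonesBlue</td>
```

The rewritten HTML can then be handed to pd.read_html in the same way as in the answer.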