Extracting data from HTML with Python
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/17126686/
Asked by Lormitto
I have the following text processed by my code in Python:
<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td>
Could you advise me how to extract the data from within <td>?
My idea is to put it in a CSV file with the following format: some link, some data 1, some data 2, some data 3.
I expect that it might be hard without regular expressions, but truthfully I still struggle with regular expressions.
I used my code more or less in the following manner:
tabulka = subpage.find("table")
for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    print col[0]
Ideally I would get the content of each td into some array. The HTML above is output from my Python code.
Accepted answer by Droogans
Get BeautifulSoup and just use it. It's great.
$> easy_install pip
$> pip install BeautifulSoup
$> python
>>> from BeautifulSoup import BeautifulSoup as BS
>>> import urllib2
>>> html = urllib2.urlopen(your_site_here)
>>> soup = BS(html)
>>> elem = soup.findAll('a', {'title': 'title here'})
>>> elem[0].text
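Note: the accepted answer dates from the BeautifulSoup 3 / urllib2 (Python 2) era. A rough Python 3 equivalent might look like this minimal sketch, assuming the beautifulsoup4 package is installed and with http://www.example.com standing in for your_site_here:
from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup       # pip install beautifulsoup4

html = urlopen("http://www.example.com")   # placeholder for your_site_here
soup = BeautifulSoup(html, "html.parser")  # naming a parser explicitly avoids a warning
elems = soup.find_all('a', {'title': 'title here'})
print(elems[0].text)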
Answered by 7stud
You shouldn't use regexes on HTML. You should use BeautifulSoup or lxml. Here are some examples using BeautifulSoup:
Your td tags actually look like this:
<td>newline
<a>some link</a>newline
<br />newline
some data 1<br />newline
some data 2<br />newline
some data 3</td>
So td.text looks like this:
<newline>some link<newline><newline>some data 1<newline>some data 2<newline>some data 3
You can see that each string is separated by at least one newline, so that enables you to separate out each string.
from bs4 import BeautifulSoup as bs
import re
html = """<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td>"""
soup = bs(html)
tds = soup.find_all('td')
csv_data = []
for td in tds:
    inner_text = td.text
    strings = inner_text.split("\n")
    csv_data.extend([string for string in strings if string])
print(",".join(csv_data))
--output:--
some link,some data 1,some data 2,some data 3
Or more concisely:
for td in tds:
    print(re.sub("\n+", ",", td.text.lstrip()))
--output:--
some link,some data 1,some data 2,some data 3
But that solution is brittle because it won't work if your html looks like this:
<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />some data 1<br />some data 2<br />some data 3</td>
Now td.text looks like this:
<newline>some link<newline>some data 1some data 2some data 3
And there isn't a way to figure out where some of the strings begin and end. But that just means you can't use td.text; there are still other ways to identify each string:
1)
from bs4 import BeautifulSoup as bs
import re
html = """<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />some data 1<br />some data 2<br />some data 3</td>"""
soup = bs(html)
tds = soup.find_all('td')
csv_data = []
for td in tds:
    a_tags = td.find_all('a')
    for a_tag in a_tags:
        csv_data.append(a_tag.text)
        br_tags = a_tag.findNextSiblings('br')
        for br in br_tags:
            csv_data.append(br.next.strip())  # get the element after the <br> tag
csv_str = ",".join(csv_data)
print(csv_str)
--output:--
some link,some data 1,some data 2,some data 3
2)
for td in tds:
    a_tag = td.find('a')
    if a_tag: csv_data.append(a_tag.text)
    for string in a_tag.findNextSiblings(text=True):  # find only text nodes
        string = string.strip()
        if string: csv_data.append(string)
csv_str = ",".join(csv_data)
print(csv_str)
--output:--
some link,some data 1,some data 2,some data 3
3)
for td in tds:
    a_tag = td.find('a')
    if a_tag: csv_data.append(a_tag.text)
    text_strings = a_tag.findNextSiblings(text=re.compile(r'\S+'))  # find only non-whitespace text nodes
    csv_data.extend(text_strings)
csv_str = ",".join(csv_data)
print(csv_str)
--output:--
some link,some data 1,some data 2,some data 3
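As an aside, bs4 also provides a stripped_strings generator, which yields each text fragment in a tag with surrounding whitespace removed; it should handle both of the html variants above. A minimal sketch, reusing the soup/tds from the examples:
for td in tds:
    # each non-whitespace text node, already stripped
    print(",".join(td.stripped_strings))
--output:--
some link,some data 1,some data 2,some data 3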
Answered by CopyPasteIt
I've never used BeautifulSoup, but I would bet that it is 'html-tag-aware' and can handle 'filler' space. But since html markup files are structured (and usually generated by a web design program), you can also try a direct approach using Python's .split() method. Incidentally, I recently used this approach to parse out a real world url/html to do something very similar to what the OP wanted.
Although the OP wanted to pull only one field from the <a> tag, below we pull the 'usual two' fields.
CODE:
#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: Extracting data from HTML using split()
# Link: https://stackoverflow.com/questions/17126686/extracting-data-from-html-with-python
#--------*---------*---------*---------*---------*---------*---------*---------*
import sys
page = """blah blah blah
<td>
<a href="http://www.link1tosomewhere.net" title="title1 here">some link1</a>
<br />
some data1 1<br />
some data1 2<br />
some data1 3</td>
mlah mlah mlah
<td>
<a href="http://www.link2tosomewhere.net" title="title2 here">some link2</a>
<br />
some data2 1<br />
some data2 2<br />
some data2 3</td>
flah flah flah
"""
#--------*---------*---------*---------*---------*---------*---------*---------#
while 1:# M A I N L I N E #
#--------*---------*---------*---------*---------*---------*---------*---------#
    page = page.replace('\n','')                 # remove \n from test html page
    csv = ''
    li = page.split('<td><a ')
    for i in range(0, len(li)):
        if li[i][0:6] == 'href="':
            s = li[i].split('</td>')[0]
            #                                    # li2 ready for csv
            li2 = s.split('<br />')
            #                                    # create csv file
            for j in range(0, len(li2)):
                #                                # get two fields from li2[0]
                if j == 0:
                    li3 = li2[0].split('"')
                    csv = csv + li3[1] + ','
                    li4 = li3[4].split('<')
                    csv = csv + li4[0][1:] + ','
                #                                # no comma on last field - \n instead
                elif j == len(li2) - 1:
                    csv = csv + li2[j] + '\n'
                #                                # just write out middle stuff
                else:
                    csv = csv + li2[j] + ','
    print(csv)
    sys.exit()
OUTPUT:
>>>
= RESTART: C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\board.py =
http://www.link1tosomewhere.net,some link1,some data1 1,some data1 2,some data1 3
http://www.link2tosomewhere.net,some link2,some data2 1,some data2 2,some data2 3
>>>
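Since the OP's goal was a CSV file rather than printed text, the comma-joined rows from any of the answers above can be written out with Python's csv module, which also takes care of quoting if a field ever contains a comma. A minimal sketch (the rows list and the output.csv name are just illustrations):
import csv

rows = [
    ['http://www.link1tosomewhere.net', 'some link1', 'some data1 1', 'some data1 2', 'some data1 3'],
    ['http://www.link2tosomewhere.net', 'some link2', 'some data2 1', 'some data2 2', 'some data2 3'],
]
with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)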
