Convert HTML to text using Python

Question

Asked by Aaron Bandelli

I am trying to convert an HTML block to text using Python.

Input:

<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>

Desired output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

I have tried using the html2text module without much success (I am quite new to Python :))

Here is what I have tried:

#!/usr/bin/env python

import urllib2
import html2text
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())

txt = soup.find('div', {'class' : 'body'})

print html2text.html2text(txt)

The "txt" object produces the HTML block above. I'd like to convert it to text and print it on the screen.

Any help with the piece of code would be much appreciated.

Answer 1

Accepted answer by root

What am I missing? soup.get_text() gives exactly the output you wanted...

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # html is the markup from the question; naming the parser avoids a bs4 warning
print(soup.get_text())

Output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

EDIT: To keep the newlines, as pointed out by @t-8ch:

print(soup.get_text('\n'))

P.S. To be exact, you can replace each newline with a double newline; the result is then identical to your example :)

soup.get_text().replace('\n','\n\n')
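
As a hedged sketch of another way to get one clean line per paragraph, the snippet below extracts text paragraph by paragraph and skips the empty ones. The sample markup is a shortened, made-up stand-in for the question's input, and the explicit "html.parser" argument is an assumption about which parser to use. It relies on bs4 being installed.

```python
from bs4 import BeautifulSoup

# Shortened, made-up stand-in for the HTML block in the question.
html = """<div class="body"><p><strong></strong></p>
<p>First paragraph.</p>
<p>Second with <a href="http://example.com/">Some Link</a> inside.</p></div>"""

soup = BeautifulSoup(html, "html.parser")

# get_text(" ", strip=True) strips each text fragment and joins them with a space.
lines = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

# Drop paragraphs that contain no text, such as <p><strong></strong></p>.
text = "\n".join(line for line in lines if line)
print(text)
```

One caveat: because every fragment is joined with a space, a link immediately followed by punctuation would come out as "Link ." rather than "Link.".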

Answer 2

Answered by ATOzTOA

You can use a regular expression, but this is not recommended.

The following code just removes all the HTML tags in your data, giving you the text.

import re

data = """<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""

data = re.sub(r'<.*?>', '', data)

print(data)

Output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
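
To illustrate why the regular-expression approach is not recommended, here is a small sketch using only the standard library. The sample strings are made up for illustration. It shows three things the pattern does not handle: character entities are left undecoded, script contents leak into the text, and a ">" inside an attribute value ends the match too early.

```python
import re
from html import unescape

TAG_PATTERN = re.compile(r'<.*?>')

# Made-up sample: an entity plus an inline script.
html = '<p>Fish &amp; chips</p><script>var x = 1;</script>'
stripped = TAG_PATTERN.sub('', html)
print(stripped)            # entity is still encoded, script code is still present

# html.unescape decodes the entity, but the script code remains.
decoded = unescape(stripped)
print(decoded)

# Made-up sample: a ">" inside an attribute value cuts the tag match short.
tricky = '<a title="1 > 0">link</a>'
broken = TAG_PATTERN.sub('', tricky)
print(broken)
```

For trusted, simple markup like the block in the question the pattern works; for arbitrary pages a real parser is the safer choice.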

Answer 3

Answered by t-8ch

The '\n' separator places a newline between the paragraphs.

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")  # text is the HTML string to convert
print(soup.get_text('\n'))

Answer 4

Answered by Joseph Roten

I was in need of a way of doing this on a client's system without having to download additional libraries. I never found a good solution, so I created my own. Feel free to use this if you like.

import urllib 

def html2text(strText):
    str1 = strText
    int2 = str1.lower().find("<body")
    if int2>0:
       str1 = str1[int2:]
    int2 = str1.lower().find("</body>")
    if int2>0:
       str1 = str1[:int2]
    list1 = ['<br>',  '<tr',  '<td', '</p>', 'span>', 'li>', '</h', 'div>' ]
    list2 = [chr(13), chr(13), chr(9), chr(13), chr(13),  chr(13), chr(13), chr(13)]
    bolFlag1 = True
    bolFlag2 = True
    strReturn = ""
    for int1 in range(len(str1)):
      str2 = str1[int1]
      for int2 in range(len(list1)):
        if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
           strReturn = strReturn + list2[int2]
      if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
         bolFlag1 = False
      if str1[int1:int1+6].lower() == '<style':
         bolFlag1 = False
      if str1[int1:int1+7].lower() == '</style':
         bolFlag1 = True
      if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
         bolFlag1 = True
      if str2 == '<':
         bolFlag2 = False
      if bolFlag1 and bolFlag2 and (ord(str2) != 10) :
        strReturn = strReturn + str2
      if str2 == '>':
         bolFlag2 = True
      if bolFlag1 and bolFlag2:
        strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
    strReturn = strReturn.replace(chr(13), '\n')
    return strReturn


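import urllib.request  # Python 3 home of urlopen (Python 2 used urllib.urlopen)

url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"
# Decode the downloaded bytes so html2text receives a str; UTF-8 is assumed here.
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
print(html2text(html))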

Answer 5

Answered by Sarah Messer

It's possible to use BeautifulSoup to remove unwanted scripts and similar, though you may need to experiment with a few different sites to make sure you've covered the different types of things you wish to exclude. Try this:

from requests import get
from bs4 import BeautifulSoup as BS
response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
soup = BS(response.content, "html.parser")
for child in soup.body.children:
   if child.name == 'script':
       child.decompose() 
print(soup.body.get_text())
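
The loop above only looks at direct children of the body. As a hedged variation, the sketch below removes script, style, and noscript elements at any depth before extracting the text. The sample markup is made up, and the list of tags to drop is an assumption that may need extending for real sites. It relies on bs4 being installed.

```python
from bs4 import BeautifulSoup

# Made-up page with a nested script, a style block, and a noscript fallback.
html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><div><script>var x = 1;</script><p>Visible text</p></div>"
    "<noscript>Enable JS</noscript></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# find_all accepts a list of tag names and searches the whole tree.
for tag in soup.find_all(["script", "style", "noscript"]):
    tag.decompose()  # remove the element and everything inside it

text = soup.get_text("\n", strip=True)
print(text)
```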

Answer 6

Answered by FrBrGeorge

It's possible using Python's standard html.parser module:

from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    text = ""
    def handle_data(self, data):
        self.text += data

f = HTMLFilter()
f.feed(data)  # data is the HTML string to convert
print(f.text)
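
The filter above keeps every text node, including script and style contents, and does not add line breaks of its own. A minimal sketch of one way to extend it follows, using only the standard library. The class name TextExtractor, the BLOCK_TAGS and SKIP_TAGS sets, and the sample markup are all hypothetical choices made for illustration, not part of the original answer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, one line per block element."""

    # Assumed set of tags whose end should start a new line.
    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    # Assumed set of tags whose contents should be ignored.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def get_text(self):
        # Strip each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


# Made-up sample input.
sample = (
    '<div><p><strong></strong></p>\n'
    '<p>Hello <a href="#">Some Link</a> world</p>'
    '<script>var x = 1;</script><p>Fish &amp; chips</p></div>'
)

parser = TextExtractor()
parser.feed(sample)
parser.close()  # flush any buffered text
result = parser.get_text()
print(result)
```

Keeping the fragments in a list and joining them once at the end avoids repeated string concatenation on large pages.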
