使用 Python 将 html 转换为文本

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/14694482/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-18 12:11:00  来源:igfitidea点击:

Converting html to text with Python

pythonweb-scrapingbeautifulsoup

提问by Aaron Bandelli

I am trying to convert an html block to text using Python.

我正在尝试使用 Python 将 html 块转换为文本。

Input:

输入:

<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>

Desired output:

期望的输出:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa

Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Lorem ipsum dolor 坐 amet,consectetuer adipiscing 精英。Aenean commodo ligula eget dolor。埃涅斯马萨

Consectetuer adipiscing 精英。Some Link Aenean commodo ligula eget dolor。埃涅斯马萨

Aenean massa.Lorem ipsum dolor 坐 amet,consectetuer adipiscing 精英。Aenean commodo ligula eget dolor。埃涅斯马萨

Lorem ipsum dolor 坐 amet,consectetuer adipiscing 精英。Aenean commodo ligula eget dolor。埃涅斯马萨

Consectetuer adipiscing 精英。Aenean commodo ligula eget dolor。埃涅斯马萨

I have tried using html2text module without much success (i am quite new to python :))

我尝试使用 html2text 模块但没有取得太大成功(我对 python 很陌生:))

here is what i have tried:

这是我尝试过的:

#!/usr/bin/env python

import urllib2
import html2text
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())

txt = soup.find('div', {'class' : 'body'})

print html2text.html2text(txt)

the "txt" object produces the html block above. I'd like to convert it to text and print it on the screen.

“txt”对象生成上面的 html 块。我想将其转换为文本并将其打印在屏幕上。

Any help with the piece of code would be much appreciated.

对这段代码的任何帮助将不胜感激。

采纳答案by root

What am I missing? soup.get_text()gives exactly the same output you wanted...

我错过了什么?soup.get_text()提供与您想要的完全相同的输出...

from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
print(soup.get_text())

output

输出

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

EDIT -And to keep newlines, as pointed out by @t-8ch:

编辑 -并保留换行符,正如@t-8ch 所指出的:

print(soup.get_text('\n'))

PS! To be exact you can replace newline with a double one -- then it is identical to your example :)

附注!确切地说,您可以用双换行替换换行符 - 那么它与您的示例相同:)

soup.get_text().replace('\n','\n\n')

回答by ATOzTOA

You can use regular expression... but not recommended...

您可以使用正则表达式...但不推荐...

The following code just removes all the HTML tags in your data, giving you the text.

以下代码仅删除数据中的所有 HTML 标记,为您提供文本。

import re

data = """<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""

data = re.sub(r'<.*?>', '', data)

print data

Output

输出

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

回答by t-8ch

The '\n'places a newline between the paragraphs.

'\n'会将段落之间的换行符。

from bs4 import Beautifulsoup

soup = Beautifulsoup(text)
print(soup.get_text('\n'))

回答by Joseph Roten

I was in need of a way of doing this on a client's system without having to download additional libraries. I never found a good solution, so I created my own. Feel free to use this if you like.

我需要一种在客户端系统上执行此操作的方法,而无需下载其他库。我从来没有找到一个好的解决方案,所以我创建了自己的解决方案。如果您愿意,请随意使用它。

import urllib 

def html2text(strText):
    str1 = strText
    int2 = str1.lower().find("<body")
    if int2>0:
       str1 = str1[int2:]
    int2 = str1.lower().find("</body>")
    if int2>0:
       str1 = str1[:int2]
    list1 = ['<br>',  '<tr',  '<td', '</p>', 'span>', 'li>', '</h', 'div>' ]
    list2 = [chr(13), chr(13), chr(9), chr(13), chr(13),  chr(13), chr(13), chr(13)]
    bolFlag1 = True
    bolFlag2 = True
    strReturn = ""
    for int1 in range(len(str1)):
      str2 = str1[int1]
      for int2 in range(len(list1)):
        if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
           strReturn = strReturn + list2[int2]
      if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
         bolFlag1 = False
      if str1[int1:int1+6].lower() == '<style':
         bolFlag1 = False
      if str1[int1:int1+7].lower() == '</style':
         bolFlag1 = True
      if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
         bolFlag1 = True
      if str2 == '<':
         bolFlag2 = False
      if bolFlag1 and bolFlag2 and (ord(str2) != 10) :
        strReturn = strReturn + str2
      if str2 == '>':
         bolFlag2 = True
      if bolFlag1 and bolFlag2:
        strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
    strReturn = strReturn.replace(chr(13), '\n')
    return strReturn


url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"    
html = urllib.urlopen(url).read()    
print html2text(html)

回答by Sarah Messer

It's possible to use BeautifulSoup to remove unwanted scripts and similar, though you may need to experiment with a few different sites to make sure you've covered the different types of things you wish to exclude. Try this:

可以使用 BeautifulSoup 删除不需要的脚本和类似的脚本,但您可能需要在几个不同的站点上进行试验以确保您已经涵盖了您希望排除的不同类型的内容。尝试这个:

from requests import get
from bs4 import BeautifulSoup as BS
response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
soup = BS(response.content, "html.parser")
for child in soup.body.children:
   if child.name == 'script':
       child.decompose() 
print(soup.body.get_text())

回答by FrBrGeorge

It's possible using python standard html.parser:

可以使用 python 标准html.parser

from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    text = ""
    def handle_data(self, data):
        self.text += data

f = HTMLFilter()
f.feed(data)
print(f.text)