Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/761824/

Time: 2020-11-03 20:48:52  Source: igfitidea

Python: How to convert Markdown-formatted text to plain text

Tags: python, parsing, markdown

Asked by Krish

I need to convert Markdown text to plain text so I can display a summary on my website. I want the code in Python.

Answered by Jason Coon

This module will help you do what you describe:

http://www.freewisdom.org/projects/python-markdown/Using_as_a_Module

Once you have converted the Markdown to HTML, you can use an HTML parser to extract the plain text.

Your code might look something like this:

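# Requires the third-party packages markdown and beautifulsoup4.
from bs4 import BeautifulSoup
from markdown import markdown

some_markdown_string = "# Title\n\nSome *emphasized* text."  # example input

# Render Markdown to HTML, then join the text nodes of the parsed HTML.
html = markdown(some_markdown_string)
text = ''.join(BeautifulSoup(html, "html.parser").find_all(string=True))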

Answered by Pavel Vorobyov

Although this is a very old question, I'd like to suggest a solution I came up with recently. It neither uses BeautifulSoup nor incurs the overhead of converting to HTML and back.

The markdown module's core class, Markdown, has a property output_formats which is not configurable but is patchable, like almost anything in Python. This property is a dict mapping an output format name to a rendering function. By default it has two output formats, 'html' and 'xhtml'. With a little help it can also have a plain-text rendering function, which is easy to write:

from markdown import Markdown
from io import StringIO


def unmark_element(element, stream=None):
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


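# Patch Markdown: register the plain-text renderer as a new output format.
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
# Plain output has no wrapping top-level tag to strip, so disable that step.
__md.stripTopLevelTags = False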


def unmark(text):
    return __md.convert(text)

The unmark function takes Markdown text as input and returns it with all the Markdown characters stripped out.
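
To see why the rendering function only needs to look at text and tail, here is a minimal sketch that runs the same traversal on a hand-built tree from the standard library's xml.etree.ElementTree, with no Markdown parsing involved. The name collect_text and the sample markup are illustrative assumptions, not part of the markdown package.

```python
import xml.etree.ElementTree as ET
from io import StringIO


def collect_text(element, stream=None):
    """Write element.text, then each child recursively, then element.tail."""
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        collect_text(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# Hypothetical sample tree, standing in for what a Markdown parser might build.
root = ET.fromstring("<div><p>Hello <em>world</em>!</p></div>")
plain = collect_text(root)
print(plain)
```

In ElementTree, text that follows a child's closing tag is stored on that child as tail, which is why the "!" after the em element is written only when the recursion reaches em.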

Answered by Rob

Commented and removed it because I finally think I see the rub here: it may be easier to convert your Markdown text to HTML and then remove the HTML from the text. I'm not aware of anything that removes Markdown from text effectively, but there are many HTML-to-plain-text solutions.
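
As one possible HTML-to-plain-text step, here is a minimal sketch that uses only the standard library's html.parser. It assumes the HTML has already been produced from the Markdown by some other tool; the names TextExtractor and html_to_text and the sample HTML are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Hypothetical sample input, standing in for HTML rendered from Markdown.
sample_html = "<h1>Title</h1>\n<p>Some <em>emphasized</em> text &amp; more.</p>"
print(html_to_text(sample_html))
```

This sketch keeps whitespace exactly as it appears between tags and does not treat script or style content specially, so real pages may need extra handling.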