Disclaimer: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3662142/

Date: 2020-08-18 12:08:26  Source: igfitidea

How to remove tags from a string in python using regular expressions? (NOT in HTML)

Tags: python, strip, arcmap

Asked by Tanner Semerad

I need to remove tags from a string in python.

<FNT name="Century Schoolbook" size="22">Title</FNT>

What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in Python. I'm using this particularly for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags for two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.

Accepted answer by Domenic

This should work:

import re
mystring = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
print(re.sub(r'<[^>]*>', '', mystring))  # prints: Title

To everyone saying that regexes are not the correct tool for the job:

The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.

I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
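
The a/c/b description above maps almost directly onto a pattern. The following is only a minimal sketch of that idea, not the answerer's own code: the helper name `strip_tags` and the constant `TAG_PATTERN` are made up for illustration, and it assumes tags never contain a literal `<` or `>` inside attribute values.

```python
import re

# a = '<', c = one or more characters that are neither '<' nor '>', b = '>'
TAG_PATTERN = re.compile(r'<[^<>]+>')

def strip_tags(text):
    """Remove every <...> span from text (hypothetical helper name)."""
    return TAG_PATTERN.sub('', text)

title = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
print(strip_tags(title))
```

Compiling the pattern once is a small convenience when the same substitution is applied to several text elements.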

Answer by Eric Fortin

If it's only for parsing and retrieving value, you might take a look at BeautifulStoneSoup.

Answer by Dagg Nabbit

Searching for this regex and replacing it with an empty string should work.

/<[A-Za-z\/][^>]*>/

Example (from python shell):

>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print(re.sub(r'<[A-Za-z\/][^>]*>', '', my_string))
Title
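
Requiring a letter or a slash right after `<` matters when the text itself contains comparison signs. The sketch below contrasts the two patterns from this thread on a made-up sample string; the names `LOOSE`, `STRICT`, and `sample` are illustrative assumptions, not part of any answer.

```python
import re

LOOSE = re.compile(r'<[^>]*>')            # anything between '<' and '>'
STRICT = re.compile(r'<[A-Za-z/][^>]*>')  # '<' must be followed by a letter or '/'

sample = '<FNT size="22">2 < 3 and 5 > 4</FNT>'

# The loose pattern also swallows '< 3 and 5 >' as if it were a tag.
print(repr(LOOSE.sub('', sample)))
# The strict pattern leaves the comparison signs alone.
print(repr(STRICT.sub('', sample)))
```

Neither pattern copes with a `>` inside a quoted attribute value, so this is a heuristic rather than a parser.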

Answer by ianmclaury

If the source text is well-formed XML, you can use the stdlib module ElementTree:

import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print(element.text)  # prints: Title

If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
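
For the well-formed case, note that `element.text` only returns the text before the first child element. If the title contains nested formatting tags, `itertext()` collects all of it. This is a sketch under that assumption; the helper name `text_only` and the nested `<BOL>` sample are illustrative, not taken from the question.

```python
import xml.etree.ElementTree as ET

def text_only(markup):
    """Parse well-formed markup and join all text nodes (hypothetical helper)."""
    element = ET.fromstring(markup)
    return ''.join(element.itertext())

nested = '<FNT name="Century Schoolbook" size="22">Main <BOL>Title</BOL></FNT>'
print(text_only(nested))
```

`ET.fromstring` raises `xml.etree.ElementTree.ParseError` on input that is not well-formed, so callers may want to catch that.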

Answer by Nathan Davis

Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.

Answer by Aminah Nuraini

Please avoid using regex. Even though a regex will work on your simple string, you may run into problems later if you get a more complex one.

You can use BeautifulSoup's get_text() feature.

from bs4 import BeautifulSoup

text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning

print(soup.get_text())