How do I use a regular expression to remove tags from a string in Python? (Not in HTML)

Question

Asked by Tanner Semerad

I need to remove tags from a string in Python.

<FNT name="Century Schoolbook" size="22">Title</FNT>

What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and those haven't worked for me in Python. I'm using this specifically for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags from two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.

Answer 1

Accepted answer by Domenic

This should work:

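import re
mystring = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
# Delete every '<...>' run; re.sub returns a new string and leaves mystring unchanged.
result = re.sub(r'<[^>]*>', '', mystring)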

To everyone saying that regexes are not the correct tool for the job:

The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.
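
The three entities described above can be written out directly. This is only a minimal sketch: the names A, C, B, TAG, and strip_tags are hypothetical and chosen here for illustration. Note that it uses c = [^><]+ as stated in this paragraph, which is slightly stricter than the <[^>]*> pattern shown earlier (it will not match an empty "<>").

```python
import re

# The three pieces named in the paragraph above.
A = "<"        # a: opening angle bracket
C = "[^><]+"   # c: one or more characters that are neither '<' nor '>'
B = ">"        # b: closing angle bracket

# Compile "acb" once so it can be reused for many strings.
TAG = re.compile(A + C + B)

def strip_tags(text):
    """Delete every occurrence of the a-c-b sequence from text."""
    return TAG.sub("", text)

sample = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
print(strip_tags(sample))  # Title
```

Text that contains a lone "<" with no closing ">" is left alone, because no complete a-c-b sequence is present.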

I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.

Answer 2

Answer by Eric Fortin

If you only need to parse the string and retrieve the value, you might take a look at BeautifulStoneSoup.

Answer 3

Answer by Dagg Nabbit

Searching for this regex and replacing each match with an empty string should work.

/<[A-Za-z\/][^>]*>/

Example (from python shell):

>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print(re.sub(r'<[A-Za-z/][^>]*>', '', my_string))
Title
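
Since the question mentions stripping tags from only specific elements, the same idea can be narrowed to named tags. The following is a rough sketch, not a tested ArcMap recipe: strip_named_tags is a hypothetical helper, and the nested BOL tag in the sample is included purely for illustration.

```python
import re

def strip_named_tags(text, tag_names):
    """Remove opening and closing tags whose name is in tag_names; keep other markup."""
    if not tag_names:
        return text
    names = "|".join(re.escape(name) for name in tag_names)
    # </?          '<' with an optional slash, so closing tags match too
    # (?:names)\b  one of the given tag names, ending at a word boundary
    # [^>]*>       any attributes, up to the closing bracket
    pattern = re.compile(r"</?(?:%s)\b[^>]*>" % names)
    return pattern.sub("", text)

sample = '<FNT name="Century Schoolbook" size="22"><BOL>Title</BOL></FNT>'
print(strip_named_tags(sample, ["FNT"]))         # <BOL>Title</BOL>
print(strip_named_tags(sample, ["FNT", "BOL"]))  # Title
```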

Answer 4

Answer by ianmclaury

If the source text is well-formed XML, you can use the stdlib module ElementTree:

import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print(element.text)  # Title

If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
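
One caveat with the ElementTree approach: element.text returns only the text before the first child element, so nested tags need itertext(). The sketch below assumes the input is a single well-formed fragment. The helper name text_only and the nested BOL tag are hypothetical, and returning None on a parse failure is just one possible choice.

```python
import xml.etree.ElementTree as ET

def text_only(markup):
    """Return all text inside a well-formed fragment, or None if it does not parse."""
    try:
        element = ET.fromstring(markup)
    except ET.ParseError:
        return None
    # itertext() yields the text of the element and all its descendants, in document order.
    return "".join(element.itertext())

nested = '<FNT name="Century Schoolbook" size="22">My <BOL>Title</BOL></FNT>'
print(repr(ET.fromstring(nested).text))  # 'My ' (only the text before the first child)
print(text_only(nested))                 # My Title
print(text_only('<FNT>Title'))           # None (the closing tag is missing)
```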

Answer 5

Answer by Nathan Davis

Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.

Answer 6

Answer by Aminah Nuraini

Please avoid using regex. Even though a regex will work on your simple string, you will run into problems later if you get a more complex one.
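
To make that concern concrete, here is a small sketch of one way a simple tag-stripping regex can go wrong: an attribute value that contains ">". The tricky string is invented for illustration. Whether ArcMap ever produces such a value is an assumption, not something stated in the question.

```python
import re

simple = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
# Hypothetical input: the attribute value itself contains a '>' character.
tricky = '<FNT name="A>B" size="22">Title</FNT>'

pattern = re.compile(r'<[^>]*>')

print(pattern.sub('', simple))  # Title
# The match stops at the first '>', so part of the tag is left behind.
print(pattern.sub('', tricky))  # B" size="22">Title
```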

You can use BeautifulSoup's get_text() method.

from bs4 import BeautifulSoup

text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

print(soup.get_text())
