Python regular expressions and Unicode

Question

Asked by dbr

I have a script that parses the filenames of TV episodes (show.name.s01e02.avi for example), grabs the episode name (from the www.thetvdb.com API) and automatically renames them into something nicer (Show Name - [01x02].avi).

The script works fine until you try to use it on files that have Unicode show names (something I never really thought about, since all the files I have are in English, so pretty much all of them fall within [a-zA-Z0-9'\-]).

How can I allow the regular expressions to match accented characters and the like? Currently the regex config section looks like this:

import re

config = {}

# Characters allowed in a show name; escaped so they can sit inside [...]
config['valid_filename_chars'] = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!@£$%^&*()_+=-[]{}"'.,<>`~? """
config['valid_filename_chars_regex'] = re.escape(config['valid_filename_chars'])

# Raw strings keep the regex backslashes intact without escape warnings
config['name_parse'] = [
    # foo_[s01]_[e01]
    re.compile(r'''^([%s]+?)[ \._\-]\[[Ss]([0-9]+?)\]_\[[Ee]([0-9]+?)\]?[^\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.1x09*
    re.compile(r'''^([%s]+?)[ \._\-]\[?([0-9]+)x([0-9]+)[^\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.s01.e01, foo.s01_e01
    re.compile(r'''^([%s]+?)[ \._\-][Ss]([0-9]+)[\.\- ]?[Ee]([0-9]+)[^\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.103*
    re.compile(r'''^([%s]+)[ \._\-]([0-9]{1})([0-9]{2})[\._ -][^\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.0103*
    re.compile(r'''^([%s]+)[ \._\-]([0-9]{2})([0-9]{2,3})[\._ -][^\/]*$''' % (config['valid_filename_chars_regex'])),
]

Answer 1

Accepted answer by Mark Cidade

Use a subrange of [\u0000-\uFFFF] for what you want.
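
As a rough illustration of the subrange idea, here is a minimal sketch for Python 3. The names NAME_CHARS, show_name and leading_name are made up for this example, and the chosen range (U+00C0 to U+024F, covering accented Latin letters) is only an assumption about which characters you want to allow.

```python
import re

# Hypothetical character class: ASCII letters and digits plus the
# Latin-1 Supplement letters and Latin Extended-A/B (U+00C0 to U+024F).
# Note that this range also includes the signs U+00D7 and U+00F7.
NAME_CHARS = r"[0-9A-Za-z\u00C0-\u024F]"
show_name = re.compile(r"^(%s+)" % NAME_CHARS)


def leading_name(filename):
    """Return the leading run of allowed characters, or None if there is none."""
    match = show_name.match(filename)
    return match.group(1) if match else None
```

With this class, leading_name("Amélie.s01e02.avi") picks up the accented name as long as the é is stored as a single precomposed character. Names in other scripts would need further ranges added to the class.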

You can also use the re.UNICODE compile flag. The docs say that if UNICODE is set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
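
A small sketch of how the flag could be applied to one of the filename patterns. The pattern below is a hypothetical variant of the "foo.s01e01" expression from the question, with the fixed character list swapped for \w plus a few punctuation characters; episode and parse are names invented for this example.

```python
import re

# With re.UNICODE, \w matches Unicode letters and digits plus underscore.
# On Python 3 this is already the default for str patterns, so the flag is
# redundant but harmless; on Python 2 it had to be set explicitly and the
# pattern and input had to be unicode strings.
episode = re.compile(
    r"^([\w .'\-]+?)[ ._\-][Ss]([0-9]+)[.\- ]?[Ee]([0-9]+)[^/]*$",
    re.UNICODE,
)


def parse(filename):
    """Return (show name, season, episode), or None if nothing matches."""
    match = episode.match(filename)
    if match is None:
        return None
    name, season, ep = match.groups()
    return name, int(season), int(ep)
```

Here parse("show.name.s01e02.avi") yields the name "show.name" with season 1 and episode 2, and an accented name such as "Amélie" is captured the same way.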

See also http://coding.derkeiler.com/Archive/Python/comp.lang.python/2004-05/2560.html.

Answer 2

Answer by MRAB

Python's re module doesn't support \p{Letter} or \X. However, the new regex implementation on PyPI does.

Answer 3

Answer by Peter Stuifzand

In Mastering Regular Expressions by Jeffrey Friedl (great book), it is mentioned that you can use \p{Letter}, which matches Unicode characters that are considered letters.
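
Since Python's standard re module does not accept \p{Letter}, one commonly used stand-in is the class [^\W\d_]. The sketch below assumes Python 3 and is only an approximation: a few non-letter characters that \w accepts but \d does not (for example superscript digits) still get through. LETTER, letters_only and letter_runs are names made up for this example.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an
# underscore", which approximates \p{Letter} with the standard re module.
LETTER = r"[^\W\d_]"
letters_only = re.compile(r"(?:%s)+" % LETTER)


def letter_runs(text):
    """Return every maximal run of letter-like characters in text."""
    return letters_only.findall(text)
```

For instance, letter_runs("Amélie_2 Ångström") splits on the underscore, the digit and the space and returns the two names.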

Answer 4

Answer by grapefrukt

\X seems to be available as a generic word-character in some languages; it allows you to match a single character regardless of how many bytes it takes up. Might be useful.
