Python: using regular expressions to exclude characters in a string search?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/19924705/

Time: 2020-08-19 14:59:13  Source: igfitidea

Tags: python, regex, string

Asked by Zack Cruise

I'm working with a Python 2.7.2 script to find lists of words inside of a text file that I'm using as a master word list.

I am calling the script in a terminal window, inputting any number of regular expressions, and then running the script.

So, if I pass in the two regular expressions "^.....$" and ".*z", it will print every five-letter word that contains at least one "z".

What I am trying to do is add another regular expression to EXCLUDE a character from the strings. I would like to print out all words that have five letters, a "z", but -not- a "y".

Here is the code:

import re
import sys

def read_file_to_set(filename):
    words = None
    with open(filename) as f:
        words = [word.lower() for word in f.readlines()]
    return set(words)

def matches_all(word, regexes):
    for regex in regexes:
        if not regex.search(word):
            return False
    return True

if len(sys.argv) < 3:
    print "Needs a source dictionary and a series of regular expressions"
else:
    source = read_file_to_set(sys.argv[1])
    regexes = [re.compile(arg, re.IGNORECASE)
               for arg in sys.argv[2:]]
    for word in sorted(source):
        if matches_all(word.rstrip(), regexes):
            print word,

What modifiers can I add to the regular expressions that I pass into the program so that I can exclude certain characters from the strings it prints?

If that isn't possible, what needs to be implemented in the code?

Answered by piojo

Specifying a character that must not match is done like this (this matches any single character except a lowercase letter):

[^a-z]

So to match a string that does not contain "y", the regex is: ^[^y]*$

Character by character explanation:

^ means "beginning" if it comes at the start of the regex. Similarly, $ means "end" if it comes at the end. [abAB] matches any single character listed inside the brackets; a range can be given as well. For example, match any hex character (upper or lower case): [a-fA-F0-9]

* means 0 or more of the previous expression. As the first character inside [], ^ has a different meaning: it means "not". So [^a-fA-F0-9] matches any non-hex character.
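
A short sketch of the character-class rules above, using Python's re module. The sample strings are made up for illustration, and the code is written for Python 3 (the single-argument print calls should also work on 2.7).

```python
import re

# [a-fA-F0-9] matches exactly one hex digit; [^a-fA-F0-9] matches one
# character that is NOT a hex digit.
hex_char = re.compile(r'[a-fA-F0-9]')
non_hex_char = re.compile(r'[^a-fA-F0-9]')

# Sample inputs (made up for this example).
print(bool(hex_char.search('xyz7')))      # True: '7' is a hex digit
print(bool(non_hex_char.search('beef')))  # False: every character is a hex digit
print(bool(non_hex_char.search('beet')))  # True: 't' is not a hex digit

# * repeats the previous expression zero or more times.
print(re.match(r'[a-f]*', 'abcxyz').group())  # abc
```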

When you put a pattern between ^ and $, you force the regex to match the string exactly (nothing before or after the pattern). Combine all these facts:

^[^y]*$ means a string that consists entirely of 0 or more characters that are not 'y'. (To do something more interesting, you could check for non-numbers: ^[^0-9]*$)
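
As a rough sketch of how ^[^y]*$ fits the question's setup, the example below applies three patterns with the same "every pattern must match" logic as the question's matches_all. The word list is hypothetical and stands in for the master word file; the code targets Python 3.

```python
import re

def matches_all(word, regexes):
    # Same logic as the question's matches_all: every pattern must be found.
    return all(regex.search(word) for regex in regexes)

# The three patterns that would be passed on the command line:
# exactly five characters, contains a "z", contains no "y".
patterns = ['^.....$', 'z', '^[^y]*$']
regexes = [re.compile(p, re.IGNORECASE) for p in patterns]

# Hypothetical word list standing in for the master word file.
words = ['zebra', 'crazy', 'dizzy', 'plaza', 'pizza', 'zoo', 'apple']
selected = [w for w in words if matches_all(w, regexes)]
print(selected)  # ['zebra', 'plaza', 'pizza']
```

With the script from the question, this would correspond to passing "^[^y]*$" as one more pattern on the command line, with no change to the code itself.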

Answered by VooDooNOFX

You can accomplish this with negative lookarounds. This isn't a task that regexes are particularly fast at, but it does work. To match any string that does not contain the substring foo, you can use:

>>> my_regex = re.compile(r'^((?!foo).)*$', flags = re.I)
>>> print my_regex.match(u'IMatchJustFine')
<_sre.SRE_Match object at 0x1034ea738>
>>> print my_regex.match(u'IMatchFooFine')
None

As others have pointed out, if you're only excluding a single character, then a simple negated character class will suffice. Longer and more complex negative matches would need to use this approach.
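
The following is a minimal sketch of the negative-lookahead approach in Python 3 syntax. The first pattern is the one from this answer. The shorter variant and the combined pattern for the question's case (five characters, a "z", no "y") are illustrative assumptions rather than part of the original answer, and the sample words are made up.

```python
import re

# ^((?!foo).)*$ : at every position, check that "foo" does not start
# there, then consume one character.
no_foo = re.compile(r'^((?!foo).)*$', re.IGNORECASE)
print(bool(no_foo.match('IMatchJustFine')))  # True
print(bool(no_foo.match('IMatchFooFine')))   # False

# A shorter form that, for single-line input, should accept the same
# strings: a single lookahead at the start rules out "foo" anywhere.
no_foo_short = re.compile(r'^(?!.*foo).*$', re.IGNORECASE)
print(bool(no_foo_short.match('IMatchFooFine')))  # False

# The question's three conditions folded into one pattern:
# no "y" anywhere, at least one "z", exactly five characters.
five_z_no_y = re.compile(r'^(?!.*y)(?=.*z).{5}$', re.IGNORECASE)
hits = [w for w in ['zebra', 'crazy', 'pizza', 'zoo'] if five_z_no_y.match(w)]
print(hits)  # ['zebra', 'pizza']
```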