Original question: http://stackoverflow.com/questions/20089922/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Python Regex Engine - "look-behind requires fixed-width pattern" Error
Asked by SpikETidE
I am trying to handle unmatched double quotes within a string in CSV format.
To be precise,
"It "does "not "make "sense", Well, "Does "it"
should be corrected as
"It" "does" "not" "make" "sense", Well, "Does" "it"
So basically what I am trying to do is to
replace all the ' " '
- Not preceded by a beginning of line or a comma (and)
- Not followed by a comma or an end of line
with ' " " '
For that I use the below regex
(?<!^|,)"(?!,|$)
The problem is that while Ruby regex engines (http://www.rubular.com/) are able to parse the regex, Python regex engines (https://pythex.org/, http://www.pyregex.com/) throw the following error:
Invalid regular expression: look-behind requires fixed-width pattern
And Python 2.7.3 throws:
sre_constants.error: look-behind requires fixed-width pattern
Can anyone tell me what vexes python here?
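For readers who want to reproduce this on a current interpreter, here is a minimal sketch. It assumes Python 3, where the exception is exposed as re.error rather than sre_constants.error; the helper name compile_error is made up for illustration. The look-behind alternates between ^ (zero characters wide) and , (one character wide), which is what the engine objects to.

```python
import re

# The pattern from the question: its look-behind alternates between
# ^ (matches zero characters) and , (matches one character).
PATTERN = r'(?<!^|,)"(?!,|$)'


def compile_error(pattern):
    """Hypothetical helper: return the compile error text, or None on success."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


message = compile_error(PATTERN)
print(message)

# A look-behind whose alternatives all have the same width compiles fine.
print(compile_error(r'(?<!,)"(?!,|$)'))
```

The exact wording of the message may vary between Python versions, so it is safer to catch re.error than to match on the text.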
==================================================================================
EDIT :
Following Tim's response, I got the below output for a multi line string
>>> str = """ "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it" """
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str)
' "It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" " '
At the end of each line, next to 'it' two double-quotes were added.
So I made a very small change to the regex to handle a new-line.
re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
But this gives the output
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
' "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it" " '
The last 'it' alone has two double-quotes.
But I wonder why the '$' end of line character will not identify that the line has ended.
==================================================================================
The final answer is
re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', str,flags=re.MULTILINE)
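A likely reason the last quote was still doubled is that the test string ends with a space after the final quote, so $ does not match directly after it. The sketch below uses a shorter two-line sample of my own (not the string from the question) to compare the pattern with and without the [ \t]* addition. The variable names are illustrative only.

```python
import re

# Two lines; the second one has a trailing space after its closing quote.
text = '"Does "it"\n"Does "it" '

# Without the whitespace allowance, $ cannot match right after the last
# quote, because a space still sits between the quote and the end.
naive = re.sub(r'\b\s*"(?!,|$)', '" "', text, flags=re.MULTILINE)

# Allowing optional spaces or tabs before $ treats that quote as a
# line-ending quote and leaves it alone.
fixed = re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', text, flags=re.MULTILINE)

print(repr(naive))  # '"Does" "it"\n"Does" "it" " '
print(repr(fixed))  # '"Does" "it"\n"Does" "it" '
```

With re.MULTILINE, $ matches before every newline as well as at the end of the string, which is why the first line is handled correctly in both cases.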
Accepted answer by Tim Pietzcker
Python lookbehind assertions need to be fixed width, but you can try this:
>>> s = '"It "does "not "make "sense", Well, "Does "it"'
>>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
'"It" "does" "not" "make" "sense", Well, "Does" "it"'
Explanation:
\b # Start the match at the end of a "word"
\s* # Match optional whitespace
" # Match a quote
(?!,|$) # unless it's followed by a comma or end of string
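The commented breakdown above can also be kept inside the pattern itself with the re.VERBOSE flag, which ignores whitespace and # comments in the regex. This is only a sketch of that idea; the name QUOTE_TO_CLOSE is mine, not from the answer.

```python
import re

# Same pattern as above, written with re.VERBOSE so the explanation
# travels with the regex. QUOTE_TO_CLOSE is an illustrative name.
QUOTE_TO_CLOSE = re.compile(r'''
    \b        # start the match at the end of a "word"
    \s*       # match optional whitespace
    "         # match a quote
    (?!,|$)   # unless it is followed by a comma or the end of the string
''', re.VERBOSE)

s = '"It "does "not "make "sense", Well, "Does "it"'
result = QUOTE_TO_CLOSE.sub('" "', s)
print(result)  # "It" "does" "not" "make" "sense", Well, "Does" "it"
```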
Answered by Wiktor Stribiżew
Python lookbehinds really need to be fixed-width, and when you have alternations in a lookbehind pattern that are of different length, there are several ways to handle this situation:
- Rewrite the pattern so that you do not have to use alternation (e.g. Tim's above answer using a word boundary, or you might also use an exact equivalent (?<=[^,])"(?!,|$) of your current pattern that requires a char other than a comma before the double quote, or a common pattern to match words enclosed with whitespace, (?<=\s|^)\w+(?=\s|$), can be written as (?<!\S)\w+(?!\S)), or
- Split the lookbehinds:
  - Positive lookbehinds need to be alternated in a group (e.g. (?<=a|bc) should be rewritten as (?:(?<=a)|(?<=bc)))
  - Negative lookbehinds can be just concatenated (e.g. (?<!^|,)"(?!,|$) should look like (?<!^)(?<!,)"(?!,|$)).
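The rewrites listed above can be checked with a short sketch like the one below. It assumes Python 3's re module; the helper compiles and the sample strings are made up for illustration. Note that the split negative lookbehind reproduces the asker's original pattern, so it finds the same quotes as the [^,] variant, not necessarily the ones Tim's word-boundary pattern replaces.

```python
import re


def compiles(pattern):
    """Hypothetical helper: True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Alternatives of different widths inside one lookbehind are rejected.
print(compiles(r'(?<=a|bc)x'))          # False
print(compiles(r'(?<!^|,)"(?!,|$)'))    # False

# Positive lookbehind: alternate whole lookbehinds inside a group.
positive_hits = re.findall(r'(?:(?<=a)|(?<=bc))x', 'ax bcx dx')
print(positive_hits)                    # ['x', 'x']

# Negative lookbehind: concatenate one lookbehind per alternative.
s = '"It "does "not "make "sense", Well, "Does "it"'
split_hits = [m.start() for m in re.finditer(r'(?<!^)(?<!,)"(?!,|$)', s)]
class_hits = [m.start() for m in re.finditer(r'(?<=[^,])"(?!,|$)', s)]
print(split_hits == class_hits)         # True

# Whitespace-delimited words without any alternation in the lookarounds.
word_hits = re.findall(r'(?<!\S)\w+(?!\S)', 'foo bar, baz')
print(word_hits)                        # ['foo', 'baz']
```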

