Python Regex Engine - "look-behind requires fixed-width pattern" Error

Question

Asked by SpikETidE

I am trying to handle unmatched double quotes within a CSV-formatted string.

To be precise,

"It "does "not "make "sense", Well, "Does "it"

should be corrected as

"It" "does" "not" "make" "sense", Well, "Does" "it"

So basically what I am trying to do is to

replace all the ' " '
Not preceded by a beginning of line or a comma (and)
Not followed by a comma or an end of line
with ' " " '

For that I use the below regex

(?<!^|,)"(?!,|$)

The problem is that while Ruby regex engines (http://www.rubular.com/) are able to parse the regex, Python regex engines (https://pythex.org/, http://www.pyregex.com/) throw the following error

Invalid regular expression: look-behind requires fixed-width pattern

And with Python 2.7.3 it throws

sre_constants.error: look-behind requires fixed-width pattern
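
For reference, the failure can be reproduced outside the online testers with a minimal sketch like the one below. It assumes Python 3, where the exception is exposed as `re.error` rather than the `sre_constants.error` name shown above; the variable name `message` is only illustrative.

```python
import re

# The lookbehind has two alternatives of different width:
# ^ consumes zero characters, while , consumes one.
try:
    re.compile(r'(?<!^|,)"(?!,|$)')
    message = None
except re.error as exc:
    message = str(exc)

print(message)
```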

Can anyone tell me what vexes Python here?

==================================================================================

EDIT :

Following Tim's response, I got the below output for a multi line string

>>> str = """ "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it" """
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str)
' "It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" " '

At the end of each line, next to 'it' two double-quotes were added.

So I made a very small change to the regex to handle a new-line.

re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)

But this gives the output

>>> re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
' "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it" " '

The last 'it' alone has two double-quotes.

But I wonder why the '$' end-of-line anchor does not recognize that the line has ended.

==================================================================================

The final answer is

re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', str,flags=re.MULTILINE)
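
The leftover quote after the last 'it' seems to come from the space that precedes the closing triple quote in the sample string: the final quote is followed by a space, not directly by the end of the line, so '$' does not match at that point. The sketch below uses a shortened, hypothetical two-line sample (`text`) to compare the two patterns; the names `without_ws` and `with_ws` are illustrative only.

```python
import re

# Shortened stand-in for the sample above; note the trailing space,
# so the text ends with '"it" ' just like the triple-quoted string.
text = ' "It "does "it"\n"It "does "it" '

# With MULTILINE, $ matches right before each "\n" and at the very end.
# The last quote is followed by a space, so the lookahead does not block it.
without_ws = re.sub(r'\b\s*"(?!,|$)', '" "', text, flags=re.MULTILINE)

# Allowing optional spaces or tabs before $ also covers the trailing space.
with_ws = re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', text, flags=re.MULTILINE)

print(repr(without_ws))
print(repr(with_ws))
```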

Answer 1

Accepted answer by Tim Pietzcker

Python lookbehind assertions need to be fixed width, but you can try this:

>>> s = '"It "does "not "make "sense", Well, "Does "it"'
>>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
'"It" "does" "not" "make" "sense", Well, "Does" "it"'

Explanation:

\b      # Start the match at the end of a "word"
\s*     # Match optional whitespace
"       # Match a quote
(?!,|$) # unless it's followed by a comma or end of string
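
The annotated pattern can also be kept in commented form in the code itself. This is a minimal sketch using `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern; the names `QUOTE_FIX` and `fixed` are illustrative, not part of the answer.

```python
import re

# Same pattern as above, written with inline comments via re.VERBOSE.
QUOTE_FIX = re.compile(r'''
    \b        # start the match at the end of a "word"
    \s*       # match optional whitespace
    "         # match a quote
    (?!,|$)   # unless it is followed by a comma or end of string
''', re.VERBOSE)

s = '"It "does "not "make "sense", Well, "Does "it"'
fixed = QUOTE_FIX.sub('" "', s)
print(fixed)
```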

Answer 2

Answered by Wiktor Stribiżew

Python lookbehinds really need to be fixed-width, and when you have alternations in a lookbehind pattern that are of different length, there are several ways to handle this situation:

Rewrite the pattern so that you do not need alternation. For example, Tim's answer above uses a word boundary. You could also use (?<=[^,])"(?!,|$), an exact equivalent of your current pattern that requires a character other than a comma before the double quote. Similarly, a common pattern for matching words surrounded by whitespace, (?<=\s|^)\w+(?=\s|$), can be written as (?<!\S)\w+(?!\S).
Or split the lookbehinds:
- Positive lookbehinds need to be alternated in a group, e.g. (?<=a|bc) should be rewritten as (?:(?<=a)|(?<=bc)).
- Negative lookbehinds can simply be concatenated, e.g. (?<!^|,)"(?!,|$) should become (?<!^)(?<!,)"(?!,|$).
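
The rewrites listed above can be checked with a short sketch like the following. The helper `compiles` and the small test strings are hypothetical and chosen only for illustration; the patterns themselves are the ones from this answer.

```python
import re

def compiles(pattern):
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# Alternatives of different width inside one lookbehind are rejected,
# while the split forms are accepted.
print(compiles(r'(?<=a|bc)x'))           # variable-width alternation
print(compiles(r'(?:(?<=a)|(?<=bc))x'))  # split positive lookbehinds
print(compiles(r'(?<!^)(?<!,)"(?!,|$)')) # concatenated negative lookbehinds

# Positive lookbehinds alternated in a group: x preceded by "a" or "bc".
positive = [m.start() for m in re.finditer(r'(?:(?<=a)|(?<=bc))x', 'ax bcx cx')]

# Concatenated negative lookbehinds: a quote not at the start, not after a
# comma, and not followed by a comma or the end of the string.
negative = [m.start() for m in re.finditer(r'(?<!^)(?<!,)"(?!,|$)', '"a"b,"c"')]

# Whitespace-delimited words without any alternation in the lookarounds.
words = re.findall(r'(?<!\S)\w+(?!\S)', 'foo bar-baz qux')

print(positive, negative, words)
```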
