Python regex for reading CSV-like rows

Note: this page is based on a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2212933/

Tags: python, regex, csv

Asked by Tomasz Zieliński

I want to parse incoming CSV-like rows of data. Values are separated by commas (and there could be leading and trailing whitespace around the commas), and can be quoted either with ' or with ". For example, this is a valid row:

    data1, data2  ,"data3'''",  'data4""',,,data5,

but this one is malformed:

    data1, data2, da"ta3", 'data4',

-- quotation marks can only be prepended or trailed by spaces.

Such malformed rows should be recognized - best would be to somehow mark the malformed value within the row, but if the regex simply fails to match the whole row, that's also acceptable.

I'm trying to write a regex able to parse this, using either match() or findall(), but every single regex I come up with has some problems with edge cases.

So, maybe someone with experience in parsing something similar could help me on this? (Or maybe this is too complex for regex and I should just write a function)

EDIT1:

The csv module is not of much use here:

    >>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
    [['2', ' "dat', 'a1"', " 'dat", "a2'", '']]

    >>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
    [['2', 'dat,a1', "'dat", "a2'", '']]

-- unless this can be tuned?

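For what it's worth, the only tuning csv.reader offers here is dialect parameters such as quotechar and skipinitialspace, and each call handles only one quote character at a time - a quick illustrative experiment, not part of the original question:

    import csv
    from io import StringIO

    line = '''2, "dat,a1", 'dat,a2','''

    # Default quotechar ('"') plus skipinitialspace: the double-quoted field is
    # kept intact, but the single-quoted one is still split on its embedded comma.
    print(list(csv.reader(StringIO(line), skipinitialspace=True)))

    # quotechar="'" flips the problem: now only the single-quoted field survives.
    print(list(csv.reader(StringIO(line), quotechar="'", skipinitialspace=True)))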

EDIT2: A few language edits - I hope it's more valid English now

EDIT3: Thank you for all the answers; I'm now pretty sure that a regular expression is not such a good idea here, as (1) covering all edge cases can be tricky and (2) the writer output is not regular. Having written that, I've decided to check out the mentioned pyparsing and either use it or write a custom FSM-like parser.
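
A rough sketch of what such an FSM-like parser could look like (purely illustrative - not the parser that was eventually written; it keeps the surrounding quotes and only flags quotes that open in the middle of a bare value):

    def parse_row(line):
        # Walk the line character by character, tracking whether we are inside quotes.
        fields, buf, quote = [], [], None
        for ch in line:
            if quote:                          # inside a quoted value
                buf.append(ch)
                if ch == quote:
                    quote = None               # closing quote
            elif ch in "'\"":
                if buf and buf[-1].strip():    # quote right after a non-space character
                    raise ValueError("malformed value: %r" % ("".join(buf) + ch))
                quote = ch
                buf.append(ch)
            elif ch == ",":
                fields.append("".join(buf).strip())
                buf = []
            else:
                buf.append(ch)
        fields.append("".join(buf).strip())
        return fields

    print(parse_row("""data1, data2  ,"data3'''",  'data4""',,,data5,"""))
    # ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', '', '', 'data5', '']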

Accepted answer by Peter Hansen

Although it would likely be possible with some combination of pre-processing, use of the csv module, post-processing, and use of regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).

In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...

You can likely find examples of pyparsing-based CSV parsing easily, with this question maybe being enough to get you started.
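
As a starting point, a minimal pyparsing sketch along those lines might look like this (a rough illustration, not Peter's code; the grammar and whitespace handling are assumptions and would need tightening for real data):

    from pyparsing import Optional, ParseException, QuotedString, Regex, delimitedList

    quoted = QuotedString('"') | QuotedString("'")        # quoted values (quotes are stripped)
    bare = Regex(r'''[^,'"\s][^,'"]*''').setParseAction(lambda t: t[0].rstrip())
    field = Optional(quoted | bare, default="")           # empty fields are allowed
    row = delimitedList(field, delim=",")

    good = """data1, data2  ,"data3'''",  'data4""',,,data5,"""
    bad = """data1, data2, da"ta3", 'data4',"""

    print(row.parseString(good, parseAll=True).asList())
    # ['data1', 'data2', "data3'''", 'data4""', '', '', 'data5', '']

    try:
        row.parseString(bad, parseAll=True)               # stray quote inside a bare value
    except ParseException as err:
        print("malformed row:", err)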

Answered by Max Shawabkeh

While the csv module is the right answer here, a regex that could do this is quite doable:

import re

r = re.compile(r'''
    \s*                # Any whitespace.
    (                  # Start capturing here.
      [^,"']+?         # Either a series of non-comma non-quote characters.
      |                # OR
      "(?:             # A double-quote followed by a string of characters...
          [^"\]|\.   # That are either non-quotes or escaped...
       )*              # ...repeated any number of times.
      "                # Followed by a closing double-quote.
      |                # OR
      '(?:[^'\\]|\\.)*' # Same as above, for single quotes.
    )                  # Done capturing.
    \s*                # Allow arbitrary space before the comma.
    (?:,|$)            # Followed by a comma or the end of a string.
    ''', re.VERBOSE)

line = r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""

print(r.findall(line))

# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']

EDIT: To validate lines, you can reuse the regex above with small additions:

import re

r_validation = re.compile(r'''
    ^(?:    # Capture from the start.
      # Below is the same regex as above, but condensed.
      # One tiny modification is that it allows empty values
      # The first plus is replaced by an asterisk.
      \s*([^,"']*?|"(?:[^"\]|\.)*"|'(?:[^'\]|\.)*')\s*(?:,|$)
    )*$    # And don't stop until the end.
    ''', re.VERBOSE)

line1 = r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""

if r_validation.match(line1):
    print('Line 1 is valid.')
else:
    print('Line 1 is INvalid.')

if r_validation.match(line2):
    print('Line 2 is valid.')
else:
    print('Line 2 is INvalid.')

# Prints:
#    Line 1 is valid.
#    Line 2 is INvalid.

Answered by pwdyson

Python has a standard library module for reading CSV files:

import csv

reader = csv.reader(open('file.csv'))

for line in reader:
    print(line)

For your example input this prints

['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']

EDIT:

You need to add skipinitialspace=True to allow spaces before double quotation marks for the extra examples you provided. Not sure about the single quotes yet.

>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]

>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]

Answered by John Machin

It is not possible to give you an answer, because you have not completely specified the protocol that is being used by the writer.

It evidently contains rules like:

If a field contains any commas or single quotes, quote it with double quotes.
Else if the field contains any double quotes, quote it with single quotes.
Note: the result is still valid if you swap double and single in the above 2 clauses.
Else don't quote it.
The resultant field may have spaces (or other whitespace?) prepended or appended.
The so-augmented fields are assembled into a row, separated by commas and terminated by the platform's newline (LF or CRLF).

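Assuming the rules really are as listed above, the writer might look roughly like this (an illustrative sketch, not the actual writer; it ignores the unspecified cases discussed below):

    def quote_field(value):
        if "," in value or "'" in value:
            return '"%s"' % value      # contains a comma or single quote: double-quote it
        if '"' in value:
            return "'%s'" % value      # contains a double quote: single-quote it
        return value                   # otherwise leave it unquoted

    def write_row(fields):
        return ",".join(quote_field(f) for f in fields)

    print(write_row(["data1", "data2", "data3'''", 'data4""', "", "", "data5", ""]))
    # data1,data2,"data3'''",'data4""',,,data5,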

What is not mentioned is what the writer does in these cases:
(0) field contains BOTH single quotes and double quotes
(1) field contains leading non-newline whitespace
(2) field contains trailing non-newline whitespace
(3) field contains any newlines.
Where the writer ignores any of these cases, please specify what outcomes you want.

You also mention "quotation marks can only be prepended or trailed by spaces" -- surely you mean commas are allowed also, otherwise your example 'data4""',,,data5, fails on the first comma.

How is your data encoded?

Answered by onaclov2000

This probably sounds too simple, but from the looks of things you are looking for a string that matches [a-zA-Z0-9]["']+[a-zA-Z0-9]. In other words, without in-depth testing against the data, what you're really looking for is a single or double quote (or any combination of the two) in between letters (you could also add numbers there).

Based on what you were asking, it really doesn't matter that it's a CSV; what matters is that you have data that doesn't conform. I believe you can find it by just searching for a letter, then any combination of one or more " or ' characters, then another letter.

Now, are you looking to get a "quantity", or just a printout of the lines that contain it, so you know which ones to go back and fix?

I'm sorry, I don't know Python regexes, but in Perl this would look something like this:

# Look for one or more letters/numbers, then at least one ' or ",
# then one or more letters/numbers again.
if ($line =~ m/[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+/ig)
{
    # Prints the line if the above regex is found
    print $line;

}

Simply convert that for use when you look at each line - a rough Python translation is sketched below.
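
In Python, the same check might look roughly like this (a convenience translation; the pattern is taken verbatim from the Perl above):

    import re

    # One or more letters/numbers, at least one ' or ", then letters/numbers again.
    pattern = re.compile(r"""[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+""")

    line = """data1, data2, da"ta3", 'data4',"""
    if pattern.search(line):
        print(line)    # printed, because da"ta3" embeds a quote inside a value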

I'm sorry if I misunderstood the question

I hope it helps!

Answered by knipknap

If your goal is to convert the data to XML (or JSON, or YAML), look at this example for a Gelatin syntax that produces the following output:

<xml>
  <line>
    <column>data1</column>
    <column>data2  </column>
    <column>data3'''</column>
    <column>data4""</column>
    <column/>
    <column/>
    <column>data5</column>
    <column/>
  </line>
</xml>

Note that Gelatin also has a Python API:

from Gelatin.util import compile, generate_to_file
syntax = compile('syntax.gel')
generate_to_file(syntax, 'input.csv', 'output.xml', 'xml')