PHP: How to validate a CSV file?

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2450345/

Time: 2020-08-25 06:32:49  Source: igfitidea

How to validate a CSV file?

Tags: php, csv

Asked by Rachel

How can we validate a CSV file?

I have a CSV file with this structure:

Date;Id;Shown
15-Mar-10;231;345
15-Mar-10;232;346
and so on, for approximately 80,000 rows.

How can I validate this CSV file before I start parsing it with fgetcsv?

Answered by Pascal MARTIN

I would not try to validate the file beforehand: I would rather go through it line by line, dealing with each line separately:

  • Reading one line
  • Verifying it's OK
  • Using the data
  • And going to the next line.
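The read / verify / use / next loop above can be sketched in PHP roughly as follows. This is a minimal sketch, not the answerer's code: processCsv and its two callbacks are hypothetical names, the ';' delimiter is taken from the sample in the question, and stopping at the first bad record is only one possible policy.

```php
<?php
// Hypothetical helper: reads records one at a time, verifies each, then uses it.
// $isValid and $useRow are caller-supplied callbacks (assumed names).
function processCsv($handle, callable $isValid, callable $useRow): int
{
    $recordNumber = 0;
    // fgetcsv() returns false at end of file; length 0 means "no line length limit".
    while (($row = fgetcsv($handle, 0, ';', '"', '\\')) !== false) {
        $recordNumber++;
        if (!$isValid($row)) {
            // Policy assumed here: stop at the first record that fails verification.
            throw new RuntimeException("Invalid data in record $recordNumber");
        }
        $useRow($row); // the record is OK, so use it right away
    }
    return $recordNumber; // number of records processed
}
```

If the file starts with a header line such as Date;Id;Shown, one extra fgetcsv() call before the loop would consume it.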


Now, what could "verifying it's OK" mean?

  • At least: make sure I can read the line as CSV with my normal set of functions (maybe fgetcsv, maybe some other function specific to my project). If I cannot read one line with the function that reads hundreds of others, there is probably a problem on that line.
  • Then, check the number of fields
  • Then, for each field, check whether it contains "valid" data
    • mandatory? optional?
    • numeric?
    • string?
    • date?
    • and so on
  • Then, for each field, some more careful checks
    • for instance, for a "code" field: does it correspond to a value that's legal for my application?
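As an illustration of the checks listed above, a per-row validator might look something like this. The function name validateRow, the three-column layout, the "non-negative integer" rule, and the optional $allowedIds list are all assumptions made for the sketch; the real rules depend on the application.

```php
<?php
// Hypothetical validator: returns a list of error messages (empty list = row is OK).
function validateRow(array $row, array $allowedIds = []): array
{
    // Field count first; fgetcsv() yields a one-element array for a blank line,
    // so this check catches blank lines as well.
    if (count($row) !== 3) {
        return ['expected 3 fields, got ' . count($row)];
    }

    $errors = [];
    list($date, $id, $shown) = $row;

    // Mandatory field check.
    if ($date === '') {
        $errors[] = 'date is mandatory';
    }
    // Numeric check: ctype_digit() accepts only non-empty strings of digits.
    if (!ctype_digit((string) $id)) {
        $errors[] = 'id must be a non-negative integer';
    } elseif ($allowedIds !== [] && !in_array($id, $allowedIds, true)) {
        // "More careful" check: is the value legal for the application?
        $errors[] = 'id is not a known value';
    }
    if (!ctype_digit((string) $shown)) {
        $errors[] = 'shown must be a non-negative integer';
    }
    return $errors;
}
```

Returning messages rather than a plain boolean makes it easier to tell the user what was wrong with a given line.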

If all of that goes OK, there is not much more to do except use the data ;-)
And when you're done with one line, just repeat for the next one.


Of course, if you want to either accept or reject the whole file before doing any database write (or anything like that), you'll have to:

  • parse the file, line by line, applying the "verifying" ideas
  • store the data of each line in memory
  • and, when the whole file has been read into memory,
    • either start using the data
    • or, if there's been an error on one line, reject everything.
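The accept-or-reject-everything approach described above could be sketched like this. The name loadAllOrNothing and the $isValid callback are hypothetical, and the ';' delimiter again comes from the question's sample.

```php
<?php
// Hypothetical helper: returns every row if the whole file is valid,
// or null as soon as a single row fails verification.
function loadAllOrNothing($handle, callable $isValid): ?array
{
    $rows = [];
    while (($row = fgetcsv($handle, 0, ';', '"', '\\')) !== false) {
        if (!$isValid($row)) {
            return null; // one bad line rejects everything
        }
        $rows[] = $row; // keep the validated row in memory
    }
    return $rows; // only now is it safe to start using the data
}
```

With around 80,000 short rows, holding everything in memory should normally be feasible, but that depends on the memory limit configured for the script.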


In your specific case, you have three kinds of fields:

Date;Id;Shown
15-Mar-10;231;345
15-Mar-10;232;346

From what I can guess:

  • The first one must be a date
    • Using a regex to validate it will not be easy: months do not all have the same number of days, there are many months, and the number of days in February depends on the year, ...
    • In such a case, I would probably try to parse the date with something like strtotime (not sure it's OK for the format you're using, though)
    • Or I would just explode the string
      • making sure there are three parts
      • that the third one is 2 digits
      • that the second one is one of Jan, Feb, Mar, ...
      • that the first one is a valid day number, depending on the two others
  • The second one:
    • must be an integer
    • must be a valid value that exists in your database?
      • If so, a simple SQL query will allow you to check that
  • For the third one, not really sure...
    • I'm guessing it has to be an integer?
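The explode-based date check outlined above might be written along these lines. It is only a sketch: isValidDate is a hypothetical name, English three-letter month abbreviations are assumed, and two-digit years are assumed to mean 2000 to 2099, which the question does not state.

```php
<?php
// Hypothetical check for dates shaped like "15-Mar-10".
function isValidDate(string $value): bool
{
    $months = ['Jan' => 1, 'Feb' => 2, 'Mar' => 3, 'Apr' => 4, 'May' => 5, 'Jun' => 6,
               'Jul' => 7, 'Aug' => 8, 'Sep' => 9, 'Oct' => 10, 'Nov' => 11, 'Dec' => 12];

    // There must be exactly three parts.
    $parts = explode('-', $value);
    if (count($parts) !== 3) {
        return false;
    }
    list($day, $month, $year) = $parts;

    // Day: one or two digits. Year: exactly two digits.
    if (!ctype_digit($day) || strlen($day) > 2 || !ctype_digit($year) || strlen($year) !== 2) {
        return false;
    }
    // Month: one of Jan, Feb, Mar, ...
    if (!isset($months[$month])) {
        return false;
    }
    // Assumption: "10" means 2010. checkdate() knows month lengths and leap years.
    return checkdate($months[$month], (int) $day, 2000 + (int) $year);
}
```

Leaving the calendar arithmetic to checkdate() avoids hand-coding the days-per-month and leap-year rules that make a pure regex awkward.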

Answered by TLiebe

You could use a regular expression to find rows that match (and therefore flag the ones that don't). Have a look at this link. That being said, you'll need to read through the whole file in order to validate it, so you're probably better off just parsing it in a single pass and catching any errors as you go.
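For the regular-expression idea, a rough sketch could look like this. The pattern is an assumption derived from the three sample rows in the question, and findNonMatchingLines is a hypothetical name. Note that such a pattern checks only the shape of a line, not whether the date actually exists.

```php
<?php
// Assumed shape: dd-Mon-yy;digits;digits (derived from the sample rows only).
const ROW_PATTERN = '/^\d{2}-[A-Z][a-z]{2}-\d{2};\d+;\d+\z/';

// Hypothetical helper: returns the 1-based numbers of lines that do not match.
function findNonMatchingLines(array $lines): array
{
    $bad = [];
    foreach ($lines as $index => $line) {
        // Strip the line ending before matching.
        if (preg_match(ROW_PATTERN, rtrim($line, "\r\n")) !== 1) {
            $bad[] = $index + 1;
        }
    }
    return $bad;
}
```

A line such as 99-Zzz-10;1;2 still passes this pattern, which is why field-level checks remain useful after the shape check.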

Answered by poke

Expect the data you are reading to be valid, and simply ignore any lines that seem invalid or are in an unexpected format.

CSV is used for data exchange or as data storage, so it is very likely that the file was already "valid" when it was generated. If, for whatever reason, you have a CSV file as user input (the only real source that invalid or unexpected data can come from), there is no problem with ignoring that data and telling the user about the invalid lines.
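The ignore-and-report policy described above could be sketched as follows. The name readValidRows, the $isValid callback, and the shape of the returned array are assumptions for illustration.

```php
<?php
// Hypothetical helper: keeps valid rows and remembers which records were skipped,
// so the user can be told about them afterwards.
function readValidRows($handle, callable $isValid): array
{
    $rows = [];
    $skipped = [];
    $recordNumber = 0;
    while (($row = fgetcsv($handle, 0, ';', '"', '\\')) !== false) {
        $recordNumber++;
        if ($isValid($row)) {
            $rows[] = $row;
        } else {
            $skipped[] = $recordNumber; // ignore the row, but keep track of it
        }
    }
    return ['rows' => $rows, 'skipped' => $skipped];
}
```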

Answered by roskakori

I wrote an open source Python tool to simplify validation of such files. It is available from http://pypi.python.org/pypi/cutplace/.

The basic idea is that you describe the data format in a structured interface specification using OpenOffice.org, Excel or plain CSV. This takes a few minutes and the result is legible enough to serve as documentation too. We use it to validate files with about 200,000 rows on a daily basis.

You can validate a CSV file using the command line:


cutplace specification.csv data.csv

If invalid data rows are found, the exit code is 1. If you need more control, you can write a small Python script that imports the cutplace module and adds a listener for validation events.
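Since the question is about PHP, one way to use that exit code is to run the command from PHP and inspect the status. This is a hedged sketch: commandSucceeded is a hypothetical helper, it assumes the cutplace command is installed and on the PATH, and it assumes the usual convention that exit code 0 means success (the answer above only states that invalid rows give 1).

```php
<?php
// Hypothetical helper: runs a shell command and reports whether it exited with 0.
function commandSucceeded(string $command): bool
{
    // exec() fills $outputLines with the command's output and $exitCode with its status.
    exec($command, $outputLines, $exitCode);
    return $exitCode === 0;
}

// Example call (file names taken from the command line shown above):
// $ok = commandSucceeded(
//     'cutplace ' . escapeshellarg('specification.csv') . ' ' . escapeshellarg('data.csv')
// );
```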

As an example, here's a specification that would validate the sample data you provided, filling the gaps in your short description by making a few assumptions. (I'm writing the specification in CSV to inline it in this post. In practice I prefer OpenOffice.org's Calc and ODS because I can use more formatting and make it easier to read and maintain.)

,"Interface: Show statistics"
,
,"Data format"
"D","Format","CSV"
"D","Item delimiter",";"
"D","Header","1"
"D","Encoding","ASCII"
,
,"Fields"
,"Name","Example","Empty","Length","Type","Rule"
"F","date","15-Mar-10",,,"RegEx","\d\d-[A-Z][a-z][a-z]-\d\d"
"F","id","231",,,"Integer","0:"
"F","shown","345",,,"Integer","0:"
,
,"Checks"
,"Description","Type","Rule"
"C","id per date must be unique","IsUnique","date, id"

Lines starting with "D" describe the basic data format. In this case it is a CSV file using ";" as delimiter with 1 header line in ASCII encoding.


Lines starting with "F" describe the various fields. For example,


,"Name","Example","Empty","Length","Type","Rule"
"F","id","231",,,"Integer","0:"

defines a mandatory field "id" of type Integer with a value of 0 or greater. To allow the field to be empty, specify an "X" in the "Empty" column:


,"Name","Example","Empty","Length","Type","Rule"
"F","id","231","X",,"Integer","0:"

Finally, there is an optional section for more advanced checks that span the whole file, not only single rows. For example, if each date in your file must provide data for an id only once, you can state this using:

,"Description","Type","Rule"
"C","id per date must be unique","IsUnique","date, id"

Any row that starts with an empty column can contain any text you like and will not be processed during validation. This is useful for headings, comments and so on.

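For comparison, the "id per date must be unique" rule from the Checks section can also be verified by hand in PHP once the rows have been read. This sketch is not part of cutplace: findDuplicateKeys is a hypothetical name, and it assumes each row is an array with the date in column 0 and the id in column 1.

```php
<?php
// Hypothetical whole-file check: returns the 1-based numbers of rows whose
// (date, id) combination has already appeared earlier in the file.
function findDuplicateKeys(array $rows): array
{
    $seen = [];
    $duplicates = [];
    foreach ($rows as $index => $row) {
        $key = $row[0] . ';' . $row[1]; // combine date and id into one lookup key
        if (isset($seen[$key])) {
            $duplicates[] = $index + 1;
        } else {
            $seen[$key] = true;
        }
    }
    return $duplicates;
}
```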