CSV file validation with Java
Original question: http://stackoverflow.com/questions/644539/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Question
I'm reading a file line by line, like this:
// 'file' is the uploaded java.io.File
FileReader myFile = new FileReader(file);
BufferedReader inputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = inputFile.readLine();
while (currentRecord != null) {
    // Process currentRecord here, then read the next line
    currentRecord = inputFile.readLine();
}
inputFile.close();
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when reading the file. So my question is: how can I check the file is CSV for sure before reading it?
Checking extension of the file is kind of lame since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Answer by VonC
Determining the MIME type of a file is not something easy to do, especially if ASCII sections can be mixed with binary ones.
Actually, when you look at how a Java mail system classifies the content of an email, it does involve reading all the bytes in it and applying some "rules". Check out the documentation in MimeUtility.java (the rules below pick a transfer encoding, but the principle is the same):
- If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
- If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
- If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
- If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
- If there is even one non-US-ASCII character, the encoding is "base64".
@return "7bit", "quoted-printable" or "base64"
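The quoted rules can be sketched in a few lines of Java. This is only an illustration of the rules as listed, not the JavaMail source: the class name `EncodingRules`, the method `chooseEncoding`, and the choice to treat "exactly half" as quoted-printable (the rules do not cover that case) are all assumptions.

```java
// Illustrative sketch only; names are hypothetical and this is not the JavaMail code.
final class EncodingRules {

    /** Applies the quoted rules to content that is already in memory. */
    static String chooseEncoding(byte[] data, boolean primaryTypeIsText) {
        int nonAscii = 0;
        for (byte b : data) {
            // Java bytes are signed, so 0x80-0xFF (non-US-ASCII) show up as negative values.
            if (b < 0) {
                nonAscii++;
            }
        }
        if (nonAscii == 0) {
            return "7bit"; // all bytes are US-ASCII, whatever the primary type
        }
        if (!primaryTypeIsText) {
            return "base64"; // non-text with even one non-ASCII byte
        }
        // Text: more than half non-ASCII means base64, otherwise quoted-printable.
        // Assumption: the "exactly half" case falls through to quoted-printable.
        return ((long) nonAscii * 2 > data.length) ? "base64" : "quoted-printable";
    }
}
```

Note that every branch needs the full byte count, which is why the whole content has to be read before a decision can be made.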
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
- it has been dead since 2006
- it does involve reading all the content!
// Magic and MagicMatch come from the MIME detection library mentioned above
File file = new File("/home/bibi/monfichieratester");
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
// Read the whole file into memory, one byte at a time
try (InputStream inputStream = new FileInputStream(file)) {
    int readByte;
    while ((readByte = inputStream.read()) != -1) {
        byteArrayStream.write(readByte);
    }
}
byte[] bytes = byteArrayStream.toByteArray();
// Let the library guess the MIME type from the content
MagicMatch m = Magic.getMagicMatch(bytes);
String mimetype = m.getMimeType();
So... since you are reading all the content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
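As a rough illustration of "your own rules", here is a minimal sketch that scans the content once and rejects anything that looks binary. The rules themselves are assumptions made for the example: a NUL byte is treated as proof of binary data (which would wrongly reject a UTF-16 CSV), and the caller chooses how many other control bytes to tolerate. `ContentSniffer` and `looksLikeText` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper; the thresholds and rules are assumptions, not a standard.
final class ContentSniffer {

    /**
     * Returns true if the stream looks like single-byte or UTF-8 text:
     * no NUL bytes, and at most maxControlRatio of the bytes are control
     * characters other than tab, line feed and carriage return.
     */
    static boolean looksLikeText(InputStream in, double maxControlRatio) throws IOException {
        long total = 0;
        long control = 0;
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                int b = buffer[i] & 0xFF; // treat the byte as unsigned
                if (b == 0) {
                    return false; // assumed rule: NUL never appears in a CSV
                }
                boolean allowedWhitespace = (b == '\t' || b == '\n' || b == '\r');
                if (b < 0x20 && !allowedWhitespace) {
                    control++;
                }
                total++;
            }
        }
        // An empty stream is rejected as well.
        return total > 0 && (double) control / total <= maxControlRatio;
    }
}
```

A check like this only says "probably text", not "valid CSV", so it would normally be a cheap first filter before any per-line validation.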
Answer by Brian Agnew
Java Mime Magic may be of use. It'll analyse MIME types from files and input streams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain, e.g. determining the number of comma-separated values per line and rejecting the file if it's not within certain limits. Then split on the commas and parse each entry according to your requirements (e.g. are they doubles, floats, or valid strings and, if strings, in what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted halfway through.
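A minimal sketch of that kind of domain-specific check might look like the following. It assumes, purely for illustration, that every line must have a fixed number of fields, that every field must parse as a double, and that fields never contain quoted commas (so a plain split is enough). `SimpleCsvCheck` and `firstInvalidLine` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical validator; the "all fields are numbers" rule is an assumption.
final class SimpleCsvCheck {

    /**
     * Returns the 1-based number of the first line that fails validation,
     * or 0 if every line passes (an empty input also returns 0).
     */
    static int firstInvalidLine(Reader source, int expectedColumns) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            // A limit of -1 keeps trailing empty fields, so "1,2," yields three fields.
            String[] fields = line.split(",", -1);
            if (fields.length != expectedColumns) {
                return lineNumber;
            }
            for (String field : fields) {
                try {
                    Double.parseDouble(field.trim());
                } catch (NumberFormatException e) {
                    return lineNumber; // empty or non-numeric field
                }
            }
        }
        return 0;
    }
}
```

Because it reports the failing line instead of stopping at the first line that looks fine, this also catches a file that starts like a CSV and turns to garbage later. For real-world CSV with quoted fields, a dedicated parser would be a better fit than `split`.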

