Java: How to check whether a file is plain text?

Question

Asked by Renato Dinhani

In my program, the user can load a file with links (it's a webcrawler), but I need to verify if the file that the user chooses is plain text or something else (only plain text will be allowed).

Is it possible to do this? If it's useful, I'm using JFileChooser to open the file.

EDIT:

What is expected from the user: a text file containing URLs.

What I want to avoid: the user loading, for example, an MP3 file or an MS Word document.

Answer 1

Accepted answer by tdammers

A file is just a series of bytes, and without further information, you cannot tell whether these bytes are supposed to be code points in some string encoding (say, ASCII or UTF-8 or ANSI-something) or something else. You will have to resort to heuristics, such as:

- Try to parse the file in a number of known encodings and see whether the parsing succeeds. If it does, chances are you have a text file.
- If you expect text files in Western languages only, you can assume that the majority of characters lie in the ASCII range (0..127), more specifically the printable range (33..126) plus whitespace (tab, newline, carriage return, space). Count the occurrences of each distinct byte value; if the overwhelming majority of the document falls in this "typical Western characters" set, it is usually safe to assume it is a text file.
- Extending the previous approach: sample a sufficiently large quantity of text in the languages you expect and build a character frequency profile. To check a file, compare its character frequency profile against your reference data and see whether it is close enough.
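As a rough illustration of the first heuristic, here is a minimal Java sketch that tries a strict decode and reports whether it succeeded. The class and method names (StrictDecodeCheck, decodesCleanly) are made up for this example and are not from the answer; it also assumes the whole file fits in memory.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper: strict decoding used as an "is this text?" heuristic. */
final class StrictDecodeCheck {

    /** Returns true if the whole byte array is valid input for the given charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    // REPORT makes the decoder throw on bad input instead of
                    // substituting a replacement character.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A call might look like StrictDecodeCheck.decodesCleanly(Files.readAllBytes(path), StandardCharsets.UTF_8). Two caveats: a single-byte charset such as ISO-8859-1 accepts every possible byte, so the check only tells you something for strict encodings like UTF-8; and if you pass only a prefix of the file, a multi-byte sequence cut off at the end will be reported as malformed.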

But here's another solution: Just treat everything you receive as text, applying the necessary transformations where needed (e.g. HTML-encode when sending to a web browser). As long as you prevent the file from being interpreted as binary data (such as a user double-clicking the file), the worst you'll produce is gibberish data.

Answer 2

Answered by Kerrek SB

Text is also a form of binary data.

I suppose what you want to check is whether there are any characters in your input that are < 32. If you can safely assume that your text is multi-byte encoded, then you could just scan through the entire file and abort if you hit a byte in the range [0, 32) (excluding 9, 10, 13, and whatever else you may expect in "text"; or, worst case, only check for null bytes [thanks, tdammers!]). If you could plausibly expect to receive UTF-16 or UTF-32 encoded text, you'll have to work harder, since ordinary characters in those encodings contain zero bytes.
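The scan described above could be sketched in Java roughly as follows. ControlByteScan and its method names are invented for this example, and the set of allowed control bytes (9, 10, 13) is just the one mentioned in the answer; adjust it to whatever you expect in your text.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: looks for control bytes that rarely occur in text. */
final class ControlByteScan {

    /** True for values in [0, 32) other than tab (9), line feed (10) and carriage return (13). */
    static boolean isSuspicious(int b) {
        return b >= 0 && b < 32 && b != 9 && b != 10 && b != 13;
    }

    /** Scans an in-memory buffer. */
    static boolean containsControlBytes(byte[] data) {
        for (byte raw : data) {
            if (isSuspicious(raw & 0xFF)) { // mask to get the unsigned byte value
                return true;
            }
        }
        return false;
    }

    /** Scans a stream and stops at the first suspicious byte; pass a buffered stream. */
    static boolean containsControlBytes(InputStream in) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            if (isSuspicious(b)) {
                return true;
            }
        }
        return false;
    }
}
```

Bytes of 128 and above are deliberately ignored here, so UTF-8 text with non-ASCII characters passes. As the answer notes, UTF-16 or UTF-32 input would be flagged because of its zero bytes.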

Answer 3

Answered by Arne Burmeister

If you do not want to guess by file extension, you can read the first portion of the file. The next problem is then the character encoding. Use a BufferedInputStream (call mark() before reading and reset() afterwards), wrap it in an InputStreamReader with the encoding "ISO-8859-1", and count the characters for which Character.isLetterOrDigit() or Character.isWhitespace() returns true to get the ratio of typical text content. I think the ratio should be more than 80% for a text file.

You can also try other encodings such as UTF-8, but you may run into invalid characters when the file is not actually UTF-8.
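A minimal sketch of the ratio idea, under a few assumptions of mine: the sample size of 4096 bytes is arbitrary, the names TextRatio, textRatio and sampleRatio are made up, and the sample is decoded with new String(..., ISO_8859_1) rather than through an InputStreamReader, so that nothing reads past the mark limit.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: share of letters, digits and whitespace in a sample. */
final class TextRatio {

    static final int SAMPLE_BYTES = 4096; // assumed sample size, not from the answer

    /** Decodes the sample as ISO-8859-1 and returns the text-like share in [0, 1]. */
    static double textRatio(byte[] sample) {
        if (sample.length == 0) {
            return 1.0; // assumption: an empty sample contains nothing non-textual
        }
        String decoded = new String(sample, StandardCharsets.ISO_8859_1);
        int textLike = 0;
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                textLike++;
            }
        }
        return (double) textLike / decoded.length();
    }

    /** Samples the start of the stream, then rewinds it with mark()/reset(). */
    static double sampleRatio(BufferedInputStream in) throws IOException {
        in.mark(SAMPLE_BYTES);
        byte[] buffer = new byte[SAMPLE_BYTES];
        int filled = 0;
        while (filled < buffer.length) {
            int read = in.read(buffer, filled, buffer.length - filled);
            if (read == -1) {
                break; // end of file before the sample was full
            }
            filled += read;
        }
        in.reset(); // the caller can now read the file from the beginning
        return textRatio(Arrays.copyOf(buffer, filled));
    }
}
```

One thing to watch for in this particular use case: punctuation is not counted as text-like, and URLs are full of it. The single line "http://example.com/" followed by a newline scores only 0.75, so the 80% threshold may need lowering, or ':', '/' and '.' could be counted as well.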

Answer 4

Answered by rossum

You can also check whether the initial bytes are a BOM (byte order mark), which should indicate a UTF-encoded file:

- UTF-8     => 0xEF, 0xBB, 0xBF
- UTF-16 BE => 0xFE, 0xFF
- UTF-16 LE => 0xFF, 0xFE
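The table above translates fairly directly into code. This is only a sketch; BomSniffer and detectBom are names invented here, and the method expects the first few bytes of the file.

```java
/** Hypothetical helper: recognises the three byte order marks listed above. */
final class BomSniffer {

    /** Returns the encoding name implied by a leading BOM, or null if there is none. */
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) { // compare unsigned byte values
                return false;
            }
        }
        return true;
    }
}
```

A BOM is a positive hint only. Many text files, including most UTF-8 ones, have no BOM at all, so a null result says nothing either way. Also note that the UTF-32 LE mark (0xFF, 0xFE, 0x00, 0x00) begins with the UTF-16 LE one and would be reported as "UTF-16LE" by this sketch.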

Answer 5

Answered by Steinway Wu

You can call the shell command file -i ${filename} from Java and check whether the output contains something like charset=binary. If it does, it is a binary file; otherwise it is a text-based file.

You can play with file in the shell on various files to get familiar with it. In Groovy I would write something like:

'file -i ${path/to/myfile}'.execute().getText().contains('charset=binary')

In Java you can also call shell commands, for example through ProcessBuilder or Runtime.exec().
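For completeness, a rough Java equivalent of the Groovy one-liner, using ProcessBuilder. It assumes a file command that understands -i is on the PATH (typical on Linux; the option may be spelled differently, or the command may be missing, on other platforms). FileCommandCheck, runFileCommand and reportsBinary are names made up for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Hypothetical helper: asks the external "file" command about a path. */
final class FileCommandCheck {

    /** True if the output of "file -i" reports the content as binary. */
    static boolean reportsBinary(String fileCommandOutput) {
        return fileCommandOutput.contains("charset=binary");
    }

    /** Runs "file -i <path>" and returns everything it printed. */
    static String runFileCommand(Path path) throws IOException, InterruptedException {
        // Arguments are passed as a list, so no shell quoting is involved.
        Process process = new ProcessBuilder("file", "-i", path.toString())
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor();
        return output.toString();
    }
}
```

Usage would be something like FileCommandCheck.reportsBinary(FileCommandCheck.runFileCommand(chooser.getSelectedFile().toPath())), where chooser is your JFileChooser. Keeping the parsing separate from the process call makes the check easy to test without the external command.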

Answer 6

Answered by Scott C Wilson

You should create a filter that looks at the file description and checks for text.
