Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/26268132/

Time: 2020-08-11 02:07:26  Source: igfitidea

All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"?

java character-encoding

Asked by Jonathan Lam

I'm creating a simple wordcount program in Java that reads through a directory's text-based files.

However, I keep on getting the error:

java.nio.charset.MalformedInputException: Input length = 1

from this line of code:

BufferedReader reader = Files.newBufferedReader(file,Charset.forName("UTF-8"));

I know I probably get this because I used a Charset that didn't include some of the characters in the text files, some of which included characters of other languages. But I want to include those characters.

I later learned from the JavaDocs that the Charset is optional and only used for a more efficient reading of the files, so I changed the code to:

BufferedReader reader = Files.newBufferedReader(file);

But some files still throw the MalformedInputException. I don't know why.

I was wondering if there is an all-inclusive Charset that will allow me to read text files with many different types of characters?

Thanks.

Accepted answer by Dawood ibn Kareem

You probably want to have a list of supported encodings. For each file, try each encoding in turn, maybe starting with UTF-8. Every time you catch the MalformedInputException, try the next encoding.
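
The approach described above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name EncodingFallback, the method names, and the two-entry candidate list are assumptions made for the example. ISO-8859-1 is placed last because it accepts any byte sequence, so it acts as a final fallback.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class EncodingFallback {
    // Hypothetical candidate list; extend it with the encodings you expect to see.
    static final List<Charset> CANDIDATES = Arrays.asList(
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    // Tries each candidate in turn and returns the first successful decoding.
    static String decode(byte[] bytes) throws CharacterCodingException {
        for (Charset cs : CANDIDATES) {
            try {
                // A fresh decoder reports malformed input by throwing
                // (MalformedInputException is a CharacterCodingException).
                return cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
            } catch (CharacterCodingException e) {
                // This encoding does not fit; try the next one.
            }
        }
        throw new CharacterCodingException(); // no candidate matched
    }

    // Reads the whole file into memory, then decodes it with the fallback chain.
    static String readFile(Path file) throws IOException {
        return decode(Files.readAllBytes(file));
    }
}
```

Note that the first encoding that decodes without error is not necessarily the right one, so the order of the candidates matters.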

Answer by Tom

I also encountered this exception, with the error message:

java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at sun.nio.cs.StreamEncoder.implWrite(Unknown Source)
at sun.nio.cs.StreamEncoder.write(Unknown Source)
at java.io.OutputStreamWriter.write(Unknown Source)
at java.io.BufferedWriter.flushBuffer(Unknown Source)
at java.io.BufferedWriter.write(Unknown Source)
at java.io.Writer.write(Unknown Source)

and found that some strange bug occurs when trying to use

BufferedWriter writer = Files.newBufferedWriter(Paths.get(filePath));

to write a String "orazg 54" cast from a generic type in a class.

//key is of generic type <Key extends Comparable<Key>>
writer.write(item.getKey() + "\t" + item.getValue() + "\n");

This String is of length 9 containing chars with the following code points:

111 114 97 122 103 9 53 52 10

However, if the BufferedWriter in the class is replaced with:

FileOutputStream outputStream = new FileOutputStream(filePath);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(outputStream));

it can successfully write this String without exceptions. In addition, if I write the same String created directly from its characters, it still works OK.

String string = new String(new char[] {111, 114, 97, 122, 103, 9, 53, 52, 10});
BufferedWriter writer = Files.newBufferedWriter(Paths.get("a.txt"));
writer.write(string);
writer.close();

Previously I had never encountered any exception when using the first BufferedWriter to write Strings. It's a strange bug that occurs with a BufferedWriter created from java.nio.file.Files.newBufferedWriter(path, options).

Answer by francesco foresti

Well, the problem is that Files.newBufferedReader(Path path) is implemented like this:

public static BufferedReader newBufferedReader(Path path) throws IOException {
    return newBufferedReader(path, StandardCharsets.UTF_8);
}

So basically there is no point in specifying UTF-8 unless you want to be descriptive in your code. If you want to try a "broader" charset you could try StandardCharsets.UTF_16, but you can't be 100% sure of getting every possible character anyway.

Answer by Pengxiang

You can try something like this, or just copy and paste the piece below.

// Assumes f (File), keyword (String) and values (List<String>) are defined by the surrounding code.
// Needs java.nio.charset.Charset, java.nio.file.Files, java.io.IOException, java.util.List and java.util.ArrayList.
List<Charset> candidates = new ArrayList<>();
candidates.add(Charset.defaultCharset()); // Try the default one first.
candidates.addAll(Charset.availableCharsets().values());

for (Charset charset : candidates) {
    try {
        List<String> lines = Files.readAllLines(f.toPath(), charset);
        for (String line : lines) {
            line = line.trim();
            if (line.contains(keyword))
                values.add(line);
        }
        // No exception: stop at the first charset that decodes the file.
        break;
    } catch (IOException e) {
        // Decoding failed; try the next charset.
        // The loop ends once every available charset has been tried.
    }
}

Answer by Xin Wang

When creating a BufferedReader with Files.newBufferedReader:

Files.newBufferedReader(Paths.get("a.txt"), StandardCharsets.UTF_8);

when running the application it may throw the following exception:

java.nio.charset.MalformedInputException: Input length = 1

But

new BufferedReader(new InputStreamReader(new FileInputStream("a.txt"),"utf-8"));

works well.

The difference is that the former uses the CharsetDecoder default action.

The default action for malformed-input and unmappable-character errors is to report them.

The latter uses the REPLACE action:

cs.newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE)
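
As a minimal sketch of how the REPLACE action can be requested explicitly, the decoder above can be passed to an InputStreamReader. The class name LenientReader and its two helper methods are assumptions made for this example; only the JDK calls themselves are standard.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientReader {
    // Wraps a stream in a UTF-8 reader that substitutes U+FFFD for bad bytes
    // instead of throwing MalformedInputException.
    static BufferedReader open(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(new InputStreamReader(in, decoder));
    }

    // Small helper used only to demonstrate the behaviour on in-memory bytes.
    static String firstLine(byte[] bytes) throws IOException {
        try (BufferedReader reader = open(new ByteArrayInputStream(bytes))) {
            return reader.readLine();
        }
    }
}
```

For a file, something like LenientReader.open(Files.newInputStream(path)) would presumably be the equivalent call. Bear in mind that replaced bytes are lost, so this hides encoding problems rather than fixing them.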

Answer by EngineerWithJava54321

I wrote the following to print a list of results to standard out based on the available charsets. Note that it also tells you which line fails, as a 0-based line number, in case you are troubleshooting which character is causing issues.

public static void testCharset(String fileName) {
    SortedMap<String, Charset> charsets = Charset.availableCharsets();
    for (String k : charsets.keySet()) {
        int line = 0;
        boolean success = true;
        try (BufferedReader b = Files.newBufferedReader(Paths.get(fileName),charsets.get(k))) {
            // readLine() returns null at end of file, so no separate ready() check is needed.
            while (b.readLine() != null) {
                line++;
            }
        } catch (IOException e) {
            success = false;
            System.out.println(k+" failed on line "+line);
        }
        if (success) 
            System.out.println("*************************  Success "+k);
    }
}

Answer by Tim Cooper

ISO-8859-1 is an all-inclusive charset, in the sense that it's guaranteed not to throw MalformedInputException, because every possible byte value maps to a character. So it's good for debugging, even if your input is not in this charset. So:

req.setCharacterEncoding("ISO-8859-1");

I had some double-right-quote/double-left-quote characters in my input, and both US-ASCII and UTF-8 threw MalformedInputException on them, but ISO-8859-1 worked.
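
The behaviour described above can be reproduced with a small sketch. It assumes the quote characters were stored as the single bytes 0x93 and 0x94 (as windows-1252 does), which is a guess about the input and not something stated in the answer; the class name Latin1Demo and its method are likewise made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    // Returns true if the bytes decode without error under a strict
    // (REPORT-mode) decoder for the given charset.
    static boolean decodesStrictly(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With the bytes {(byte) 0x93, 'h', 'i', (byte) 0x94}, decodesStrictly returns false for StandardCharsets.UTF_8 and StandardCharsets.US_ASCII and true for StandardCharsets.ISO_8859_1. The ISO-8859-1 result contains the control characters U+0093 and U+0094 rather than real quotation marks, so reading succeeds but the text is not necessarily correct.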

Answer by Vin

Try this. I had the same issue, and the implementation below worked for me:

Reader reader = Files.newBufferedReader(Paths.get(<yourfilewithpath>), StandardCharsets.ISO_8859_1);

Then use the Reader wherever you want.

For example:

CsvToBean<anyPojo> csvToBean = null;
try {
    Reader reader = Files.newBufferedReader(Paths.get(csvFilePath),
            StandardCharsets.ISO_8859_1);
    csvToBean = new CsvToBeanBuilder(reader)
            .withType(anyPojo.class)
            .withIgnoreLeadingWhiteSpace(true)
            .withSkipLines(1)
            .build();
} catch (IOException e) {
    e.printStackTrace();
}

Answer by Shahid Hussain Abbasi

ISO_8859_1 worked for me! I was reading a text file with comma-separated values.

Answer by Adriano

UTF-8 works for me with Polish characters.