Original question: http://stackoverflow.com/questions/520907/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Split Java String in chunks of 1024 bytes
Asked by user54729
What's an efficient way of splitting a String into chunks of 1024 bytes in Java? If there is more than one chunk, then the header (a fixed-size string) needs to be repeated in all subsequent chunks.
Accepted answer by Michael Borgwardt
Strings and bytes are two completely different things, so wanting to split a String into bytes is as meaningless as wanting to split a painting into verses.
What is it that you actually want to do?
To convert between strings and bytes, you need to specify an encoding that can encode all the characters in the String. Depending on the encoding and the characters, some of them may span more than one byte.
You can split the String into chunks of 1024 characters and encode those as bytes, but then each chunk may be more than 1024 bytes.
Or you can encode the original string into bytes and then split them into chunks of 1024, but then you have to make sure to append them as bytes before decoding the whole into a String again, or you may get garbled characters at the split points when a character spans more than 1 byte.
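Both pitfalls can be sketched in a few lines. This is only an illustration, assuming UTF-8 as the encoding; the class name SplitPitfalls and its helper methods are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SplitPitfalls {
    // Number of bytes a chunk of characters occupies once encoded as UTF-8.
    static int encodedLength(String chunk) {
        return chunk.getBytes(StandardCharsets.UTF_8).length;
    }

    // Decodes bytes[from, to) on its own, the way a careless receiver might.
    static String decodeSlice(byte[] bytes, int from, int to) {
        return new String(Arrays.copyOfRange(bytes, from, to), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "a\u00e9"; // U+00E9 (e with acute) takes two bytes in UTF-8
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8); // three bytes in total

        // Two characters, but more than two bytes once encoded.
        System.out.println(encodedLength(text)); // 3

        // Cutting after byte 2 splits the two-byte character; decoding piecewise garbles it.
        String piecewise = decodeSlice(bytes, 0, 2) + decodeSlice(bytes, 2, 3);
        System.out.println(piecewise.equals(text)); // false

        // Decoding the reassembled bytes as a whole is fine.
        System.out.println(decodeSlice(bytes, 0, bytes.length).equals(text)); // true
    }
}
```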
If you're worried about memory usage when the String can be very long, you should use streams (java.io package) to do the en/decoding and splitting, in order to avoid keeping the data in memory several times as copies. Ideally, you should avoid having the original String in one piece at all and instead use streams to read it in small chunks from wherever you get it from.
Answer by Aaron Digulla
You have two ways: the fast way and the memory-conservative way. But first, you need to know what characters are in the String. ASCII? Are there umlauts (characters between 128 and 255) or even other Unicode characters (s.charAt(i) returns something > 255)? Depending on that, you will need to use a different encoding. If you have binary data, try "iso-8859-1" because it will preserve the data in the String. If you have Unicode, try "utf-8". I'll assume binary data:
String encoding = "iso-8859-1";
The fastest way:
ByteArrayInputStream in = new ByteArrayInputStream (string.getBytes(encoding));
Note that a Java String is Unicode (UTF-16), so every char needs two bytes. You will have to specify the encoding (don't rely on the "platform default"; this will only cause pain later).
Now you can read it in 1024-byte chunks using
byte[] buffer = new byte[1024];
int len;
while ((len = in.read(buffer)) > 0) { ... }
This needs about three times as much RAM as the original String.
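Put together, the fast way might look like the sketch below. The class name FastChunks, the split method and the chunkSize parameter are assumptions made for this example; header handling is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FastChunks {
    // Encodes the whole string up front, then cuts the bytes into pieces of
    // at most chunkSize bytes. chunkSize must be greater than zero.
    static List<byte[]> split(String string, int chunkSize) {
        ByteArrayInputStream in =
                new ByteArrayInputStream(string.getBytes(StandardCharsets.ISO_8859_1));
        List<byte[]> chunks = new ArrayList<>();
        byte[] buffer = new byte[chunkSize];
        int len;
        while ((len = in.read(buffer, 0, buffer.length)) > 0) {
            // Copy only the bytes actually read; the last chunk is usually shorter.
            chunks.add(Arrays.copyOf(buffer, len));
        }
        return chunks;
    }
}
```

Calling FastChunks.split(text, 1024) on a 2500-character Latin-1 string yields three chunks of 1024, 1024 and 452 bytes.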
A more memory-conservative way is to write a converter which takes a StringReader and an OutputStreamWriter (which wraps a ByteArrayOutputStream). Copy characters from the reader to the writer until the underlying buffer contains one chunk of data:
When it does, copy the data to the real output (prepending the header), copy the additional bytes (which the Unicode->byte conversion may have generated) to a temp buffer, call buffer.reset() and write the temp buffer to buffer.
Code looks like this (untested):
StringReader r = new StringReader(string);
ByteArrayOutputStream buffer = new ByteArrayOutputStream(1024 * 2); // Twice as large as necessary
OutputStreamWriter w = new OutputStreamWriter(buffer, encoding);
char[] cbuf = new char[100];
byte[] tempBuf;
int len;
while ((len = r.read(cbuf, 0, cbuf.length)) > 0) {
    w.write(cbuf, 0, len);
    w.flush(); // push the encoded bytes through to buffer
    if (buffer.size() >= 1024) {
        tempBuf = buffer.toByteArray();
        // ... ready to process one chunk: the first 1024 bytes of tempBuf ...
        buffer.reset();
        if (tempBuf.length > 1024) {
            // carry the overflow bytes over into the next chunk
            buffer.write(tempBuf, 1024, tempBuf.length - 1024);
        }
    }
}
// ... check if some data is left in buffer and process that, too ...
This only needs a couple of kilobytes of RAM.
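If each chunk has to be decodable on its own, a variant of this idea is to let a CharsetEncoder fill a fixed-size ByteBuffer, so that a multi-byte character is never cut in half. This is not the code from the answer above, only a sketch under the assumption of a stateless encoding such as UTF-8; the class name BoundarySafeChunks and its split method are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.ArrayList;
import java.util.List;

final class BoundarySafeChunks {
    // Returns chunks of at most maxBytes bytes, each ending on a character boundary.
    static List<byte[]> split(String s, Charset charset, int maxBytes)
            throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder();
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        List<byte[]> chunks = new ArrayList<>();
        while (in.hasRemaining()) {
            out.clear();
            // The encoder stops with OVERFLOW before a character that no longer fits.
            CoderResult result = encoder.encode(in, out, true);
            if (result.isError()) {
                result.throwException(); // malformed or unmappable input
            }
            if (out.position() == 0) {
                throw new IllegalArgumentException("maxBytes is too small for one character");
            }
            out.flip();
            byte[] chunk = new byte[out.remaining()];
            out.get(chunk);
            chunks.add(chunk);
        }
        return chunks;
    }
}
```

The trade-off is that chunks may come out slightly shorter than maxBytes whenever the next character would not fit.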
[EDIT] There has been a lengthy discussion about binary data in Strings in the comments. First of all, it's perfectly safe to put binary data into a String as long as you are careful when creating it and storing it somewhere. To create such a String, take a byte[] array and:
String safe = new String (array, "iso-8859-1");
In Java, ISO-8859-1 (a.k.a. ISO-Latin1) is a 1:1 mapping. This means the bytes in the array will not be interpreted in any way. Now you can use substring() and the like on the data, search it with indexOf(), run regexps on it, etc. For example, find the position of a 0-byte:
int pos = safe.indexOf('\u0000');
This is especially useful if you don't know the encoding of the data and want to have a look at it before some codec messes with it.
To write the data somewhere, the reverse operation is:
byte[] data = safe.getBytes("iso-8859-1");
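The round trip can be checked for every possible byte value. A small sketch; the class name Latin1RoundTrip and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1RoundTrip {
    // bytes -> String -> bytes, using ISO-8859-1 in both directions.
    static byte[] roundTrip(byte[] array) {
        String safe = new String(array, StandardCharsets.ISO_8859_1);
        return safe.getBytes(StandardCharsets.ISO_8859_1);
    }

    // An array holding each of the 256 possible byte values once.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(Arrays.equals(all, roundTrip(all))); // true
    }
}
```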
Never use the default methods new String(array) or String.getBytes()! One day, your code is going to be executed on a different platform and it will break.
Now the problem of characters > 255 in the String. If you use this method, you won't ever have any such character in your Strings. That said, if there were any for some reason, be aware that getBytes() does not throw an Exception: there is no way to express all Unicode characters in ISO-Latin1, and in practice the unmappable ones are silently replaced with '?'. If you want the code to fail loudly instead, encode through a CharsetEncoder configured with CodingErrorAction.REPORT.
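For what it's worth, String.getBytes(Charset) is documented to replace unmappable characters with the charset's replacement byte rather than throw. A strict encoder that does raise an exception could be sketched like this; the class name StrictLatin1 and its encode method are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictLatin1 {
    // Encodes s as ISO-8859-1 and throws if any character does not fit.
    static byte[] encode(String s) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }
}
```

With this helper, StrictLatin1.encode("\u20ac") throws a CharacterCodingException, whereas "\u20ac".getBytes(StandardCharsets.ISO_8859_1) quietly returns a single '?' byte.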
Some might argue that this is not safe enough and you should never mix bytes and String. In this day and age, we don't have that luxury. A lot of data has no explicit encoding information (files, for example, don't have an "encoding" attribute in the same way as they have access permissions or a name). XML is one of the few formats which has explicit encoding information, and there are editors like Emacs or jEdit which use comments to specify this vital information. This means that, when processing streams of bytes, you must always know in which encoding they are. As of now, it's not possible to write code which will always work, no matter where the data comes from.
Even with XML, you must read the header of the file as bytes to determine the encoding before you can decode the meat.
The important point is to sit down and figure out which encoding was used to generate the data stream you have to process. If you do that, you're good; if you don't, you're doomed. The confusion originates from the fact that most people are not aware that the same byte can mean different things depending on the encoding, or even that there is more than one encoding. Also, it would have helped if Sun hadn't introduced the notion of "platform default encoding."
Important points for beginners:
- There is more than one encoding (charset).
- There are more characters than the English language uses. There are even several sets of digits (ASCII, full width, Arabic-Indic, Bengali).
- You must know which encoding was used to generate the data which you are processing.
- You must know which encoding you should use to write the data you are processing.
- You must know the correct way to specify this encoding information so the next program can decode your output (XML header, HTML meta tag, special encoding comment, whatever).
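To make the point about knowing the encoding concrete, here is a small sketch that decodes the same two bytes with two different charsets. The class name SameBytes and its decode helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SameBytes {
    // Decodes the given bytes with the given charset.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC3, (byte) 0xA9};
        // As UTF-8 these two bytes form one character, U+00E9.
        System.out.println(decode(data, StandardCharsets.UTF_8).length()); // 1
        // As ISO-8859-1 each byte is its own character: U+00C3 followed by U+00A9.
        System.out.println(decode(data, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```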
The days of ASCII are over.
Answer by Alan Deep
I know I am late, but I was looking for a solution myself and this is what worked best for me:
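import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

private static String chunk_split(String original, int length, String separator) throws IOException {
    // ISO-8859-1 maps each byte to exactly one char, so the cast below is only valid for Latin-1 text
    ByteArrayInputStream bis = new ByteArrayInputStream(original.getBytes(StandardCharsets.ISO_8859_1));
    int n;
    byte[] buffer = new byte[length];
    StringBuilder result = new StringBuilder();
    while ((n = bis.read(buffer)) > 0) {
        // Append only the n bytes actually read; the last chunk may be shorter than the buffer
        for (int i = 0; i < n; i++) {
            result.append((char) (buffer[i] & 0xFF));
        }
        result.append(separator);
    }
    return result.toString();
}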
Example:
public static void main(String[] args) throws IOException {
    String original = "abcdefghijklmnopqrstuvwxyz";
    System.out.println(chunk_split(original, 5, "\n"));
}
Output:
abcde
fghij
klmno
pqrst
uvwxy
z
Answer by SureshCS50
I was trying this for myself: I needed to chunk a huge String (nearly 10 MB) into 1 MB pieces. This chunks the data in a minimal amount of time (less than a second).
private static ArrayList<String> chunkLogMessage(String logMessage) throws Exception {
    ArrayList<String> messages = new ArrayList<>();
    // Encode once with an explicit charset instead of calling getBytes() repeatedly
    byte[] bytes = logMessage.getBytes("UTF-8");
    if (bytes.length > CHUNK_SIZE) {
        Log.e("chunk_started", System.currentTimeMillis() + "");
        byte[] buffer = new byte[CHUNK_SIZE];
        ByteArrayInputStream inputStream = new ByteArrayInputStream(bytes);
        int read;
        while ((read = inputStream.read(buffer, 0, buffer.length)) != -1) {
            // Decode only the bytes actually read. Caveat: a multi-byte UTF-8
            // character that straddles a chunk boundary will be garbled.
            messages.add(new String(buffer, 0, read, "UTF-8"));
        }
        Log.e("chunk_ended", System.currentTimeMillis() + "");
        return messages;
    }
    messages.add(logMessage);
    return messages;
}
Logcat:
22:08:00.262 3382-3425/com.sample.app E/chunk_started: 1533910080261
22:08:01.228 3382-3425/com.sample.app E/chunk_ended: 1533910081228
22:08:02.468 3382-3425/com.sample.app E/chunk_started: 1533910082468
22:08:03.478 3382-3425/com.sample.app E/chunk_ended: 1533910083478
22:09:19.801 3382-3382/com.sample.app E/chunk_started: 1533910159801
22:09:20.662 3382-3382/com.sample.app E/chunk_ended: 1533910160662
Answer by gbenroscience
Yes, most if not all the above would definitely work.
Or you could check out this project, which does exactly that; it is able to chunk not just strings, but also byte arrays, input streams and files.
It has 2 classes: DataChunker and StringChunker
DataChunker chunker = new DataChunker(8192, blob) {
    @Override
    public void chunkFound(byte[] foundChunk, int bytesProcessed) {
        //process chunk here
    }
    @Override
    public void chunksExhausted(int bytesProcessed) {
        //called when all the blocks have been exhausted
    }
};
String blob = "Experience is wasted if history does not repeat itself...Gbemiro Jiboye";
final StringBuilder builder = new StringBuilder();
StringChunker chunker = new StringChunker(4, blob) {
    @Override
    public void chunkFound(String foundChunk, int bytesProcessed) {
        builder.append(foundChunk);
        System.out.println("Found: " + foundChunk + ", bytesProcessed: " + bytesProcessed + " bytes");
    }
    @Override
    public void chunksExhausted(int bytesProcessed) {
        System.out.println("Processed all of: " + bytesProcessed + " bytes. Rebuilt string is: " + builder.toString());
    }
};
The blob passed to DataChunker's constructor is either a byte array, a File, or an InputStream.