Java: Truncating Strings by Bytes
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/3576754/
Truncating Strings by Bytes
Asked by stevebot
I created the following to truncate a string in Java to a new string with a given number of bytes.
String truncatedValue = "";
String currentValue = string;
int pivotIndex = (int) Math.round(((double) string.length())/2);
while(!truncatedValue.equals(currentValue)){
    currentValue = string.substring(0,pivotIndex);
    byte[] bytes = null;
    bytes = currentValue.getBytes(encoding);
    if(bytes==null){
        return string;
    }
    int byteLength = bytes.length;
    int newIndex = (int) Math.round(((double) pivotIndex)/2);
    if(byteLength > maxBytesLength){
        pivotIndex = newIndex;
    } else if(byteLength < maxBytesLength){
        pivotIndex = pivotIndex + 1;
    } else {
        truncatedValue = currentValue;
    }
}
return truncatedValue;
This is the first thing that came to my mind, and I know I could improve on it. I saw another post asking a similar question, but there they were truncating Strings using the bytes instead of String.substring. I think I would rather use String.substring in my case.
EDIT: I just removed the UTF8 reference because I would rather be able to do this for different storage types as well.
Answer by Rex Kerr
Why not convert to bytes and walk forward--obeying UTF8 character boundaries as you do it--until you've got the max number, then convert those bytes back into a string?
Or you could just cut the original string if you keep track of where the cut should occur:
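// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
import java.nio.charset.StandardCharsets;

public class UTF8Cutter {
    public static String cut(String s, int n) {
        // Encode explicitly as UTF-8; the no-argument getBytes() uses the platform default charset.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < n) n = utf8.length;
        int n16 = 0;      // number of UTF-16 chars to keep
        int advance = 1;  // UTF-16 chars used by the current character
        int i = 0;        // index into the UTF-8 bytes
        while (i < n) {
            advance = 1;
            if ((utf8[i] & 0x80) == 0) i += 1;            // 0xxxxxxx: 1 byte
            else if ((utf8[i] & 0xE0) == 0xC0) i += 2;    // 110xxxxx: 2 bytes
            else if ((utf8[i] & 0xF0) == 0xE0) i += 3;    // 1110xxxx: 3 bytes
            else { i += 4; advance = 2; }                 // 4 bytes, surrogate pair in UTF-16
            if (i <= n) n16 += advance;                   // keep the character only if it fits entirely
        }
        return s.substring(0,n16);
    }
}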
Note: edited to fix bugs on 2014-08-25
Answer by kan
A saner solution is to use a decoder:
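// Needs java.nio.ByteBuffer, java.nio.CharBuffer, java.nio.charset.Charset,
// java.nio.charset.CharsetDecoder and java.nio.charset.CodingErrorAction.
final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
final byte[] bytes = inputString.getBytes(CHARSET);
final CharsetDecoder decoder = CHARSET.newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE); // silently drop a character split by the cut
decoder.reset();
// limit is the maximum number of bytes to keep; clamp it so wrap() cannot run past the array
final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, Math.min(limit, bytes.length)));
final String outputString = decoded.toString();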
Answer by Zsolt Taskai
I think Rex Kerr's solution has 2 bugs.
- First, it will truncate to limit+1 if a non-ASCII character is just before the limit. Truncating "123456789á1" to a limit of 10 will result in "123456789á", which takes 11 bytes in UTF-8.
- Second, I think he misinterpreted the UTF-8 standard. https://en.wikipedia.org/wiki/UTF-8#Description shows that a 110xxxxx at the beginning of a UTF-8 sequence tells us that the representation is 2 bytes long (as opposed to 3). That's the reason his implementation usually doesn't use up all available space (as Nissim Avitan noted).
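The lead-byte rule mentioned in the second point can be sketched as a small helper. This is only an illustration of the bit patterns in the UTF-8 description; the class and method names (`Utf8LeadByte`, `sequenceLength`) are made up for this example and appear in none of the answers.

```java
public class Utf8LeadByte {
    /**
     * Returns how many bytes the UTF-8 sequence starting with this lead byte occupies,
     * or 0 if the byte is a continuation byte (10xxxxxx) or not a valid lead byte.
     */
    public static int sequenceLength(byte b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx: needs a surrogate pair in UTF-16
        return 0;                         // continuation byte or invalid lead byte
    }
}
```

For example, the first byte of "á" in UTF-8 is 0xC3 (binary 11000011), so the helper reports a 2-byte sequence.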
Please find my corrected version below:
public String cut(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return s;
    }
    int n16 = 0;
    boolean extraLong = false;
    int i = 0;
    while (i < charLimit) {
        // Unicode characters above U+FFFF need 2 words in utf16
        extraLong = ((utf8[i] & 0xF0) == 0xF0);
        if ((utf8[i] & 0x80) == 0) {
            i += 1;
        } else {
            int b = utf8[i];
            while ((b & 0x80) > 0) {
                ++i;
                b = b << 1;
            }
        }
        if (i <= charLimit) {
            n16 += (extraLong) ? 2 : 1;
        }
    }
    return s.substring(0, n16);
}
I still thought this was far from efficient. So if you don't really need the String representation of the result and a byte array will do, you can use this:
private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return utf8;
    }
    if ((utf8[charLimit] & 0xC0) != 0x80) {
        // the byte at the limit starts a new character, so the cut doesn't split an UTF-8 sequence
        return Arrays.copyOf(utf8, charLimit);
    }
    // the limit falls inside a multi-byte sequence: step back over its continuation bytes (10xxxxxx)
    int i = 0;
    while ((utf8[charLimit-i-1] & 0xC0) == 0x80) {
        ++i;
    }
    // utf8[charLimit-i-1] is now the starter byte of the split sequence; skip it as well
    return Arrays.copyOf(utf8, charLimit-i-1);
}
The funny thing is that with a realistic 20-500 byte limit they perform pretty much the same if you create a string from the byte array again.
Please note that both methods assume a valid utf-8 input which is a valid assumption after using Java's getBytes() function.
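The assumption can be checked with a short experiment. The sketch below relies on the documented behaviour of `String.getBytes(Charset)`, which replaces malformed input with the charset's default replacement bytes (a `?` for UTF-8), so even a lone surrogate does not produce invalid UTF-8. The class and method names are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogateDemo {
    /** Encodes "a" followed by a lone high surrogate, which is malformed UTF-16. */
    public static byte[] encodeMalformed() {
        String malformed = "a\uD83D"; // the low surrogate is deliberately missing
        return malformed.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        for (byte b : encodeMalformed()) {
            System.out.printf("%02X ", b); // the surrogate is replaced, not encoded as-is
        }
        System.out.println();
    }
}
```

The downside is that the replacement is silent, so a string cut in the middle of a surrogate pair will round-trip with a `?` at the end instead of failing.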
Answer by bmargulies
Use the UTF-8 CharsetEncoder, and encode until the output ByteBuffer contains as many bytes as you are willing to take, by looking for CoderResult.OVERFLOW.
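A minimal sketch of this idea follows, under some assumptions: the class name `EncoderTruncate` and the method `truncate` are invented for the example, the charset is passed in so that it is not tied to UTF-8, and `maxBytes` is assumed to be non-negative. Instead of inspecting the `CoderResult` explicitly, the sketch reads the input buffer's position after encoding, which tells how many chars were fully encoded before the output buffer filled up.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class EncoderTruncate {
    /** Returns the longest prefix of s whose encoding in the given charset fits in maxBytes. */
    public static String truncate(String s, Charset charset, int maxBytes) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // room for exactly maxBytes
        CharBuffer in = CharBuffer.wrap(s);
        // Encoding stops with CoderResult.OVERFLOW once the next character no longer fits;
        // the encoder never writes part of a character.
        encoder.encode(in, out, true);
        // in.position() is the number of chars that were consumed, i.e. that fit.
        return s.substring(0, in.position());
    }
}
```

With UTF-8, `truncate("123456789á1", StandardCharsets.UTF_8, 10)` keeps only "123456789", because "á" would need bytes 10 and 11. Stateful charsets that emit escape sequences may need extra care (for example a final `flush`), which this sketch does not attempt.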
Answer by shadow
The second approach here works well: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
Answer by Ilya Lysenko
byte[] utf8 = s.getBytes("UTF-8");
// Clamp the length so that short strings don't throw; a character split by the cut decodes to U+FFFD.
s = new String(utf8, 0, Math.min(utf8.length, MAX_LENGTH - 2), "UTF-8");
Answer by Nissim Avitan
As noted, Peter Lawrey's solution has a major performance disadvantage (~3,500 ms for 10,000 runs). Rex Kerr's was much better (~500 ms for 10,000 runs), but the result was not accurate: it cut much more than it needed to (in one example it kept 3,500 bytes instead of 4,000). Attached is my solution (~250 ms for 10,000 runs), assuming that the maximum length of a UTF-8 character is 4 bytes (thanks, Wikipedia):
public static String cutWord (String word, int dbLimit) throws UnsupportedEncodingException{
    double MAX_UTF8_CHAR_LENGTH = 4.0;
    if(word.length()>dbLimit){
        word = word.substring(0, dbLimit);
    }
    if(word.length() > dbLimit/MAX_UTF8_CHAR_LENGTH){
        int residual=word.getBytes("UTF-8").length-dbLimit;
        if(residual>0){
            int tempResidual = residual,start, end = word.length();
            while(tempResidual > 0){
                start = end-((int) Math.ceil((double)tempResidual/MAX_UTF8_CHAR_LENGTH));
                tempResidual = tempResidual - word.substring(start,end).getBytes("UTF-8").length;
                end=start;
            }
            word = word.substring(0, end);
        }
    }
    return word;
}
Answer by Peter Lawrey
You could convert the string to bytes and convert just those bytes back to a string.
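public static String substring(String text, int maxBytes) {
    StringBuilder ret = new StringBuilder();
    // Step by code point so that a surrogate pair is measured as one character.
    for(int i = 0; i < text.length(); ) {
        int codePoint = text.codePointAt(i);
        int next = i + Character.charCount(codePoint);
        // works out how many bytes a character takes,
        // and removes these from the total allowed.
        // The charset is explicit (java.nio.charset.StandardCharsets); any other Charset can be used.
        if((maxBytes -= text.substring(i, next).getBytes(StandardCharsets.UTF_8).length) < 0) break;
        ret.appendCodePoint(codePoint);
        i = next;
    }
    return ret.toString();
}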
Answer by Сергей Сенько
This is mine:
private static final int FIELD_MAX = 2000;
private static final Charset CHARSET = Charset.forName("UTF-8");

public String trancStatus(String status) {
    if (status != null && (status.getBytes(CHARSET).length > FIELD_MAX)) {
        // Binary search for the longest prefix whose encoding fits in FIELD_MAX bytes.
        int left = 0, right = status.length();
        while (left < right) {
            int index = left + (right - left + 1) / 2;
            int bytes = status.substring(0, index).getBytes(CHARSET).length;
            if (bytes <= FIELD_MAX) {
                left = index;       // this prefix fits, try a longer one
            } else {
                right = index - 1;  // this prefix is too long, try a shorter one
            }
        }
        // Don't leave half of a surrogate pair at the end of the result.
        if (left > 0 && Character.isHighSurrogate(status.charAt(left - 1))) {
            left--;
        }
        return status.substring(0, left);
    } else {
        return status;
    }
}
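Answer by Gokul Limbe
Using the regular expression below, you can also remove leading and trailing whitespace, including double-byte (full-width) whitespace characters.
// \\u3000 is the full-width (ideographic) space; it is assumed here to be the second character of the original character class.
stringtoConvert = stringtoConvert.replaceAll("^[\\s\\u3000]*", "").replaceAll("[\\s\\u3000]*$", "");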

