Java: truncating a string by bytes

Question

Asked by stevebot

I created the following to truncate a string in Java to a new string with a given number of bytes.

        String truncatedValue = "";
        String currentValue = string;
        int pivotIndex = (int) Math.round(((double) string.length())/2);
        while(!truncatedValue.equals(currentValue)){
            currentValue = string.substring(0,pivotIndex);
            byte[] bytes = null;
            bytes = currentValue.getBytes(encoding);
            if(bytes==null){
                return string;
            }
            int byteLength = bytes.length;
            int newIndex =  (int) Math.round(((double) pivotIndex)/2);
            if(byteLength > maxBytesLength){
                pivotIndex = newIndex;
            } else if(byteLength < maxBytesLength){
                pivotIndex = pivotIndex + 1;
            } else {
                truncatedValue = currentValue;
            }
        }
        return truncatedValue;

This is the first thing that came to mind, and I know I could improve on it. I saw another post asking a similar question, but the answers there truncated the string by working on the bytes instead of using String.substring. I think I would rather use String.substring in my case.

EDIT: I just removed the UTF8 reference because I would rather be able to do this for different storage types as well.

Answer 1

Answered by Rex Kerr

Why not convert to bytes and walk forward, obeying UTF-8 character boundaries as you go, until you've reached the maximum number of bytes, then convert those bytes back into a string?
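
A minimal sketch of the "convert to bytes, respect character boundaries, convert back" idea, assuming UTF-8. The class and method names are hypothetical, and instead of walking forward it backs up from the limit until the first dropped byte is no longer a continuation byte (10xxxxxx), which gives the same cut point:

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteTruncator {
    /** Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes. */
    static String truncate(String s, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= maxBytes) {
            return s; // already fits, nothing to cut
        }
        int end = maxBytes;
        // utf8[end] is the first byte that would be dropped. If it is a
        // continuation byte, the cut would split a character, so move the
        // cut back to that character's lead byte.
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }
}
```

For example, Utf8ByteTruncator.truncate("123456789á1", 10) returns "123456789", because "á" occupies bytes 10 and 11 and does not fit within a 10-byte limit.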

Or you could just cut the original string if you keep track of where the cut should occur:

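// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
import java.nio.charset.StandardCharsets;

public class UTF8Cutter {
  public static String cut(String s, int n) {
    // Encode explicitly as UTF-8; a bare getBytes() uses the platform default charset.
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    if (utf8.length < n) n = utf8.length;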
    int n16 = 0;
    int advance = 1;
    int i = 0;
    while (i < n) {
      advance = 1;
      if ((utf8[i] & 0x80) == 0) i += 1;
      else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
      else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
      else { i += 4; advance = 2; }
      if (i <= n) n16 += advance;
    }
    return s.substring(0,n16);
  }
}

Note: edited to fix bugs on 2014-08-25.

Answer 2

Answered by kan

A saner solution is to use a decoder:

final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
final byte[] bytes = inputString.getBytes(CHARSET);
final CharsetDecoder decoder = CHARSET.newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.reset();
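// limit is the maximum number of bytes to keep; clamp it so inputs shorter than limit don't throw
final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, Math.min(limit, bytes.length)));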
final String outputString = decoded.toString();

Answer 3

Answered by Zsolt Taskai

I think Rex Kerr's solution has 2 bugs.

First, it will truncate to limit+1 if a non-ASCII character is just before the limit. Truncating "123456789á1" will result in "123456789á", which takes 11 bytes in UTF-8.
Second, I think he misinterpreted the UTF-8 standard. https://en.wikipedia.org/wiki/UTF-8#Description shows that a 110xxxxx at the beginning of a UTF-8 sequence tells us that the representation is 2 bytes long (as opposed to 3). That's the reason his implementation usually doesn't use up all available space (as Nissim Avitan noted).
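
To make the lead-byte rule concrete, here is a small sketch that maps the first byte of a UTF-8 sequence to the sequence length by bit pattern alone. The class and method names are hypothetical, and it does not reject the few lead-byte values that match a pattern but are never used in valid UTF-8:

```java
final class Utf8LeadByte {
    /**
     * Length in bytes of the UTF-8 sequence introduced by lead byte b,
     * or -1 if b is a continuation byte (10xxxxxx) or matches no lead pattern.
     */
    static int sequenceLength(byte b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }
}
```

The character "á" encodes as the two bytes 0xC3 0xA1, so sequenceLength((byte) 0xC3) is 2, and "123456789á" is 9 + 2 = 11 bytes long.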

Please find my corrected version below:

public String cut(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return s;
    }
    int n16 = 0;
    boolean extraLong = false;
    int i = 0;
    while (i < charLimit) {
        // Unicode characters above U+FFFF need 2 words in utf16
        extraLong = ((utf8[i] & 0xF0) == 0xF0);
        if ((utf8[i] & 0x80) == 0) {
            i += 1;
        } else {
            int b = utf8[i];
            while ((b & 0x80) > 0) {
                ++i;
                b = b << 1;
            }
        }
        if (i <= charLimit) {
            n16 += (extraLong) ? 2 : 1;
        }
    }
    return s.substring(0, n16);
}

I still thought this was far from efficient. So if you don't really need the String representation of the result and the byte array will do, you can use this:

private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return utf8;
    }
    if ((utf8[charLimit] & 0xC0) != 0x80) {
        // the first dropped byte is not a continuation byte, so the limit doesn't cut a UTF-8 sequence
        return Arrays.copyOf(utf8, charLimit);
    }
    int i = 0;
    while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
        ++i;
    }
    if ((utf8[charLimit-i-1] & 0x80) > 0) {
        // we have to skip the starter UTF-8 byte
        return Arrays.copyOf(utf8, charLimit-i-1);
    } else {
        // we passed all UTF-8 bytes
        return Arrays.copyOf(utf8, charLimit-i);
    }
}

The funny thing is that with a realistic 20-500 byte limit they perform pretty much the same IF you create a string from the byte array again.

Please note that both methods assume valid UTF-8 input, which is a safe assumption for bytes produced by Java's getBytes("UTF-8").
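
The assumption can be checked with a strict decoder. The sketch below (hypothetical class and method names) reports whether a byte array is well-formed UTF-8. As documented for String.getBytes(Charset), malformed input such as an unpaired surrogate is replaced with the charset's replacement bytes ("?" for UTF-8), so the encoded output still passes the check:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validity {
    /** Returns true if bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. a sequence cut off in the middle
        }
    }
}
```

A byte array cut in the middle of a character, such as the single byte 0xC3, fails this check, which is exactly the situation the truncation methods above are written to avoid.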

Answer 4

Answered by bmargulies

Use the UTF-8 CharsetEncoder, and encode until the output ByteBuffer contains as many bytes as you are willing to take, by looking for CoderResult.OVERFLOW.
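
A minimal sketch of this approach, with hypothetical class and method names. It encodes into a ByteBuffer that holds exactly maxBytes and, on CoderResult.OVERFLOW, uses the input buffer's position as the number of chars that fit, so the result can be taken with String.substring and the charset is a parameter. It assumes a stateless encoding such as UTF-8 and does not call flush():

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

final class EncoderTruncator {
    /** Returns the longest prefix of s that encodes to at most maxBytes in charset. */
    static String truncate(String s, int maxBytes, Charset charset) {
        if (maxBytes <= 0) {
            return "";
        }
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // room for maxBytes only
        CharBuffer in = CharBuffer.wrap(s);
        CoderResult result = encoder.encode(in, out, true);
        if (!result.isOverflow()) {
            return s; // the whole string fit into the buffer
        }
        // On overflow the encoder stops before the first character that did
        // not fit, so in.position() counts the chars that were encoded.
        return s.substring(0, in.position());
    }
}
```

With UTF-8, EncoderTruncator.truncate("123456789á1", 10, StandardCharsets.UTF_8) returns "123456789", and a limit of 11 returns "123456789á".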

Answer 5

Answered by shadow

The second approach described here works well: http://www.jroller.com/holy/entry/truncating_utf_string_to_the

Answer 6

Answered by Ilya Lysenko

s = new String(s.getBytes("UTF-8"), 0, MAX_LENGTH - 2, "UTF-8");

Answer 7

Answered by Nissim Avitan

As noted, Peter Lawrey's solution has a major performance disadvantage (~3,500 ms for 10,000 calls). Rex Kerr's was much better (~500 ms for 10,000 calls), but the result was not accurate: it cut much more than it needed to (in one example it kept 3,500 bytes instead of 4,000). Here is my solution (~250 ms for 10,000 calls), assuming that the maximum length of a UTF-8 character is 4 bytes (thanks, Wikipedia):

public static String cutWord (String word, int dbLimit) throws UnsupportedEncodingException{
    double MAX_UTF8_CHAR_LENGTH = 4.0;
    if(word.length()>dbLimit){
        word = word.substring(0, dbLimit);
    }
    if(word.length() > dbLimit/MAX_UTF8_CHAR_LENGTH){
        int residual=word.getBytes("UTF-8").length-dbLimit;
        if(residual>0){
            int tempResidual = residual,start, end = word.length();
            while(tempResidual > 0){
                start = end-((int) Math.ceil((double)tempResidual/MAX_UTF8_CHAR_LENGTH));
                tempResidual = tempResidual - word.substring(start,end).getBytes("UTF-8").length;
                end=start;
            }
            word = word.substring(0, end);
        }
    }
    return word;
}

Answer 8

Answered by Peter Lawrey

You could convert the string to bytes and convert just those bytes back to a string.

public static String substring(String text, int maxBytes) {
   StringBuilder ret = new StringBuilder();
   for(int i = 0;i < text.length(); i++) {
       // works out how many bytes a character takes, 
       // and removes these from the total allowed.
       if((maxBytes -= text.substring(i, i+1).getBytes().length) < 0) break;
       ret.append(text.charAt(i));
   }
   return ret.toString();
}

Answer 9

Answered by Сергей Сенько

This is mine:

private static final int FIELD_MAX = 2000;
private static final Charset CHARSET =  Charset.forName("UTF-8"); 

public String trancStatus(String status) {

    if (status != null && (status.getBytes(CHARSET).length > FIELD_MAX)) {
        int maxLength = FIELD_MAX;

        int left = 0, right = status.length();
        int index = 0, bytes = 0, sizeNextChar = 0;

        while (bytes != maxLength && (bytes > maxLength || (bytes + sizeNextChar < maxLength))) {

            index = left + (right - left) / 2;

            bytes = status.substring(0, index).getBytes(CHARSET).length;
            sizeNextChar = String.valueOf(status.charAt(index + 1)).getBytes(CHARSET).length;

            if (bytes < maxLength) {
                left = index - 1;
            } else {
                right = index + 1;
            }
        }

        return status.substring(0, index);

    } else {
        return status;
    }
}

Answer 10

Answered by Gokul Limbe

Using the regular expression below, you can also remove leading and trailing whitespace, including the double-byte (full-width) space character.

stringtoConvert = stringtoConvert.replaceAll("^[\\s　]*", "").replaceAll("[\\s　]*$", "");
