How to unescape a Java string literal in Java?

Question

Asked by ziggystar

I'm processing some Java source code using Java. I'm extracting the string literals and feeding them to a function taking a String. The problem is that I need to pass the unescaped version of the String to the function (i.e. this means converting \n to a newline, and \\ to a single \, etc.).

Is there a function inside the Java API that does this? If not, can I obtain such functionality from some library? Obviously the Java compiler has to do this conversion.

In case anyone wants to know, I'm trying to un-obfuscate string literals in decompiled obfuscated Java files.

Answer 1

Accepted answer by tchrist

The Problem

The org.apache.commons.lang.StringEscapeUtils.unescapeJava() given here as another answer is really very little help at all.

It forgets about \0 for null.
It doesn't handle octal at all.
It can't handle the sorts of escapes admitted by java.util.regex.Pattern.compile() and everything that uses it, including \a, \e, and especially \cX.
It has no support for logical Unicode code points by number, only for UTF-16.
This looks like UCS-2 code, not UTF-16 code: they use the older charAt interface instead of the codePoint interface, thus promulgating the delusion that a Java char is guaranteed to hold a Unicode character. It's not. They only get away with this because no UTF-16 surrogate will ever be mistaken for anything they're looking for.
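
To make the last point concrete, here is a minimal sketch (not taken from the answer) of walking a String by code point rather than by char. The class and method names are hypothetical and only used for illustration.

```java
// Illustrative sketch: CodePointDemo and codePointsOf are hypothetical names.
class CodePointDemo {

    /** Returns the code points of s, one int per logical character. */
    static int[] codePointsOf(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // 2 for supplementary characters, else 1
        }
        return out;
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(s.length());             // counts chars (UTF-16 code units)
        System.out.println(codePointsOf(s).length); // counts code points
    }
}
```

A char-based loop over the same string would see the two surrogates as separate items, which is the mismatch the answer complains about.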

The Solution

I wrote a string unescaper which solves the OP's question without all the irritations of the Apache code.

/*
 *
 * unescape_perl_string()
 *
 *      Tom Christiansen <[email protected]>
 *      Sun Nov 28 12:55:24 MST 2010
 *
 * It's completely ridiculous that there's no standard
 * unescape_java_string function.  Since I have to do the
 * damn thing myself, I might as well make it halfway useful
 * by supporting things Java was too stupid to consider in
 * strings:
 * 
 *   => "?" items  are additions to Java string escapes
 *                 but normal in Java regexes
 *
 *   => "!" items  are also additions to Java regex escapes
 *   
 * Standard singletons: ?\a ?\e \f \n \r \t
 * 
 *      NB: \b is unsupported as backspace so it can pass-through
 *          to the regex translator untouched; I refuse to make anyone
 *          doublebackslash it as doublebackslashing is a Java idiocy
 *          I desperately wish would die out.  There are plenty of
 *          other ways to write it:
 *
 *              \cH, \10, \010, \x08 \x{8}, \u0008, \U00000008
 *
 * Octal escapes: \0 \0N \0NN \N \NN \NNN
 *    Can range up to !\777 not \377
 *    
 *      TODO: add !\o{NNNNN}
 *          last Unicode is 4177777
 *          maxint is 37777777777
 *
 * Control chars: ?\cX
 *      Means: ord(X) ^ ord('@')
 *
 * Old hex escapes: \xXX
 *      unbraced must be 2 xdigits
 *
 * Perl hex escapes: !\x{XXX} braced may be 1-8 xdigits
 *       NB: proper Unicode never needs more than 6, as highest
 *           valid codepoint is 0x10FFFF, not maxint 0xFFFFFFFF
 *
 * Lame Java escape: \[IDIOT JAVA PREPROCESSOR]uXXXX must be
 *                   exactly 4 xdigits;
 *
 *       I can't write XXXX in this comment where it belongs
 *       because the damned Java Preprocessor can't mind its
 *       own business.  Idiots!
 *
 * Lame Python escape: !\UXXXXXXXX must be exactly 8 xdigits
 * 
 * TODO: Perl translation escapes: \Q \U \L \E \[IDIOT JAVA PREPROCESSOR]u \l
 *       These are not so important to cover if you're passing the
 *       result to Pattern.compile(), since it handles them for you
 *       further downstream.  Hm, what about \[IDIOT JAVA PREPROCESSOR]u?
 *
 */

public final static
String unescape_perl_string(String oldstr) {

    /*
     * In contrast to fixing Java's broken regex charclasses,
     * this one need be no bigger, as unescaping shrinks the string
     * here, where in the other one, it grows it.
     */

    StringBuffer newstr = new StringBuffer(oldstr.length());

    boolean saw_backslash = false;

    for (int i = 0; i < oldstr.length(); i++) {
        int cp = oldstr.codePointAt(i);
        if (oldstr.codePointAt(i) > Character.MAX_VALUE) {
            i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
        }

        if (!saw_backslash) {
            if (cp == '\\') {
                saw_backslash = true;
            } else {
                newstr.append(Character.toChars(cp));
            }
            continue; /* switch */
        }

        if (cp == '\\') {
            saw_backslash = false;
            newstr.append('\\');
            newstr.append('\\');
            continue; /* switch */
        }

        switch (cp) {

            case 'r':  newstr.append('\r');
                       break; /* switch */

            case 'n':  newstr.append('\n');
                       break; /* switch */

            case 'f':  newstr.append('\f');
                       break; /* switch */

            /* PASS a \b THROUGH!! */
            case 'b':  newstr.append("\\b");
                       break; /* switch */

            case 't':  newstr.append('\t');
                       break; /* switch */

            case 'a':  newstr.append('\007');
                       break; /* switch */

            case 'e':  newstr.append('\033');
                       break; /* switch */

            /*
             * A "control" character is what you get when you xor its
             * codepoint with '@'==64.  This only makes sense for ASCII,
             * and may not yield a "control" character after all.
             *
             * Strange but true: "\c{" is ";", "\c}" is "=", etc.
             */
            case 'c':   {
                if (++i == oldstr.length()) { die("trailing \\c"); }
                cp = oldstr.codePointAt(i);
                /*
                 * don't need to grok surrogates, as next line blows them up
                 */
                if (cp > 0x7f) { die("expected ASCII after \\c"); }
                newstr.append(Character.toChars(cp ^ 64));
                break; /* switch */
            }

            case '8':
            case '9': die("illegal octal digit");
                      /* NOTREACHED */

    /*
     * may be 0 to 2 octal digits following this one
     * so back up one for fallthrough to next case;
     * unread this digit and fall through to next case.
     */
            case '1':
            case '2':
            case '3':
            case '4':
            case '5':
            case '6':
            case '7': --i;
                      /* FALLTHROUGH */

            /*
             * Can have 0, 1, or 2 octal digits following a 0
             * this permits larger values than octal 377, up to
             * octal 777.
             */
            case '0': {
                if (i+1 == oldstr.length()) {
                    /* found \0 at end of string */
                    newstr.append(Character.toChars(0));
                    break; /* switch */
                }
                i++;
                int digits = 0;
                int j;
                for (j = 0; j <= 2; j++) {
                    if (i+j == oldstr.length()) {
                        break; /* for */
                    }
                    /* safe because will unread surrogate */
                    int ch = oldstr.charAt(i+j);
                    if (ch < '0' || ch > '7') {
                        break; /* for */
                    }
                    digits++;
                }
                if (digits == 0) {
                    --i;
                    newstr.append('\0');
                    break; /* switch */
                }
                int value = 0;
                try {
                    value = Integer.parseInt(
                                oldstr.substring(i, i+digits), 8);
                } catch (NumberFormatException nfe) {
                    die("invalid octal value for \\0 escape");
                }
                newstr.append(Character.toChars(value));
                i += digits-1;
                break; /* switch */
            } /* end case '0' */

            case 'x':  {
                if (i+2 > oldstr.length()) {
                    die("string too short for \\x escape");
                }
                i++;
                boolean saw_brace = false;
                if (oldstr.charAt(i) == '{') {
                        /* ^^^^^^ ok to ignore surrogates here */
                    i++;
                    saw_brace = true;
                }
                int j;
                for (j = 0; j < 8; j++) {

                    if (!saw_brace && j == 2) {
                        break;  /* for */
                    }

                    /*
                     * ASCII test also catches surrogates
                     */
                    int ch = oldstr.charAt(i+j);
                    if (ch > 127) {
                        die("illegal non-ASCII hex digit in \\x escape");
                    }

                    if (saw_brace && ch == '}') { break; /* for */ }

                    if (! ( (ch >= '0' && ch <= '9')
                                ||
                            (ch >= 'a' && ch <= 'f')
                                ||
                            (ch >= 'A' && ch <= 'F')
                          )
                       )
                    {
                        die(String.format(
                            "illegal hex digit #%d '%c' in \\x", ch, ch));
                    }

                }
                if (j == 0) { die("empty braces in \\x{} escape"); }
                int value = 0;
                try {
                    value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\x escape");
                }
                newstr.append(Character.toChars(value));
                if (saw_brace) { j++; }
                i += j-1;
                break; /* switch */
            }

            case 'u': {
                if (i+4 > oldstr.length()) {
                    die("string too short for \\u escape");
                }
                i++;
                int j;
                for (j = 0; j < 4; j++) {
                    /* this also handles the surrogate issue */
                    if (oldstr.charAt(i+j) > 127) {
                        die("illegal non-ASCII hex digit in \\u escape");
                    }
                }
                int value = 0;
                try {
                    value = Integer.parseInt( oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\u escape");
                }
                newstr.append(Character.toChars(value));
                i += j-1;
                break; /* switch */
            }

            case 'U': {
                if (i+8 > oldstr.length()) {
                    die("string too short for \\U escape");
                }
                i++;
                int j;
                for (j = 0; j < 8; j++) {
                    /* this also handles the surrogate issue */
                    if (oldstr.charAt(i+j) > 127) {
                        die("illegal non-ASCII hex digit in \\U escape");
                    }
                }
                int value = 0;
                try {
                    value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\U escape");
                }
                newstr.append(Character.toChars(value));
                i += j-1;
                break; /* switch */
            }

            default:   newstr.append('\\');
                       newstr.append(Character.toChars(cp));
           /*
            * say(String.format(
            *       "DEFAULT unrecognized escape %c passed through",
            *       cp));
            */
                       break; /* switch */

        }
        saw_backslash = false;
    }

    /* weird to leave one at the end */
    if (saw_backslash) {
        newstr.append('\\');
    }

    return newstr.toString();
}

/*
 * Return a string "U+XX.XXX.XXXX" etc, where each XX set is the
 * xdigits of the logical Unicode code point. No bloody brain-damaged
 * UTF-16 surrogate crap, just true logical characters.
 */
 public final static
 String uniplus(String s) {
     if (s.length() == 0) {
         return "";
     }
     /* This is just the minimum; sb will grow as needed. */
     StringBuffer sb = new StringBuffer(2 + 3 * s.length());
     sb.append("U+");
     for (int i = 0; i < s.length(); i++) {
         sb.append(String.format("%X", s.codePointAt(i)));
         if (s.codePointAt(i) > Character.MAX_VALUE) {
             i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
         }
         if (i+1 < s.length()) {
             sb.append(".");
         }
     }
     return sb.toString();
 }

private static final
void die(String foa) {
    throw new IllegalArgumentException(foa);
}

private static final
void say(String what) {
    System.out.println(what);
}

If it helps others, you're welcome to it, no strings attached. If you improve it, I'd love for you to mail me your enhancements, but you certainly don't have to.
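
The \cX rule documented in the code above (XOR the ASCII code of X with '@', i.e. 64) can be checked in isolation. The following is only a sketch of that one rule, not part of the answer's code; ControlEscapeDemo and control are hypothetical names.

```java
// Illustrative sketch of the \cX rule only; names are hypothetical.
class ControlEscapeDemo {

    /** Applies the rule from the comments above: XOR the ASCII code of x with '@' (64). */
    static int control(char x) {
        if (x > 0x7f) {
            throw new IllegalArgumentException("expected ASCII after \\c");
        }
        return x ^ '@';
    }

    public static void main(String[] args) {
        System.out.println(control('H'));        // code of backspace
        System.out.println((char) control('{')); // not a control character at all
    }
}
```

As the answer's comment notes, the result is only a real control character when X is in the range '@' through '_'; for other ASCII input the XOR simply yields another printable character.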

Answer 2

Answered by Lasse Espeholt

See this from http://commons.apache.org/lang/:

StringEscapeUtils

StringEscapeUtils.unescapeJava(String str)

Answer 3

Answered by polygenelubricants

You can use the String unescapeJava(String) method of StringEscapeUtils from Apache Commons Lang.

Here's an example snippet:

    String in = "a\\tb\\n\\\"c\\\"";

    System.out.println(in);
    // a\tb\n\"c\"

    String out = StringEscapeUtils.unescapeJava(in);

    System.out.println(out);
    // a    b
    // "c"

The utility class has methods to escape and unescape strings for Java, JavaScript, HTML, XML, and SQL. It also has overloads that write directly to a java.io.Writer.

Caveats

It looks like StringEscapeUtils handles Unicode escapes with one u, but not octal escapes, or Unicode escapes with extraneous u's.

    /* Unicode escape test #1: PASS */

    System.out.println(
        "\u0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\u0030")
    ); // 0
    System.out.println(
        "\u0030".equals(StringEscapeUtils.unescapeJava("\\u0030"))
    ); // true

    /* Octal escape test: FAIL */

    System.out.println(
        "\45"
    ); // %
    System.out.println(
        StringEscapeUtils.unescapeJava("\\45")
    ); // 45
    System.out.println(
        "\45".equals(StringEscapeUtils.unescapeJava("\\45"))
    ); // false

    /* Unicode escape test #2: FAIL */

    System.out.println(
        "\uu0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\uu0030")
    ); // throws NestableRuntimeException:
       //   Unable to parse unicode value: u003

A quote from the JLS:

Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.

If your string can contain octal escapes, you may want to convert them to Unicode escapes first, or use another approach.
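
One way such a conversion could look is sketched below. It is an assumption-laden illustration, not code from this answer: OctalToUnicode and convert are hypothetical names, the input is assumed to be the raw text between the quotes of a string literal, and Unicode escapes that themselves spell a backslash are not considered.

```java
// Illustrative sketch; OctalToUnicode and convert are hypothetical names.
class OctalToUnicode {

    /** Rewrites octal escapes (one to three digits, at most 377) as four-digit Unicode escapes. */
    static String convert(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 == s.length()) {
                sb.append(c);
                i++;
                continue;
            }
            char next = s.charAt(i + 1);
            if (next < '0' || next > '7') {
                // Some other escape, including an escaped backslash: copy both chars.
                sb.append(c).append(next);
                i += 2;
                continue;
            }
            // Three digits are allowed only when the first digit is 0-3.
            int maxDigits = (next <= '3') ? 3 : 2;
            int end = i + 1;
            while (end < s.length() && end - (i + 1) < maxDigits
                    && s.charAt(end) >= '0' && s.charAt(end) <= '7') {
                end++;
            }
            int value = Integer.parseInt(s.substring(i + 1, end), 8);
            sb.append(String.format("\\u%04x", value));
            i = end;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("\\45")); // the octal escape for '%'
    }
}
```

The output of convert could then be handed to an unescaper that only understands Unicode escapes.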

The extraneous u is also documented as follows:

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a compiler for the Java programming language and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

If your string can contain Unicode escapes with extraneous u, then you may also need to preprocess this before using StringEscapeUtils.
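
A possible preprocessing step is sketched here, under the assumption that the goal is simply to reduce every run of u's to a single u before calling the library. UnicodeMarkerCollapser and collapse are hypothetical names. The sketch relies on the JLS rule that a backslash can start a Unicode escape only when it is preceded by an even number of backslashes.

```java
// Illustrative sketch; UnicodeMarkerCollapser and collapse are hypothetical names.
class UnicodeMarkerCollapser {

    /** Collapses the run of 'u' characters in each Unicode escape to a single 'u'. */
    static String collapse(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\') {
                sb.append(c);
                i++;
                continue;
            }
            // Copy the whole run of consecutive backslashes starting here.
            int run = i;
            while (run < s.length() && s.charAt(run) == '\\') {
                run++;
            }
            int count = run - i;
            sb.append(s, i, run);
            i = run;
            // Only an odd-length run leaves a backslash free to start a Unicode escape.
            if (count % 2 == 1 && i < s.length() && s.charAt(i) == 'u') {
                sb.append('u');
                while (i < s.length() && s.charAt(i) == 'u') {
                    i++;
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("\\uu0030")); // escape with one extraneous u
    }
}
```

An escaped backslash followed by the letter u is left alone, because the backslash run before the u has even length.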

Alternatively you can try to write your own Java string literal unescaper from scratch, making sure to follow the exact JLS specifications.

Answer 4

Answered by Ashwin Jayaprakash

If you are reading Unicode-escaped chars from a file, then you will have a tough time doing that, because the string will be read literally, with the backslash of each escape kept as an ordinary character:

my_file.txt

Blah blah...
Column delimiter=;
Word delimiter=\u0020 #This is just unicode for whitespace

.. more stuff

Here, when you read line 3 from the file the string/line will have:

"Word delimiter=\u0020 #This is just unicode for whitespace"

and the char[] in the string will show:

{...., '=', '\\', 'u', '0', '0', '2', '0', ' ', '#', 't', 'h', ...}

Commons StringUnescape will not unescape this for you (I tried unescapeXml()). You'll have to do it manually as described here.

So, the sub-string "\u0020" should become a single char '\u0020'.

But if you are using this "\u0020" to do String.split("... ..... ..", columnDelimiterReadFromFile), which really uses regex internally, it will work directly, because the string read from the file is still escaped and is perfect to use in the regex pattern!! (Confused?)
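
A small sketch of that last point, assuming the delimiter was read from the file as the six literal characters backslash, u, 0, 0, 2, 0. DelimiterDemo and splitOn are hypothetical names.

```java
import java.util.Arrays;

// Illustrative sketch; DelimiterDemo and splitOn are hypothetical names.
class DelimiterDemo {

    /** Splits line on a delimiter given in its escaped, as-read-from-file form. */
    static String[] splitOn(String line, String delimiterFromFile) {
        // String.split treats its argument as a regex, and the regex engine
        // itself understands the backslash-u-hhhh notation.
        return line.split(delimiterFromFile);
    }

    public static void main(String[] args) {
        String delimiterFromFile = "\\u0020"; // six chars, not a single space
        System.out.println(Arrays.toString(splitOn("one two three", delimiterFromFile)));
    }
}
```

This works only because the consumer is a regex; any API that expects the actual delimiter character would still need the escape decoded first.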

Answer 5

Answered by Chad Retz

I'm a little late on this, but I thought I'd provide my solution since I needed the same functionality. I decided to use the Java Compiler API, which makes it slower but makes the results accurate. Basically, I create a class on the fly and then return the results. Here is the method:

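// Needs imports from javax.tools, java.lang.reflect, java.io, java.net.URI and java.util.Collections.
public static String[] unescapeJavaStrings(String... escaped) {
    //class name
    final String className = "Temp" + System.currentTimeMillis();
    //build the source
    final StringBuilder source = new StringBuilder(100 + escaped.length * 20).
            append("public class ").append(className).append("{\n").
            append("\tpublic static String[] getStrings() {\n").
            append("\t\treturn new String[] {\n");
    for (String string : escaped) {
        source.append("\t\t\t\"");
        //we escape non-escaped quotes here to be safe 
        //  (but something like \" will fail, oh well for now)
        for (int i = 0; i < string.length(); i++) {
            char chr = string.charAt(i);
            if (chr == '"' && i > 0 && string.charAt(i - 1) != '\\') {
                source.append('\\');
            }
            source.append(chr);
        }
        source.append("\",\n");
    }
    source.append("\t\t};\n\t}\n}\n");
    //obtain compiler
    final JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    //local stream for output
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    //local stream for error
    ByteArrayOutputStream err = new ByteArrayOutputStream();
    //source file
    JavaFileObject sourceFile = new SimpleJavaFileObject(
            URI.create("string:///" + className + Kind.SOURCE.extension), Kind.SOURCE) {
        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) throws IOException {
            return source;
        }
    };
    //target file
    final JavaFileObject targetFile = new SimpleJavaFileObject(
            URI.create("string:///" + className + Kind.CLASS.extension), Kind.CLASS) {
        @Override
        public OutputStream openOutputStream() throws IOException {
            return out;
        }
    };
    //file manager proxy, with most parts delegated to the standard one 
    JavaFileManager fileManagerProxy = (JavaFileManager) Proxy.newProxyInstance(
            StringUtils.class.getClassLoader(), new Class[] { JavaFileManager.class },
            new InvocationHandler() {
                //standard file manager to delegate to
                private final JavaFileManager standard = 
                    compiler.getStandardFileManager(null, null, null); 
                @Override
                public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                    if ("getJavaFileForOutput".equals(method.getName())) {
                        //return the target file when it's asking for output
                        return targetFile;
                    } else {
                        return method.invoke(standard, args);
                    }
                }
            });
    //create the task
    CompilationTask task = compiler.getTask(new OutputStreamWriter(err), 
            fileManagerProxy, null, null, null, Collections.singleton(sourceFile));
    //call it
    if (!task.call()) {
        throw new RuntimeException("Compilation failed, output:\n" + 
                new String(err.toByteArray()));
    }
    //get the result
    final byte[] bytes = out.toByteArray();
    //load class
    Class<?> clazz;
    try {
        //custom class loader for garbage collection
        clazz = new ClassLoader() { 
            protected Class<?> findClass(String name) throws ClassNotFoundException {
                if (name.equals(className)) {
                    return defineClass(className, bytes, 0, bytes.length);
                } else {
                    return super.findClass(name);
                }
            }
        }.loadClass(className);
    } catch (ClassNotFoundException e) {
        throw new RuntimeException(e);
    }
    //reflectively call method
    try {
        return (String[]) clazz.getDeclaredMethod("getStrings").invoke(null);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}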

It takes an array so you can unescape in batches. So the following simple test succeeds:

public static void main(String[] meh) {
    if ("1\02\03\n".equals(unescapeJavaStrings("1\\02\\03\\n")[0])) {
        System.out.println("Success");
    } else {
        System.out.println("Failure");
    }
}

Answer 6

Answered by Nathan Ryan

I came across the same problem, but I wasn't enamoured by any of the solutions I found here. So, I wrote one that iterates over the characters of the string using a matcher to find and replace the escape sequences. This solution assumes properly formatted input. That is, it happily skips over nonsensical escapes, and it decodes Unicode escapes for line feed and carriage return (which otherwise cannot appear in a character literal or a string literal, due to the definition of such literals and the order of translation phases for Java source). Apologies, the code is a bit packed for brevity.

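import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Decoder {

    // The encoded character of each character escape.
    // This array functions as the keys of a sorted map, from encoded characters to decoded characters.
    static final char[] ENCODED_ESCAPES = { '\"', '\'', '\\',  'b',  'f',  'n',  'r',  't' };

    // The decoded character of each character escape.
    // This array functions as the values of a sorted map, from encoded characters to decoded characters.
    static final char[] DECODED_ESCAPES = { '\"', '\'', '\\', '\b', '\f', '\n', '\r', '\t' };

    // A pattern that matches an escape.
    // What follows the escape indicator is captured by group 1=character 2=octal 3=Unicode.
    static final Pattern PATTERN = Pattern.compile("\\\\(?:(b|t|n|f|r|\\\"|\\\'|\\\\)|((?:[0-3]?[0-7])?[0-7])|u+(\\p{XDigit}{4}))");

    public static CharSequence decodeString(CharSequence encodedString) {
        Matcher matcher = PATTERN.matcher(encodedString);
        StringBuffer decodedString = new StringBuffer();
        // Find each escape of the encoded string in succession.
        while (matcher.find()) {
            char ch;
            if (matcher.start(1) >= 0) {
                // Decode a character escape.
                ch = DECODED_ESCAPES[Arrays.binarySearch(ENCODED_ESCAPES, matcher.group(1).charAt(0))];
            } else if (matcher.start(2) >= 0) {
                // Decode an octal escape.
                ch = (char)(Integer.parseInt(matcher.group(2), 8));
            } else /* if (matcher.start(3) >= 0) */ {
                // Decode a Unicode escape.
                ch = (char)(Integer.parseInt(matcher.group(3), 16));
            }
            // Replace the escape with the decoded character.
            matcher.appendReplacement(decodedString, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        // Append the remainder of the encoded string to the decoded string.
        // The remainder is the longest suffix of the encoded string such that the suffix contains no escapes.
        matcher.appendTail(decodedString);
        return decodedString;
    }

    public static void main(String... args) {
        System.out.println(decodeString(args[0]));
    }
}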

I should note that Apache Commons Lang3 doesn't seem to suffer the weaknesses indicated in the accepted solution. That is, StringEscapeUtils seems to handle octal escapes and multiple u characters of Unicode escapes. That means unless you have some burning reason to avoid Apache Commons, you should probably use it rather than my solution (or any other solution here).

Answer 7

Answered by DaoWen

I know this question is old, but I wanted a solution that doesn't involve libraries outside those included in JRE6 (i.e. Apache Commons is not acceptable), and I came up with a simple solution using the built-in java.io.StreamTokenizer:

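import java.io.*;

// ...

String literal = "\"Has \\\"\\\\\\t\\\" & isn\\'t \\r\\n on 1 line.\"";
StreamTokenizer parser = new StreamTokenizer(new StringReader(literal));
String result;
try {
  parser.nextToken();
  if (parser.ttype == '"') {
    result = parser.sval;
  }
  else {
    result = "ERROR!";
  }
}
catch (IOException e) {
  result = e.toString();
}
System.out.println(result);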

Output:

Has "\  " & isn't
 on 1 line.

Answer 8

Answered by Udo Klimaschewski

I came across a similar problem, wasn't satisfied with the presented solutions either, and implemented this one myself.

Also available as a Gist on GitHub:

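/**
 * Unescapes a string that contains standard Java escape sequences.
 * <ul>
 * <li><strong>&#92;b &#92;f &#92;n &#92;r &#92;t &#92;" &#92;'</strong> :
 * BS, FF, NL, CR, TAB, double and single quote.</li>
 * <li><strong>&#92;X &#92;XX &#92;XXX</strong> : Octal character
 * specification (0 - 377, 0x00 - 0xFF).</li>
 * <li><strong>&#92;uXXXX</strong> : Hexadecimal based Unicode character.</li>
 * </ul>
 * 
 * @param st
 *            A string optionally containing standard java escape sequences.
 * @return The translated string.
 */
public String unescapeJavaString(String st) {

    StringBuilder sb = new StringBuilder(st.length());

    for (int i = 0; i < st.length(); i++) {
        char ch = st.charAt(i);
        if (ch == '\\') {
            char nextChar = (i == st.length() - 1) ? '\\' : st
                    .charAt(i + 1);
            // Octal escape?
            if (nextChar >= '0' && nextChar <= '7') {
                String code = "" + nextChar;
                i++;
                if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                        && st.charAt(i + 1) <= '7') {
                    code += st.charAt(i + 1);
                    i++;
                    if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                            && st.charAt(i + 1) <= '7') {
                        code += st.charAt(i + 1);
                        i++;
                    }
                }
                sb.append((char) Integer.parseInt(code, 8));
                continue;
            }
            switch (nextChar) {
            case '\\':
                ch = '\\';
                break;
            case 'b':
                ch = '\b';
                break;
            case 'f':
                ch = '\f';
                break;
            case 'n':
                ch = '\n';
                break;
            case 'r':
                ch = '\r';
                break;
            case 't':
                ch = '\t';
                break;
            case '\"':
                ch = '\"';
                break;
            case '\'':
                ch = '\'';
                break;
            // Hex Unicode: u????
            case 'u':
                if (i >= st.length() - 5) {
                    ch = 'u';
                    break;
                }
                int code = Integer.parseInt(
                        "" + st.charAt(i + 2) + st.charAt(i + 3)
                                + st.charAt(i + 4) + st.charAt(i + 5), 16);
                sb.append(Character.toChars(code));
                i += 5;
                continue;
            }
            i++;
        }
        sb.append(ch);
    }
    return sb.toString();
}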

Answer 9

Answered by Tvaroh

For the record, if you use Scala, you can do:

StringContext.treatEscapes(escaped)

Answer 10

Answered by Jens Piegsa

org.apache.commons.lang3.StringEscapeUtils from commons-lang3 is marked deprecated now. You can use org.apache.commons.text.StringEscapeUtils#unescapeJava(String) instead. It requires an additional Maven dependency:

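        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.4</version>
        </dependency>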

and seems to handle some more special cases; for example, it unescapes:

escaped backslashes, single and double quotes
escaped octal and unicode values
\\b, \\n, \\t, \\f, \\r
