Java regex with escaped and unescaped delimiter

Question

Asked by lstipakov

question related to this

I have a string

a\;b\\;c;d

which in Java looks like

String s = "a\\;b\\\\;c;d";

I need to split it on semicolons, using the following rules:

If a semicolon is preceded by a backslash, it should not be treated as a separator (between a and b).
If the backslash itself is escaped and therefore does not escape the semicolon, that semicolon should be a separator (between b and c).

So a semicolon should be treated as a separator if it is preceded by zero or an even number of backslashes.
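
The rule can be checked directly by counting the backslashes immediately before a semicolon. The following is only a minimal sketch of that rule; the class name SeparatorRule and the helper isSeparator are hypothetical names, not part of the question.

```java
class SeparatorRule {
    /** Returns true if s has a ';' at index that is preceded by zero or an even number of backslashes. */
    static boolean isSeparator(String s, int index) {
        int backslashes = 0;
        // Walk left from the semicolon while the characters are backslashes.
        for (int i = index - 1; i >= 0 && s.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return s.charAt(index) == ';' && backslashes % 2 == 0;
    }

    public static void main(String[] args) {
        String s = "a\\;b\\\\;c;d"; // the characters a\;b\\;c;d
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ';') {
                System.out.println(i + " -> " + isSeparator(s, i));
            }
        }
    }
}
```

For the sample string, the semicolon at index 2 is escaped, while those at indexes 6 and 8 are separators.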

For the example above, I want to get the following strings (in Java source each backslash must be doubled):

a\;b\\
c
d

Answer 1

Accepted answer by Tim Pietzcker

You can use the regex

(?:\\.|[^;\\]++)*

to match all text between unescaped semicolons:

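// Needs java.util.ArrayList, java.util.List and java.util.regex.{Matcher, Pattern, PatternSyntaxException}
List<String> matchList = new ArrayList<String>();
try {
    // Each regex backslash is doubled again inside a Java string literal.
    Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        matchList.add(regexMatcher.group());
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}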

Explanation:

(?:        # Match either...
 \\.       # any escaped character
|          # or...
 [^;\\]++  # any character(s) except semicolon or backslash; possessive match
)*         # Repeat any number of times.

The possessive match (++) is important to avoid catastrophic backtracking because of the nested quantifiers.
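
As a self-contained sketch of this approach applied to the string from the question: because the pattern can match the empty string, find() also reports zero-length matches at each separator and at the end of the input. The version here simply skips them, which is an assumption on my part and also discards genuinely empty fields. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchBetweenSeparators {
    // The regex (?:\\.|[^;\\]++)* written as a Java string literal.
    private static final Pattern FIELD = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");

    static List<String> fields(String subject) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(subject);
        while (m.find()) {
            // Skip the zero-length matches reported at separators and at end of input.
            if (!m.group().isEmpty()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [a\;b\\, c, d]
        System.out.println(fields("a\\;b\\\\;c;d"));
    }
}
```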

Answer 2

Answered by FailedDev

String[] splitArray = subjectString.split("(?<!(?<!\\\\)\\\\);");

This should work.

Explanation:

// (?<!(?<!\\)\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\)\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
//       Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»

So you just match the semicolons not preceded by exactly one \.
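
A small runnable sketch of this split, using the string from the question plus one extra input that I added to show where a two-character lookbehind stops being enough. The class name LookbehindSplit is hypothetical.

```java
import java.util.Arrays;

class LookbehindSplit {
    // The regex (?<!(?<!\\)\\); written as a Java string literal.
    static String[] split(String subject) {
        return subject.split("(?<!(?<!\\\\)\\\\);");
    }

    public static void main(String[] args) {
        // a\;b\\;c;d splits into [a\;b\\, c, d]
        System.out.println(Arrays.toString(split("a\\;b\\\\;c;d")));
        // x\\\;y has three backslashes, so the semicolon is escaped under the
        // question's rule, yet this pattern still splits there (two pieces).
        System.out.println(Arrays.toString(split("x\\\\\\;y")));
    }
}
```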

EDIT:

String[] splitArray = subjectString.split("(?<!(?<!\\\\(\\\\\\\\){0,2000000})\\\\);");

This will take care of any odd number of \. It will of course fail if you have more than 4000000 \ characters. Explanation of the edited answer:

// (?<!(?<!\\(\\\\){0,2000000})\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\(\\\\){0,2000000})\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\(\\\\){0,2000000})»
//       Match the character “\” literally «\\»
//       Match the regular expression below and capture its match into backreference number 1 «(\\\\){0,2000000}»
//          Between zero and 2000000 times, as many times as possible, giving back as needed (greedy) «{0,2000000}»
//          Note: You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «{0,2000000}»
//          Match the character “\” literally «\\»
//          Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»

Answer 3

Answered by hochl

I do not trust regular expressions of any kind to detect these cases. I usually write a simple loop for such things. I'll sketch it in C, since it has been ages since I last touched Java ;-)

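#include <stdio.h>
#include <string.h>

/* Print the offset of every unescaped ';' in myString. */
void findSeparators(const char *myString) {
    size_t i, len = strlen(myString);
    int state = 0; /* 1 while the previous character was an unescaped backslash */

    for (i = 0; i < len; i++) {
        char c = myString[i];
        if (state == 0) {
            if (c == '\\') {
                state++; /* the next character is escaped */
            } else if (c == ';') {
                printf("; at offset %zu\n", i);
            }
        } else {
            state--; /* skip the escaped character */
        }
    }
}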
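
Since the question is about Java, the same loop can be carried over as a splitter. This is a sketch under my own assumptions: it collects the fields instead of printing offsets, and it keeps the backslashes in the output, as in the expected result of the question. The names EscapeAwareSplitter and split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareSplitter {
    static List<String> split(String input) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true when the previous character was an unescaped backslash

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (escaped) {
                escaped = false;                 // c is escaped: keep it as ordinary text
                current.append(c);
            } else if (c == '\\') {
                escaped = true;                  // the next character is escaped
                current.append(c);
            } else if (c == ';') {
                fields.add(current.toString());  // unescaped separator: close the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());          // the last field
        return fields;
    }

    public static void main(String[] args) {
        // Prints [a\;b\\, c, d]
        System.out.println(split("a\\;b\\\\;c;d"));
    }
}
```

Unlike the regex variants, this keeps empty fields (for example between two adjacent semicolons) and has no limit on the number of backslashes.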

The advantages are:

you can execute semantic actions on each step.
it's quite easy to port it to another language.
you don't need to include the complete regex library just for this simple task, which adds to portability.
it should be a lot faster than the regular expression matcher.

Answer 4

Answered by krico

This approach assumes that your string does not contain the char '\0'. If it does, you can use some other char as the placeholder.

public static String[] split(String s) {
    // Temporarily replace each escaped semicolon with the placeholder '\0' so it survives the split.
    String[] result = s.replaceAll("([^\\\\])\\\\;", "$1\0").split(";");
    for (int i = 0; i < result.length; i++) {
        // Put the escaped semicolons back.
        result[i] = result[i].replaceAll("\0", "\\\\;");
    }
    return result;
}

Answer 5

Answered by Rasoul

I think this is the real answer. In my case I am trying to split on |, and the escape character is &.

final String regx = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";
String[] res = "&|aa|aa|&|&&&|&&|s||||e|".split(regx);
System.out.println(Arrays.toString(res));

In this code I am using a lookbehind to handle the escape character &. Note that the lookbehind must have a bounded maximum length.

(?<!((?:[^&]|^)(&&){0,10000}&))\\|

This means any | except those that follow ((?:[^&]|^)(&&){0,10000}&), and that part matches any odd number of &s. The part (?:[^&]|^) is important to make sure that you are counting all of the &s behind the |, back to the beginning of the string or to some other character.
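
The same idea can be transferred to the delimiters from the question, with \ as the escape character and ; as the separator. This is my own adaptation rather than code from the answer, so treat it as a sketch; the class name OddEscapeSplit is hypothetical and the 10000 bound is kept from the answer.

```java
import java.util.Arrays;

class OddEscapeSplit {
    // Regex (?<!(?:[^\\]|^)(?:\\\\){0,10000}\\); as a Java string literal:
    // a ';' that is not preceded by an odd-length run of backslashes.
    private static final String SEPARATOR = "(?<!(?:[^\\\\]|^)(?:\\\\\\\\){0,10000}\\\\);";

    static String[] split(String subject) {
        return subject.split(SEPARATOR);
    }

    public static void main(String[] args) {
        // Prints [a\;b\\, c, d]
        System.out.println(Arrays.toString(split("a\\;b\\\\;c;d")));
    }
}
```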
