Java regex to replace all \n in a String, but not those inside [code] [/code] tags

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/328387/

Time: 2020-10-29 11:59:02  Source: igfitidea

Regex to replace all \n in a String, but not those inside [code] [/code] tags

Tags: java, regex

Asked by Matías

I need help replacing all \n (newline) characters with <br /> in a String, but not those \n inside [code][/code] tags. My brain is burning, I can't solve this on my own :(

Example:

test test test
test test test
test
test

[code]some
test
code
[/code]

more text

Should be:

test test test<br />
test test test<br />
test<br />
test<br />
<br />
[code]some
test
code
[/code]<br />
<br />
more text<br />

Thanks for your time. Best regards.

Accepted answer by strager

I would suggest a (simple) parser, and not a regular expression. Something like this (bad pseudocode):

stack elementStack;

foreach(char in string) {
    if(string-from-char == "[code]") {
        elementStack.push("code");
        string-from-char = "";
    }

    if(string-from-char == "[/code]") {
        elementStack.popTo("code");
        string-from-char = "";
    }

    if(char == "\n" && !elementStack.contains("code")) {
        char = "<br/>\n";
    }
}
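One possible way to turn the pseudocode above into Java is sketched below. It is only an illustration: the class name `CodeAwareBreaks`, the method name `convert`, and the choice of `<br />` as the inserted markup (taken from the question's expected output) are assumptions, and a plain depth counter stands in for the element stack.

```java
final class CodeAwareBreaks {
    private static final String OPEN = "[code]";
    private static final String CLOSE = "[/code]";

    /** Inserts "<br />" before every newline that lies outside [code]...[/code]. */
    static String convert(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        int depth = 0; // number of currently open [code] tags (hypothetical stand-in for the stack)
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(OPEN, i)) {
                depth++;
                out.append(OPEN);
                i += OPEN.length();
            } else if (input.startsWith(CLOSE, i)) {
                if (depth > 0) {
                    depth--; // a stray closing tag is copied through but otherwise ignored
                }
                out.append(CLOSE);
                i += CLOSE.length();
            } else {
                char c = input.charAt(i++);
                if (c == '\n' && depth == 0) {
                    out.append("<br />");
                }
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because the counter tracks depth, nested [code] blocks are left untouched until the outermost [/code] is seen.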

Answer by dmckee --- ex-moderator kitten

You've tagged the question regex, but this may not be the best tool for the job.

You might be better off using basic compiler-building techniques (i.e. a lexer feeding a simple state machine parser).

Your lexer would identify five tokens: ("[code]", '\n', "[/code]", EOF, :all other strings:) and your state machine looks like:

state    token    action
------------------------
begin    :none:   --> out
out      [code]   OUTPUT(token), --> in
out      \n       OUTPUT(break), OUTPUT(token)
out      *        OUTPUT(token)
in       [/code]  OUTPUT(token), --> out
in       *        OUTPUT(token)
*        EOF      --> end
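The table above could be sketched in Java roughly as follows. This is an assumption-laden illustration, not the answerer's code: the class name `BreakStateMachine`, the regex-based lexer, and `<br />` as the meaning of OUTPUT(break) are all choices made here. Text between two special tokens is treated as the "all other strings" token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BreakStateMachine {
    private enum State { OUT, IN }

    // Lexer: the three special tokens. Anything between two matches is ordinary text.
    private static final Pattern TOKEN = Pattern.compile("\\[code\\]|\\[/code\\]|\n");

    static String convert(String input) {
        StringBuilder out = new StringBuilder();
        State state = State.OUT;
        int last = 0;
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.append(input, last, m.start()); // "*" rows: OUTPUT(token) for ordinary text
            String token = m.group();
            switch (state) {
                case OUT:
                    if (token.equals("[code]")) {
                        state = State.IN;          // out, [code]  --> in
                    } else if (token.equals("\n")) {
                        out.append("<br />");      // out, \n: OUTPUT(break) first
                    }
                    break;
                case IN:
                    if (token.equals("[/code]")) {
                        state = State.OUT;         // in, [/code]  --> out
                    }
                    break;
            }
            out.append(token);                     // every row outputs the token itself
            last = m.end();
        }
        out.append(input, last, input.length());   // remaining text before EOF
        return out.toString();
    }
}
```

As in the table, a second [code] seen while already in the "in" state is copied through without any state change, so nesting is not tracked.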

EDIT: I see other posters discussing the possible need for nesting the blocks. This state machine won't handle that. For nesting blocks, use a recursive descent parser (not quite so simple, but still easy enough and extensible).

EDIT: Axeman notes that this design excludes the use of "[/code]" in the code. An escape mechanism can be used to get around this. Something like: add '\' to your tokens and add:

state    token    action
------------------------
in       \        -->esc-in
esc-in   *        OUTPUT(token), -->in
out      \        -->esc-out
esc-out  *        OUTPUT(token), -->out

to the state machine.
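A rough Java sketch of the machine with the two escape states added might look like this. The class name `EscapingBreakStateMachine`, the token regex, and the use of `<br />` for the break are assumptions made for illustration. Following the extra rows, the backslash itself is not output, so a literal backslash has to be written as two backslashes, and a trailing backslash at the end of the input is dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapingBreakStateMachine {
    private enum State { OUT, IN, ESC_OUT, ESC_IN }

    // Tokens: "[code]", "[/code]", a backslash, a newline, a run of ordinary
    // characters, or a lone '[' that does not start a tag.
    private static final Pattern TOKEN =
        Pattern.compile("\\[code\\]|\\[/code\\]|\\\\|\n|[^\\\\\\[\n]+|\\[");

    static String convert(String input) {
        StringBuilder out = new StringBuilder();
        State state = State.OUT;
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String token = m.group();
            switch (state) {
                case OUT:
                    if (token.equals("\\")) { state = State.ESC_OUT; continue; }
                    if (token.equals("[code]")) {
                        state = State.IN;
                    } else if (token.equals("\n")) {
                        out.append("<br />");
                    }
                    break;
                case IN:
                    if (token.equals("\\")) { state = State.ESC_IN; continue; }
                    if (token.equals("[/code]")) {
                        state = State.OUT;
                    }
                    break;
                case ESC_OUT:
                    state = State.OUT; // escaped token is copied verbatim
                    break;
                case ESC_IN:
                    state = State.IN;  // escaped token is copied verbatim
                    break;
            }
            out.append(token);
        }
        return out.toString();
    }
}
```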

The usual arguments in favor of machine generated lexers and parsers apply.

Answer by cletus

This seems to do it:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripOutsideCode {
    // Text to remove outside [code] blocks (runs of asterisks in this demo).
    private final static String PATTERN = "\\*+";

    public static void main(String args[]) {
        // group(1): text up to the next tag; group(2): the [code] or [/code] tag itself
        Pattern p = Pattern.compile("(.*?)(\\[/?code\\])", Pattern.DOTALL);
        String s = "test 1 ** [code]test 2**blah[/code] test3 ** blah [code] test * 4 [code] test 5 * [/code] * test 6[/code] asdf **";
        Matcher m = p.matcher(s);
        // Before Java 9, appendReplacement/appendTail accept only a StringBuffer, not a StringBuilder.
        StringBuffer sb = new StringBuffer();
        int codeDepth = 0;
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the text from being read as group references
            if (codeDepth == 0) {
                m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1).replaceAll(PATTERN, "")));
            } else {
                m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1)));
            }
            if (m.group(2).equals("[code]")) {
                codeDepth++;
            } else {
                codeDepth--;
            }
            sb.append(m.group(2));
        }
        // Handle the text after the last tag.
        if (codeDepth == 0) {
            StringBuffer sb2 = new StringBuffer();
            m.appendTail(sb2);
            sb.append(sb2.toString().replaceAll(PATTERN, ""));
        } else {
            m.appendTail(sb);
        }
        System.out.printf("Original: %s%n", s);
        System.out.printf("Processed: %s%n", sb);
    }
}

It's not a straightforward regex, but I don't think you can do what you want with a straightforward regex, not when handling nested elements and so forth.

Answer by shsmurfy

As mentioned by other posters, regular expressions are not the best tool for the job because their quantifiers are greedy by default in almost every implementation. This means that if you tried to match code blocks using something like:

(\[code\].*\[/code\])

Then the expression will match everything from the first [code] tag to the last [/code] tag, which is clearly not what you want. While there are ways to get around this, the resulting regular expressions are usually brittle, unintuitive, and downright ugly. Something like the following Python code would work much better.

class ParseException(Exception):
    pass

def add_brs(text):
    return text.replace('\n', '<br/>\n')

def convert(source):
    output = []
    # the first block will *not* have a matching [/code] tag
    blocks = source.split('[code]')
    output.append(add_brs(blocks[0]))
    # for all the rest of the blocks, only add <br/> tags to
    # the segment after the [/code] tag
    for block in blocks[1:]:
        parts = block.split('[/code]')
        if len(parts) != 2:
            raise ParseException('Too many or too few [/code] tags')
        # the segment in the code block is pre, everything
        # after is post
        pre, post = parts
        # split() drops the tags themselves, so put them back
        output.append('[code]' + pre + '[/code]')
        output.append(add_brs(post))
    # finally join all the processed segments together
    return "".join(output)

Note the above code was not tested; it's just a rough idea of what you'll need to do.
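The greedy matching described in this answer can be observed with a small Java check. The class and method names (`GreedyVsReluctant`, `firstMatch`) and the sample string are made up for illustration; the second pattern uses the reluctant quantifier `.*?`, which is one of the workarounds the answer alludes to.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    /** Returns the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "a [code]x[/code] b [code]y[/code] c";
        // Greedy: runs from the first [code] to the last [/code].
        System.out.println(firstMatch("\\[code\\].*\\[/code\\]", s));
        // Reluctant: stops at the first [/code] it can reach.
        System.out.println(firstMatch("\\[code\\].*?\\[/code\\]", s));
    }
}
```

The reluctant form isolates a single block, but it still cannot pair up nested tags, which is why the other answers lean toward counting or parsing.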

Answer by PhiLho

It is hard because, while regexes are good at finding something, they are not so good at matching everything except something... So you have to use a loop; I doubt you can do that in one go.

After searching, I found something close to cletus's solution, except that I assumed code blocks cannot be nested, which leads to simpler code. Choose whichever suits your needs.

import java.util.regex.*;

class Test
{
  static final String testString = "foo\nbar\n[code]\nprint'';\nprint{'c'};\n[/code]\nbar\nfoo";
  static final String replaceString = "<br>\n";
  public static void main(String args[])
  {
    // Backslashes must be doubled inside a Java string literal.
    Pattern p = Pattern.compile("(.+?)(\\[code\\].*?\\[/code\\])?", Pattern.DOTALL);
    Matcher m = p.matcher(testString);
    StringBuilder result = new StringBuilder();
    while (m.find()) 
    {
      result.append(m.group(1).replaceAll("\n", replaceString));
      if (m.group(2) != null)
      {
        result.append(m.group(2));
      }
    }
    System.out.println(result.toString());
  }
}

This is only a crude quick test; you need more cases (null, empty string, no code tag, multiple code blocks, etc.).

Answer by noah

To get it right, you really need to make three passes:

  1. Find [code] blocks and replace them with a unique token + index (saving the original block), e.g., "foo [code]abc[/code] bar[code]efg[/code]" becomes "foo TOKEN-1 barTOKEN-2"
  2. Do your newline replacement.
  3. Scan for escape tokens and restore the original block.

The code looks something* like:

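// Pass 1: replace each [code] block with a placeholder and remember the block.
Matcher m = escapePattern.matcher(input);
while(m.find()) {
    String key = nextKey(); // must return digits only, to match TOKEN-(\d+) below
    escaped.put(key,m.group());
    m.appendReplacement(output1,"TOKEN-"+key);
}
m.appendTail(output1);
// Pass 2: replace the newlines that remain.
Matcher m2 = newlinePattern.matcher(output1);
while(m2.find()) {
    m2.appendReplacement(output2,newlineReplacement);
}
m2.appendTail(output2);
// Pass 3: put the saved blocks back (quoted, so '$' and '\' in code survive).
Matcher m3 = Pattern.compile("TOKEN-(\\d+)").matcher(output2);
while(m3.find()) {
    m3.appendReplacement(finalOutput,Matcher.quoteReplacement(escaped.get(m3.group(1))));
}
m3.appendTail(finalOutput);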

That's the quick and dirty way. There are more efficient ways (others have mentioned parser/lexers), but unless you're processing millions of lines and your code is CPU bound (rather than I/O bound, like most webapps) and you've confirmed with a profiler that this is the bottleneck, they probably aren't worth it.

* I haven't run it, this is all from memory. Just check the API and you'll be able to work it out.
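For reference, the three passes can be put together into a self-contained Java sketch along these lines. Everything specific here is an assumption for illustration: the class name `ThreePassBreaks`, the non-nested block pattern, the `<br />` replacement, and the placeholder format (a prefix that is lengthened until it does not occur in the input, followed by an index and a `;` terminator so that digits following a block cannot be mistaken for part of the index).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreePassBreaks {
    // Assumes code blocks are not nested.
    private static final Pattern CODE_BLOCK =
        Pattern.compile("\\[code\\].*?\\[/code\\]", Pattern.DOTALL);

    static String convert(String input) {
        // Choose a placeholder prefix that cannot clash with the real text.
        String prefix = "TOKEN-";
        while (input.contains(prefix)) {
            prefix = "X" + prefix;
        }

        // Pass 1: swap each code block for a numbered placeholder.
        List<String> saved = new ArrayList<>();
        StringBuffer pass1 = new StringBuffer();
        Matcher m = CODE_BLOCK.matcher(input);
        while (m.find()) {
            saved.add(m.group());
            String placeholder = prefix + (saved.size() - 1) + ";";
            m.appendReplacement(pass1, Matcher.quoteReplacement(placeholder));
        }
        m.appendTail(pass1);

        // Pass 2: newline replacement on the text that is left.
        String pass2 = pass1.toString().replace("\n", "<br />\n");

        // Pass 3: restore the saved blocks.
        StringBuffer result = new StringBuffer();
        Matcher r = Pattern.compile(Pattern.quote(prefix) + "(\\d+);").matcher(pass2);
        while (r.find()) {
            String block = saved.get(Integer.parseInt(r.group(1)));
            r.appendReplacement(result, Matcher.quoteReplacement(block));
        }
        r.appendTail(result);
        return result.toString();
    }
}
```

`StringBuffer` is used so the sketch also compiles on Java versions older than 9, where `appendReplacement` has no `StringBuilder` overload.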
