Java: Given a string, generate a regex that can parse *similar* strings

Note: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/776286/

Time: 2020-08-11 19:20:57  Source: igfitidea

Given a string, generate a regex that can parse *similar* strings

java regex

Asked by Yossale

For example, given the string "2009/11/12" I want to get the regex ("\d{4}/\d{2}/\d{2}"), so I'll be able to match "2001/01/02" too.

Is there something that does that? Something similar? Any ideas as to how to do it?

Accepted answer by Tomalak

There is text2re, a free web-based "regex by example" generator.

I don't think its source code is available, though. I dare say there is no automatic regex generator that gets it right without user intervention, since this would require the machine to know what you want.



Note that text2re uses a template-based, modularized and very generalized approach to regular expression generation. The expressions it generates work, but they are much more complex than the equivalent hand-crafted expression. It is not a good tool for learning regular expressions because it does a pretty lousy job of setting examples.

For instance, the string "2009/11/12" would be recognized as a yyyymmdd pattern, which is helpful. The tool transforms it into this 125 character monster:

((?:(?:[1]{1}\d{1}\d{1}\d{1})|(?:[2]{1}\d{3}))[-:\/.](?:[0]?[1-9]|[1][012])[-:\/.](?:(?:[0-2]?\d{1})|(?:[3][01]{1})))(?![\d])

The hand-made equivalent would take up merely two fifths of that (50 characters):

([12]\d{3})[-:/.](0?\d|1[0-2])[-:/.]([0-2]?\d|3[01])\b
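To try the hand-made expression from Java, the backslashes have to be doubled inside the string literal. The following is only a small sketch; the class and method names (`HandMadeDatePattern`, `isDateLike`) are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

class HandMadeDatePattern {
    // The hand-crafted expression from the answer, written as a Java string literal.
    static final Pattern DATE = Pattern.compile(
        "([12]\\d{3})[-:/.](0?\\d|1[0-2])[-:/.]([0-2]?\\d|3[01])\\b");

    // Hypothetical helper: true if the whole input matches the date pattern.
    static boolean isDateLike(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"2009/11/12", "2001/01/02", "2009.11.12", "2312/45/67"};
        for (String s : inputs) {
            System.out.println(s + " -> " + isDateLike(s));
        }
    }
}
```

Note that this expression accepts several separators and range-checks the month, which is exactly the kind of decision a generator cannot make from a single sample.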

Answer by B.E.

It's not possible to write a general solution for your problem. The trouble is that any generator probably wouldn't know what you want to check for, e.g. should "2312/45/67" be allowed too? What about "2009.11.12"?

What you could do is write such a generator yourself that is suited for your exact problem, but a general solution won't be possible.

Answer by soulmerge

No, you cannot reliably get a regex that matches what you want, since the generator would have no semantic information about the input (i.e. it would need to know it's generating a regex for dates). If the issue is with dates only, I would recommend trying multiple regular expressions and seeing whether one of them matches all of your samples.
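The "try multiple regular expressions" idea could look roughly like the sketch below. The candidate list and the names `CandidatePicker` and `firstMatchingAll` are assumptions made for this example; a real list would contain whatever date formats you expect to see.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

class CandidatePicker {
    // Hypothetical candidates; extend with the formats you actually expect.
    static final List<Pattern> CANDIDATES = List.of(
        Pattern.compile("\\d{4}/\\d{2}/\\d{2}"),
        Pattern.compile("\\d{2}/\\d{2}/\\d{4}"),
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}"));

    // Return the first candidate that fully matches every sample, if any.
    static Optional<Pattern> firstMatchingAll(List<String> samples) {
        return CANDIDATES.stream()
            .filter(p -> samples.stream().allMatch(s -> p.matcher(s).matches()))
            .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(firstMatchingAll(List.of("2009/11/12", "2001/01/02")));
        System.out.println(firstMatchingAll(List.of("2009/11/12", "12/11/2009")));
    }
}
```

If no candidate matches every sample, the result is empty, which is a useful signal that the samples mix formats.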

Answer by Cogsy

I'm not sure if this is possible, at least not without many sample strings and some learning algorithm.

There are many regexes that would match, and it's not possible for a simple algorithm to pick the 'right' one. You'd need to give it some delimiters or other things to look for, so you might as well just write the regex yourself.

Answer by MahdeTo

I don't remember the name, but if my theory of computation cells serve me right, it's impossible in theory :)

Answer by dfa

I've tried a very naive approach:

import java.util.regex.Pattern;

class RegexpGenerator {

    public static Pattern generateRegexp(String prototype) {
        return Pattern.compile(generateRegexpFrom(prototype));
    }

    // Map each character of the prototype to a character class:
    // digits become \d, letters become \w, anything else is copied as-is.
    private static String generateRegexpFrom(String prototype) {
        StringBuilder stringBuilder = new StringBuilder();

        for (int i = 0; i < prototype.length(); i++) {
            char c = prototype.charAt(i);

            if (Character.isDigit(c)) {
                stringBuilder.append("\\d");
            } else if (Character.isLetter(c)) {
                stringBuilder.append("\\w");
            } else { // fallthrough: literal (not escaped, so '.' still matches any character)
                stringBuilder.append(c);
            }
        }

        return stringBuilder.toString();
    }

    private static void test(String prototype) {
        Pattern pattern = generateRegexp(prototype);
        System.out.println(String.format("%s -> %s", prototype, pattern));

        if (!pattern.matcher(prototype).matches()) {
            throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        String[] prototypes = {
            "2009/11/12",
            "I'm a test",
            "me too!!!",
            "124.323.232.112",
            "ISBN 332212"
        };

        for (String prototype : prototypes) {
            test(prototype);
        }
    }
}

Output:

2009/11/12 -> \d\d\d\d/\d\d/\d\d
I'm a test -> \w'\w \w \w\w\w\w
me too!!! -> \w\w \w\w\w!!!
124.323.232.112 -> \d\d\d.\d\d\d.\d\d\d.\d\d\d
ISBN 332212 -> \w\w\w\w \d\d\d\d\d\d

As already outlined by others, a general solution to this problem is impossible. This class is applicable only in a few contexts.
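A small variation on the naive approach is to collapse runs of the same character class into a counted quantifier, which yields the `\d{4}/\d{2}/\d{2}` form the question asks for. This is a sketch rather than a general solution; the class name `RunLengthRegexpGenerator` and the choice to backslash-escape regex metacharacters are assumptions made for this example.

```java
import java.util.regex.Pattern;

class RunLengthRegexpGenerator {
    // Characters that need a backslash to be taken literally in a regex.
    private static final String META = "\\.[]{}()*+-?^$|";

    // Classify one character: digit, letter, or (escaped) literal.
    private static String tokenFor(char c) {
        if (Character.isDigit(c)) return "\\d";
        if (Character.isLetter(c)) return "\\w";
        return META.indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    // Emit each token once, followed by {n} when it repeats n > 1 times in a row.
    static String generate(String prototype) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < prototype.length()) {
            String token = tokenFor(prototype.charAt(i));
            int run = 1;
            while (i + run < prototype.length()
                    && tokenFor(prototype.charAt(i + run)).equals(token)) {
                run++;
            }
            out.append(token);
            if (run > 1) out.append('{').append(run).append('}');
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String p : new String[] {"2009/11/12", "124.323.232.112", "ISBN 332212"}) {
            System.out.println(p + " -> " + generate(p));
        }
    }
}
```

The same caveat applies as before: the generator only mirrors the shape of one sample and has no idea whether, say, a month of 45 should be accepted.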

Answer by Kyle Dyer

Sounds like a machine learning problem. You'll have to have more than one example on hand (many more) and an indication of whether each example is considered a match.
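The labeled examples mentioned here can at least be used to compare candidate expressions, even without any learning algorithm. The sketch below only covers that evaluation step; the class `LabeledExampleScorer`, the method `accuracy` and the sample data are invented for illustration.

```java
import java.util.Map;
import java.util.regex.Pattern;

class LabeledExampleScorer {
    // Fraction of labeled examples that the candidate classifies correctly.
    // (An empty map yields NaN; callers are assumed to pass at least one example.)
    static double accuracy(Pattern candidate, Map<String, Boolean> labeled) {
        long correct = labeled.entrySet().stream()
            .filter(e -> candidate.matcher(e.getKey()).matches() == e.getValue())
            .count();
        return (double) correct / labeled.size();
    }

    public static void main(String[] args) {
        // Hypothetical labels: true means "should match".
        Map<String, Boolean> labeled = Map.of(
            "2009/11/12", true,
            "2001/01/02", true,
            "2312/45/67", false,
            "hello", false);

        Pattern loose = Pattern.compile("\\d{4}/\\d{2}/\\d{2}");
        Pattern strict = Pattern.compile(
            "([12]\\d{3})[-:/.](0?\\d|1[0-2])[-:/.]([0-2]?\\d|3[01])\\b");

        System.out.println("loose:  " + accuracy(loose, labeled));
        System.out.println("strict: " + accuracy(strict, labeled));
    }
}
```

Negative examples are what separate the two candidates here: both accept the valid dates, but only the stricter one rejects "2312/45/67".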

Answer by Yossale

I haven't found anything that does it, but since the problem domain is relatively small (you'd be surprised how many people use the weirdest date formats), I've been able to write some kind of "date regular expression generator". Once I'm satisfied with the unit tests, I'll publish it - just in case someone ever needs something of the kind.

Thanks to everyone who answered (the guy with the (.*) excluded - jokes are great, but this one was sssssssssoooo lame :) )

Answer by pashute

Excuse me, but what you all call impossible is clearly an achievable task. It will not be able to give results for ALL examples, and maybe not the best results, but you can give it various hints, and it will make life easy. A few examples will follow.

Also a readable output translating the result would be very useful. Something like:

  • "Search for: a word starting with a non-numeric letter and ending with the string 'ing'."
  • or: "Search for: text that has bbb in it, followed somewhere by zzz"
  • or: "Search for: a pattern which looks like 'aa/bbbb/cccc' where '/' is a separator, 'aa' is two digits, 'bbbb' is a word of any length and 'cccc' is four digits between 1900 and 2020"

Maybe we could make a "back translator" with an SQL type of language to create regex, instead of creating it in geekish.

Here are a few examples that are doable:

class Hint: 
  Properties: HintType, HintString
  enum HintType { Separator, ParamDescription, NumberOfParameters }
  enum SampleType { FreeText, DateOrTime, Formatted, ... }
  public string RegexBySamples( List<T> samples, 
         List<SampleType> sampleTypes, 
         List<Hint> hints, 
         out string GeneralRegExp, out string description, 
         out string generalDescription)...

regex = RegexBySamples( {"11/November/1999", "2/January/2003"}, 
                     SampleType.DateOrTime, 
                     new HintList( HintType.NumberOfParameters, 3 ));

regex = RegexBySamples( "123-aaaaJ-1444", 
                         SampleType.Formatted, HintType.Separator, "-" );
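As a rough illustration of how a separator hint could narrow the problem, here is a Java sketch for the "123-aaaaJ-1444" sample. It is not an implementation of the API outlined above; the class `SeparatorHintGenerator` and its methods are hypothetical, and the rule "same character class, same length" for each part is an assumption made for this example.

```java
import java.util.regex.Pattern;

class SeparatorHintGenerator {
    // Describe one part between separators: all digits, all ASCII letters,
    // or (fallback) the literal text of the part itself.
    static String describePart(String part) {
        if (part.matches("\\d+")) return "\\d{" + part.length() + "}";
        if (part.matches("[A-Za-z]+")) return "[A-Za-z]{" + part.length() + "}";
        return Pattern.quote(part);
    }

    // Split the sample on the hinted separator and rebuild it as a regex.
    static String fromSample(String sample, String separator) {
        String[] parts = sample.split(Pattern.quote(separator), -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) out.append(Pattern.quote(separator));
            out.append(describePart(parts[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String regex = fromSample("123-aaaaJ-1444", "-");
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "999-bbbbK-0001"));
    }
}
```

Further hints, such as "this part may have any length", would replace the fixed `{n}` quantifier with something looser; that decision still has to come from the user.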

A GUI where you mark or enter sample text and add it to the regex would be possible too. First you mark a date (the "sample") and choose whether this text is already formatted or whether you are building a format, and also what the format type is: free text, formatted text, date, GUID, or "Choose..." from existing formats (which you can store in a library).

Let's design a spec for this, and make it open source... Anybody want to join?

Answer by pashute

Loreto pretty much does this. It's an open source implementation that uses the longest common substring(s) to generate the regular expressions. It needs multiple examples, of course.
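To give a feel for the longest-common-substring idea, here is a deliberately simplified Java sketch for two samples. It is not Loreto's actual algorithm or API; the class `CommonSubstringRegex` and the "anything, then the common part, then anything" output shape are assumptions made for this example.

```java
import java.util.regex.Pattern;

class CommonSubstringRegex {
    // Classic O(n*m) dynamic programming for the longest common substring.
    static String longestCommonSubstring(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        int endInA = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endInA = i;
                    }
                }
            }
        }
        return a.substring(endInA - best, endInA);
    }

    // Keep the shared part literal and allow anything around it.
    static String regexFor(String a, String b) {
        return ".*" + Pattern.quote(longestCommonSubstring(a, b)) + ".*";
    }

    public static void main(String[] args) {
        String regex = regexFor("error: disk full", "error: out of memory");
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "error: timeout"));
    }
}
```

If the samples share nothing, the common part is empty and the resulting expression matches any single-line input, which is why more (and more varied) examples are needed.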
