Java regular expression to retrieve domain.tld

Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/863297/

Date: 2020-08-11 20:23:25  Source: igfitidea

Regular expression to retrieve domain.tld

Tags: java, regex

Asked by sjobe

I need a regular expression in Java that I can use to retrieve the domain.tld part from any URL. So https://foo.com/bar, http://www.foo.com#bar, and http://bar.foo.com will all return foo.com.

I wrote this regex, but it's matching the whole url

Pattern.compile("[.]?.*[.x][a-z]{2,3}");

I'm not sure I'm matching the "." character correctly. I tried "\." but I get an error from NetBeans.

Update:

The TLD is not limited to 2 or 3 characters, and http://www.foo.co.uk/bar should return foo.co.uk.

Accepted answer by idrosid

I would use the java.net.URI class to extract the host name, and then use a regex to extract the last two parts of that host name.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunIt {

    public static void main(String[] args) throws URISyntaxException {
        Pattern p = Pattern.compile(".*?([^.]+\\.[^.]+)");

        String[] urls = new String[] {
                "https://foo.com/bar",
                "http://www.foo.com#bar",
                "http://bar.foo.com"
        };

        for (String url:urls) {
            URI uri = new URI(url);
            //eg: uri.getHost() will return "www.foo.com"
            Matcher m = p.matcher(uri.getHost());
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
    }
}

Prints:

foo.com
foo.com
foo.com

Answer by jsamsa

This is harder than you might imagine. Your example https://foo.com/bar, has a comma in it, which is a valid URL character. Here is a great post about some of the troubles:

https://blog.codinghorror.com/the-problem-with-urls/

https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])

is a good starting point.

Some listings from "Mastering Regular Expressions" on this topic:

http://regex.info/listing.cgi?ed=3&p=207

@sjobe

>>> import re
>>> pattern = r'https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])'
>>> url = re.compile(pattern)
>>> url.match('http://news.google.com/').groups()
('news.google.com/',)
>>> url.match('not a url').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> url.match('http://google.com/').groups()
('google.com/',)
>>> url.match('http://google.com').groups()
('google.com',)

Sorry, the example is in Python rather than Java; it's briefer that way. Java requires some extra escaping of the regex.
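
For readers who want the same check in Java, here is a minimal sketch of the session above. The class name UrlHostSketch and the helper hostAndPath are made up for illustration. This particular pattern happens to contain no backslashes, so it can be written as a Java string literal unchanged; a pattern containing \w or \. would need each backslash doubled. Matcher.lookingAt() is used because, like Python's re.match, it anchors at the start of the input but not at the end.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlHostSketch {
    // Same pattern as in the Python session; no backslashes, so no extra escaping here.
    private static final Pattern URL = Pattern.compile(
            "https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])");

    // Returns everything after the scheme (host plus path), or null if the input
    // does not start with an http/https URL.
    static String hostAndPath(String input) {
        Matcher m = URL.matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(hostAndPath("http://news.google.com/")); // news.google.com/
        System.out.println(hostAndPath("http://google.com"));       // google.com
        System.out.println(hostAndPath("not a url"));               // null
    }
}
```

As in the Python session, the captured group still includes the path, so it only isolates the part after the scheme, not domain.tld.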

Answer by Adam Pope

You're going to need to get a list of all possible TLDs and ccTLDs and then match against them. You have to do this else you'll never be able to distinguish between subdomain.dom.com and hello.co.uk.

So, get yourself such a list. I recommend inverting it so that you store, for example, uk.co. Then you can extract the host from a URL by taking everything between // and the next / (or the end of the line). Split it at . and work backwards, matching the TLD and then one additional level to get the domain.
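
The steps above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name SuffixLookupSketch, the method registrableDomain, and the tiny hard-coded suffix set are all made up here, and a real suffix list is far longer. It also ignores user-info (user@host) and IPv6 literals.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SuffixLookupSketch {
    // Hypothetical, deliberately tiny suffix list, stored reversed ("uk.co" for "co.uk").
    private static final Set<String> REVERSED_SUFFIXES = new HashSet<>(
            Arrays.asList("com", "net", "org", "uk", "uk.co", "uk.org", "au.com"));

    static String registrableDomain(String url) {
        // Take everything after "//" (if present) up to the next '/', '?', '#' or ':'.
        int start = url.indexOf("//");
        String host = start >= 0 ? url.substring(start + 2) : url;
        host = host.split("[/?#:]", 2)[0].toLowerCase();

        String[] labels = host.split("\\.");
        // Walk the labels right to left, building the reversed name and
        // remembering the longest prefix of it that is a known suffix.
        StringBuilder reversed = new StringBuilder();
        int suffixLabels = 0;
        for (int i = labels.length - 1; i >= 0; i--) {
            if (reversed.length() > 0) {
                reversed.append('.');
            }
            reversed.append(labels[i]);
            if (REVERSED_SUFFIXES.contains(reversed.toString())) {
                suffixLabels = labels.length - i;
            }
        }
        // Unknown suffix, or the host is nothing but a suffix: no domain to report.
        if (suffixLabels == 0 || suffixLabels >= labels.length) {
            return null;
        }
        // The domain is the matched suffix plus one additional label.
        int first = labels.length - suffixLabels - 1;
        return String.join(".", Arrays.copyOfRange(labels, first, labels.length));
    }

    public static void main(String[] args) {
        System.out.println(registrableDomain("http://bar.foo.com"));       // foo.com
        System.out.println(registrableDomain("http://www.foo.co.uk/bar")); // foo.co.uk
    }
}
```

Storing the suffixes reversed means the lookup key can be built incrementally while walking the host name from its last label, which is the "work backwards" step described above.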

Answer by Qtax

If the string contains a valid URL then you could use a regex like (Perl quoting):

/^
(?:\w+:\/\/)?
[^:?#\/\s]*?

(
[^.\s]+
\.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
)

(?:[:?#\/]|$)
/xi;

Results:

url: https://foo.com/bar
matched: foo.com
url: http://www.foo.com#bar
matched: foo.com
url: http://bar.foo.com
matched: foo.com
url: ftp://foo.com
matched: foo.com
url: ftp://www.foo.co.uk?bar
matched: foo.co.uk
url: ftp://www.foo.co.uk:8080/bar
matched: foo.co.uk

For Java it would be quoted something like:

"^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au|___etc___))(?:[:?#/]|$)"

Of course you'll need to replace the ___etc___ part with the remaining multi-level suffixes you care about.
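
A minimal sketch of using that pattern from Java, assuming the ___etc___ placeholder is simply dropped and only the five two-level suffixes shown are listed. The class name DomainRegexSketch and the method domainOf are made up for illustration; Pattern.CASE_INSENSITIVE stands in for Perl's /i flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainRegexSketch {
    // Only the handful of two-level suffixes from the answer; a real list is much longer.
    private static final Pattern DOMAIN = Pattern.compile(
            "^(?:\\w+://)?[^:?#/\\s]*?"
            + "([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au))"
            + "(?:[:?#/]|$)",
            Pattern.CASE_INSENSITIVE);

    // Returns the captured domain.tld, or null if the pattern does not match.
    static String domainOf(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("http://www.foo.com#bar"));       // foo.com
        System.out.println(domainOf("ftp://www.foo.co.uk:8080/bar")); // foo.co.uk
    }
}
```

Any two-level suffix missing from the alternation falls through to the generic [a-z]{2,} branch, so with this short list www.foo.co.nz would come back as co.nz.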

Example Perl script:

use strict;

my @test = qw(
    https://foo.com/bar
    http://www.foo.com#bar
    http://bar.foo.com
    ftp://foo.com
    ftp://www.foo.co.uk?bar
    ftp://www.foo.co.uk:8080/bar
);

for(@test){
    print "url: $_\n";

    /^
    (?:\w+:\/\/)?
    [^:?#\/\s]*?

    (
    [^.\s]+
    \.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
    )

    (?:[:?#\/]|$)
    /xi;

    print "matched: $1\n";
}

Answer by Amy B

new URL(url).getHost()

No regex needed.
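
A minimal sketch of what this returns, using java.net.URI (as in the accepted answer) rather than java.net.URL; the class name HostSketch and the helper hostOf are made up for illustration. Note that getHost() gives the full host name including any subdomain, so reducing it to domain.tld still takes a further step, and that a string without a scheme yields null.

```java
import java.net.URI;
import java.net.URISyntaxException;

class HostSketch {
    // Returns the host component of the URL, or null if there is none
    // or the string cannot be parsed as a URI.
    static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("https://foo.com/bar"));      // foo.com
        System.out.println(hostOf("http://www.foo.com#bar"));   // www.foo.com
        System.out.println(hostOf("http://www.foo.co.uk/bar")); // www.foo.co.uk
        System.out.println(hostOf("foo.com"));                  // null (no scheme, parsed as a path)
    }
}
```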

Answer by mel

    /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/

Almost there, but it gives the wrong result when the second-level domain has 3 characters, as in www.foo.com: the optional trailing group then consumes .com and the whole of www.foo.com is returned instead of foo.com.
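
A small Java sketch for trying this pattern against bare host names; the class name TrailingLabelsSketch and the method tail are made up for illustration, and the backslashes are doubled for the Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingLabelsSketch {
    // The pattern from this answer, written as a Java string literal.
    private static final Pattern TAIL =
            Pattern.compile("[^.]*\\.[^.]{2,3}(?:\\.[^.]{2,3})?$");

    // Returns the matched tail of the host name, or null if nothing matches.
    static String tail(String host) {
        Matcher m = TAIL.matcher(host);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(tail("www.example.com")); // example.com
        System.out.println(tail("www.foo.co.uk"));   // foo.co.uk
        System.out.println(tail("www.foo.com"));     // www.foo.com (the problem case)
    }
}
```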

Answer by tomisyourname

This works for me:

public static String getDomain(String url){
    if(TextUtils.isEmpty(url)) return null;
    String domain = null;
    if(url.startsWith("http://")) {
        url = url.replace("http://", "").trim();
    } else if(url.startsWith("https://")) {
        url = url.replace("https://", "").trim();
    }
    String[] temp = url.split("/");
    if(temp != null && temp.length > 0) {
        domain = temp[0];
    }  
    return domain;
}

Answer by Yeongjun Kim

Code:

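import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainUrlUtils {
    private static String[] TLD = {"com", "net"}; // top-level domain
    private static String[] SLD = {"co\\.kr"}; // second-level domain

    public static String getDomainName(String url) {
        // Backslashes are doubled because the regex is written as a Java string literal.
        Pattern pattern = Pattern.compile("(?<=)[^(\\.|\\/)]\\w+\\.(" + joinTldAndSld("|") + ")$");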
        Matcher match = pattern.matcher(url);
        String domain = null;

        if (match.find()) {
            domain = match.group();
        }

        return domain;
    }

    private static String joinTldAndSld(String delimiter) {
        String t = String.join(delimiter, TLD);
        String s = String.join(delimiter, SLD);

        return new StringBuilder(t).append(s.isEmpty() ? "" : "|" + s).toString();
    }
}

Test:

public class DomainUrlUtilsTest {

    @Test
    public void getDomainName() throws Exception {
        // given
        String[][] domainUrls = {
            {
                "test.com",
                "sub1.test.com",
                "sub1.sub2.test.com",
                "https://sub1.test.com",
                "http://sub1.sub2.test.com"
            },
            {
                "https://domain.com",
                "https://sub.domain.com"
            },
            {
                "http://domain.co.kr",
                "http://sub.domain.co.kr",
                "http://local.sub.domain.co.kr",
                "http://local-test.sub.domain.co.kr",
                "sub.domain.co.kr",
                "domain.co.kr",
                "test.sub.domain.co.kr"
            }
        };

        String[] expectedUrls = {
            "test.com",
            "domain.com",
            "domain.co.kr"
        };

        // when
        // then
        for (int domainIndex = 0; domainIndex < domainUrls.length; domainIndex++) {
            for (String url : domainUrls[domainIndex]) {
                String convertedUrl = DomainUrlUtils.getDomainName(url);

                if (expectedUrls[domainIndex].equals(convertedUrl)) {
                    System.out.println(url + " -> " + convertedUrl);
                } else {
                    Assert.fail("origin Url: " + url + " / converted Url: " + convertedUrl);
                }
            }
        }
    }
}

Results:

test.com -> test.com
sub1.test.com -> test.com
sub1.sub2.test.com -> test.com
https://sub1.test.com -> test.com
http://sub1.sub2.test.com -> test.com
https://domain.com -> domain.com
https://sub.domain.com -> domain.com
http://domain.co.kr -> domain.co.kr
http://sub.domain.co.kr -> domain.co.kr
http://local.sub.domain.co.kr -> domain.co.kr
http://local-test.sub.domain.co.kr -> domain.co.kr
sub.domain.co.kr -> domain.co.kr