Javascript/Regex for finding just the root domain name without subdomains

Question

Asked by jamesmhaley

I had a search and found lots of similar regex examples, but not quite what I need.

I want to be able to pass in the following urls and return the results:

www.google.com returns google.com
sub.domains.are.cool.google.com returns google.com
doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com
sub.domain.google.com/no/thanks returns google.com

Hope that makes sense :) Thanks in advance! -James

Answer 1

Accepted answer by Tatham Oddie

You can't do this with a regular expression because you don't know how many blocks are in the suffix.

For example, google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks: one for the suffix and one for google.

If you apply this logic to subdomain.google.co.uk, though, you would end up with co.uk.

You will actually need to look up the suffix from a list like http://publicsuffix.org/
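
A minimal sketch of that lookup, assuming a tiny hand-picked set of suffixes standing in for the real list. The names `KNOWN_SUFFIXES` and `rootDomain` are made up for illustration, the input is assumed to have no `http://` scheme, and the real Public Suffix List also has wildcard and exception rules that this ignores:

```javascript
// Hypothetical stand-in for the Public Suffix List; the real list is far
// larger and includes wildcard and exception rules not handled here.
const KNOWN_SUFFIXES = new Set(['com', 'org', 'net', 'uk', 'co.uk']);

function rootDomain(input) {
  // Drop anything after the host name (path, query, fragment).
  const host = input.split(/[/?#]/)[0].toLowerCase();
  const labels = host.split('.');

  // Try the longest candidate suffix first so 'co.uk' wins over 'uk'.
  for (let i = 1; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (KNOWN_SUFFIXES.has(suffix)) {
      // Keep the suffix plus the one label in front of it.
      return labels.slice(i - 1).join('.');
    }
  }
  return null; // no known suffix, or nothing in front of it
}

console.log(rootDomain('www.google.com'));         // google.com
console.log(rootDomain('subdomain.google.co.uk')); // google.co.uk
```

Checking the longest candidate first is what keeps google.co.uk from collapsing to co.uk.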

Answer 2

Answer by stormsweeper

Don't use regex, use the .split() method and work from there.

var s = domain.split('.');

If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:

return s.slice(-2).join('.');

It'll make your eyes bleed less than any regex solution.
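
Putting those two pieces together, a rough sketch might look like the following. The `TWO_PART_TLDS` list and the `baseDomain` name are assumptions for illustration; you would fill the list with whatever suffixes your own data actually contains.

```javascript
// Hypothetical list of two-part TLDs relevant to a narrow use case.
const TWO_PART_TLDS = ['co.uk', 'com.au', 'co.jp'];

function baseDomain(domain) {
  // Ignore any path, e.g. 'sub.domain.google.com/no/thanks'.
  const s = domain.split('/')[0].split('.');
  // Take 3 segments when the last two form a known two-part TLD, else 2.
  const count = TWO_PART_TLDS.includes(s.slice(-2).join('.')) ? 3 : 2;
  return s.slice(-count).join('.');
}

console.log(baseDomain('sub.domains.are.cool.google.com')); // google.com
console.log(baseDomain('sub.domain.google.com/no/thanks')); // google.com
console.log(baseDomain('www.google.co.uk'));                // google.co.uk
```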

Answer 3

Answer by theraccoonbear

I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...

([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b

EDIT:

To clarify, it's looking for:

one or more alpha-numeric characters or dashes, followed by a literal dot

and then one of three things...

three or more alpha characters (e.g. com/net/mil/coop, etc.)
two alpha characters, followed by a literal dot, followed by two more alphas (e.g. co.uk)
two alpha characters (e.g. us/uk/to, etc.)

and at the end of that, a word boundary (\b), meaning the end of the string, a space, or a non-word character (in regex, word characters are typically alphanumerics and underscore).

As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
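
When trying the pattern, one thing worth checking is where the match lands: run unanchored against a full host name, the first match can be a leading pair of labels rather than the trailing ones. The sketch below assumes one possible adjustment, a lookahead `(?=\/|$)` that requires the match to end at a slash or at the end of the string. That lookahead is not part of the original pattern, and `rootByRegex` is a name made up here.

```javascript
// The pattern from above (with [A-Za-z] in every branch), as written.
const original = /([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b/;

// Assumed variant: same pattern, but the match must be followed by '/' or end of input.
const anchored = /([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?=\/|$)/;

function rootByRegex(re, input) {
  const m = re.exec(input);
  return m ? m[1] : null;
}

console.log(rootByRegex(original, 'www.google.com'));                  // www.google
console.log(rootByRegex(anchored, 'www.google.com'));                  // google.com
console.log(rootByRegex(anchored, 'sub.domain.google.com/no/thanks')); // google.com
console.log(rootByRegex(anchored, 'subdomain.google.co.uk'));          // google.co.uk
```

Even anchored, the pattern only guesses at the shape of a suffix, so it remains a starting point rather than a complete answer.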

Answer 4

Answer by Gajus

If you have a limited subset of data, I suggest keeping the regex simple, e.g.

(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))

This will match:

www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com

In my case, I know that all relevant URLs will be matched using this regex.

Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.

Answer 5

Answer by Emeka

Without testing the validity of the top-level domain, I'm using an adaptation of stormsweeper's solution:

domain = 'sub.domains.are.cool.google.com'

s = domain.split('.')

tld = s.slice(-2..-1).join('.')
