JavaScript regular expression - extracting the subdomain and domain

Question

Asked by sunilkumarba

I'm trying to form a regular expression (javascript/node.js) which will extract the sub-domain & domain part from any given URL. This is what I ended up with:

[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)

Right now, I'm only considering http and https as protocols, and I want to exclude the "www." portion from the subdomain+domain part of a URL. I checked the expression and it almost works. But here is the issue:

Success

'http://mplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

'http://lplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

Failure

'http://play.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

'http://tplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

I just use the first element of the result array. I can't understand why "play." and "tplay." don't work. Could anyone please help me with this?

Do "/p" and "/t" have any special meaning to the regular expression engine?

Is there any other way of extracting sub-domain & domain from any given URL using a regular expression?

Edit -

Example:

https://play.google.com/store/apps/details?id=com.skgames.trafficracer => play.google.com

https://mail.google.com/mail/u/0/#inbox => mail.google.com

Answer 1

Answered by anubhava

Your regex doesn't seem correct. Try this regex:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/img
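A minimal usage sketch for the pattern above. The helper name extractHost and the example.com URL are assumptions made for illustration. The g flag is dropped here because, with g set, String.prototype.match returns only the full matches and discards the capture groups.

```javascript
// Same pattern as in the answer, minus the g flag so match() exposes group 1.
const hostPattern = /^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/im;

// Hypothetical helper: returns the captured host, or null when nothing matches.
function extractHost(url) {
  const m = url.match(hostPattern);
  return m ? m[1] : null;
}

console.log(extractHost('http://play.google.co.in/sadfask/asdkfals?dk=10')); // play.google.co.in
console.log(extractHost('https://mail.google.com/mail/u/0/#inbox'));         // mail.google.com
console.log(extractHost('http://www.example.com:8080/path'));                // example.com
```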

Answer 2

Answered by sunilkumarba

You are about the one millionth person to try to parse URLs in JavaScript. I'm a little bit surprised you didn't see any of the existing questions on SO dating back years. The last thing you want to do is write yet another broken regexp, with all due respect to those that provided answers to your question.

There are many well-documented libraries and approaches for handling this. Google it. The simplest way is to create an <a> element in memory, assign it an href, and then access its hostname and other properties. See http://tutorialzine.com/2013/07/quick-tip-parse-urls/. If that does not float your boat, then use a library like uri.js.

If you really don't want to use a library, and insist on reinventing the wheel, then at least do something like the following:

function get_domain_from_url(url) {
    var a = document.createElement('a');
    a.setAttribute('href', url);
    return a.hostname;
}

Essentially, you are delegating the extraction of the subdomain/domain part of the URL to the browser's URL parsing logic, which is MUCH better than anything you will ever write.
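Since the question also mentions node.js, where document is not available, the same delegation idea can be sketched with the built-in URL constructor, which exists in both modern browsers and Node. The helper name getHostname and the choice to strip a leading "www." are assumptions for illustration, not part of the answer above.

```javascript
// Hypothetical helper: lets the platform's URL parser do the work.
function getHostname(url) {
  try {
    const { hostname } = new URL(url);
    // Assumption: the caller wants "www." removed, as in the question.
    return hostname.replace(/^www\./i, '');
  } catch (err) {
    // new URL() throws a TypeError for strings that are not absolute URLs.
    return null;
  }
}

console.log(getHostname('https://play.google.com/store/apps/details?id=com.skgames.trafficracer')); // play.google.com
console.log(getHostname('https://mail.google.com/mail/u/0/#inbox')); // mail.google.com
console.log(getHostname('not a url')); // null
```

Note that new URL() only accepts absolute URLs unless a base is passed as a second argument, so protocol-relative input like //google.com returns null in this sketch.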

Also see Parse URL with jquery/ javascript?, Parse URL with Javascript, How do I parse a URL into hostname and path in javascript?, or parse URL with JavaScript or jQuery. How did you miss those? Sorry, I have to vote to close this as a duplicate.

Answer 3

Answered by Nicu Surdu

The same RegExp as in anubhava's answer, with added support for protocol-relative URLs like //google.com:

/^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im
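A quick sketch of how this variant behaves, using a hypothetical helper named hostOf. The example.com input is made up for illustration.

```javascript
// Pattern copied from the answer above; scheme and "//" are each optional.
const relPattern = /^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im;

// Hypothetical helper: returns capture group 1, or null when nothing matches.
function hostOf(url) {
  const m = url.match(relPattern);
  return m ? m[1] : null;
}

console.log(hostOf('//google.com'));                      // google.com
console.log(hostOf('https://www.google.com/search?q=1')); // google.com
console.log(hostOf('http://example.com?x=1'));            // example.com?x=1
```

The last call shows one difference from anubhava's pattern: this character class does not exclude "?", so a query string that directly follows the host (with no "/" in between) ends up in the capture.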

Answer 4

Answered by Ashoka Lella

Here's a solution ignoring everything before ://

.*\://?([^\/]+)

In case you also want to ignore "www.":

.*\://(?:www\.)?([^\/]+)
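A small sketch of the "www." variant as a JavaScript literal, with the dot after www escaped so it matches a literal period only. The helper name afterScheme and the sample URLs are assumptions for illustration.

```javascript
// Everything up to "://" is skipped, an optional "www." is dropped,
// and group 1 captures up to the next "/".
const schemePattern = /.*:\/\/(?:www\.)?([^\/]+)/;

// Hypothetical helper: returns capture group 1, or null when nothing matches.
function afterScheme(url) {
  const m = url.match(schemePattern);
  return m ? m[1] : null;
}

console.log(afterScheme('https://www.google.com/mail'));       // google.com
console.log(afterScheme('ftp://files.example.org/pub'));       // files.example.org
console.log(afterScheme('http://a.com/?next=http://b.com/x')); // b.com
```

The last call shows a caveat: the leading .* is greedy, so when "://" appears more than once the capture starts after the last occurrence. A lazy .*? at the start would pick the first one instead.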

Answer 5

Answered by Academia

Your regex works pretty well. You only need to remove the square brackets, so that the leading ^ anchors the pattern instead of negating a character class. The final expression is:

^(?:http:\/\/|www\.|https:\/\/)([^\/]+)

Hope it's useful!
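To make the difference concrete, here is a small comparison of the question's pattern and the bracket-free one. Inside [^...] the group syntax is not interpreted; it is just a negated set of single characters (h, t, p, s, w, :, /, ., and so on), which is why hosts starting with "p" or "t" lose their first letters. The variable names are made up for this sketch.

```javascript
// Pattern from the question: the [^...] part is a negated character class.
const original = /[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i;
// Pattern from this answer: a real non-capturing group anchored at the start.
const fixed = /^(?:http:\/\/|www\.|https:\/\/)([^\/]+)/i;

const url = 'http://play.google.co.in/sadfask/asdkfals?dk=10';

// "p" is in the excluded set, so the match only starts at "l".
console.log(url.match(original)[0]); // lay.google.co.in
// With the group, the scheme is consumed and group 1 holds the host.
console.log(url.match(fixed)[1]);    // play.google.co.in
```

One caveat with the bracket-free version: the alternation consumes only one prefix, so for a URL such as http://www.example.com the capture would still include "www.".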
