Original source: http://stackoverflow.com/questions/6674230/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How would you parse a url in Ruby to get the main domain?
Asked by Justin Meltzer
I want to be able to parse any URL with Ruby to get the main part of the domain without the www (just the XXXX.com).
Answered by Simone Carletti
Please note that there is no algorithmic method for finding the highest level at which a domain may be registered under a particular top-level domain (the policies differ with each registry). The only method is to maintain a list of all top-level domains and the levels at which domains can be registered.
This is the reason why the Public Suffix List exists.
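To illustrate why a list is needed, here is a minimal sketch of a list-based lookup. The `SAMPLE_SUFFIXES` set is a tiny hand-picked sample for illustration, not the real Public Suffix List, and `registrable_domain` is a hypothetical helper name. A real implementation would also have to handle the list's wildcard and exception rules.

```ruby
require 'set'

# Tiny illustrative sample; the real Public Suffix List is far larger.
SAMPLE_SUFFIXES = Set.new(%w[com uk co.uk fr com.fr]).freeze

# Returns the longest known suffix plus one more label, or nil when the
# host has no known suffix or is itself just a suffix.
def registrable_domain(host)
  labels = host.downcase.split('.')
  # Dropping labels from the left tries the longest candidate suffix first.
  (0...labels.size).each do |i|
    next unless SAMPLE_SUFFIXES.include?(labels[i..-1].join('.'))
    return i.zero? ? nil : labels[(i - 1)..-1].join('.')
  end
  nil
end

p registrable_domain('www.google.co.uk')   # => "google.co.uk"
p registrable_domain('toolbar.google.com') # => "google.com"
p registrable_domain('co.uk')              # => nil
```

Checking the longest suffix first is what keeps `google.co.uk` from being cut down to `co.uk`.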
I'm the author of PublicSuffix, a Ruby library that decomposes a domain into the different parts.
Here's an example:
require 'uri/http'
require 'public_suffix'
uri = URI.parse("http://toolbar.google.com")
domain = PublicSuffix.parse(uri.host)
# => "toolbar.google.com"
domain.domain
# => "google.com"
uri = URI.parse("http://www.google.co.uk")
domain = PublicSuffix.parse(uri.host)
# => "www.google.co.uk"
domain.domain
# => "google.co.uk"
Answered by Mischa
This should work with pretty much any URL:
# URL always gets parsed twice
def get_host_without_www(url)
  url = "http://#{url}" if URI.parse(url).scheme.nil?
  host = URI.parse(url).host.downcase
  host.start_with?('www.') ? host[4..-1] : host
end
Or:
# Only parses twice if url doesn't start with a scheme
def get_host_without_www(url)
  uri = URI.parse(url)
  uri = URI.parse("http://#{url}") if uri.scheme.nil?
  host = uri.host.downcase
  host.start_with?('www.') ? host[4..-1] : host
end
You may have to require 'uri'.
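Both versions raise on input that URI cannot parse (such as a string containing spaces) or in which it finds no host. A minimal sketch of a more defensive variant follows. The method name `safe_host_without_www` and the choice to return nil on failure are assumptions for illustration, not part of the answer.

```ruby
require 'uri'

# Hypothetical defensive variant: returns nil instead of raising.
def safe_host_without_www(url)
  uri = URI.parse(url)
  uri = URI.parse("http://#{url}") if uri.scheme.nil?
  host = uri.host
  return nil if host.nil? || host.empty?

  # Strip a single leading "www." after normalizing case.
  host.downcase.sub(/\Awww\./, '')
rescue URI::InvalidURIError
  nil
end

p safe_host_without_www('www.Example.com/path') # => "example.com"
p safe_host_without_www('https://example.com')  # => "example.com"
p safe_host_without_www('not a url')            # => nil
```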
Answered by nlsrchtr
Just a short note: to avoid the second parse of the URL in Mischa's second example, you could use a string comparison instead of URI.parse.
# Only parses once
def get_host_without_www(url)
  # Check the full scheme prefix so that hosts such as "httpbin.org" still get one
  url = "http://#{url}" unless url.start_with?('http://', 'https://')
  uri = URI.parse(url)
  host = uri.host.downcase
  host.start_with?('www.') ? host[4..-1] : host
end
The downside of this approach is that it limits the input to http(s)-based URLs, which are by far the most common. If you want to use it more generally (e.g. for ftp links), you have to adjust it accordingly.
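One way to lift that restriction while still parsing only once is to test for any scheme prefix rather than a literal http one. The following is a sketch under that assumption. The `SCHEME_PREFIX` pattern and the method name are illustrative, and scheme-less input is still assumed to be HTTP.

```ruby
require 'uri'

# Matches "scheme://" at the start of the string (scheme syntax per RFC 3986).
SCHEME_PREFIX = %r{\A[a-z][a-z0-9+.\-]*://}i

def host_without_www_any_scheme(url)
  # Only prepend a default scheme when none is present.
  url = "http://#{url}" unless url.match?(SCHEME_PREFIX)
  host = URI.parse(url).host.downcase
  host.start_with?('www.') ? host[4..-1] : host
end

p host_without_www_any_scheme('ftp://www.example.com/file.txt') # => "example.com"
p host_without_www_any_scheme('www.example.com')                # => "example.com"
```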
Answered by Sam
Addressable is probably the right answer in 2018, especially since it uses the PublicSuffix gem to parse domains.
However, I need to do this kind of parsing in multiple places, from various data sources, and found it a bit verbose to use repeatedly. So I created a wrapper around it, Adomain:
require 'adomain'
Adomain["https://toolbar.google.com"]
# => "toolbar.google.com"
Adomain["https://www.google.com"]
# => "google.com"
Adomain["stackoverflow.com"]
# => "stackoverflow.com"
I hope this helps others.
Answered by pguardiario
Here's one that works better with .co.uk and .com.fr-type domains:
domain = uri.host[/[^.\s\/]+\.([a-z]{3,}|([a-z]{2}|com)\.[a-z]{2})$/]
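The snippet assumes `uri` is an already parsed URI. Below is a short sketch of how the pattern behaves, with hosts chosen for illustration and a hypothetical wrapper named `main_domain`. The pattern accepts a label followed by either a TLD of three or more letters or a two-part suffix such as `co.uk` or `com.fr`, so a host with a bare two-letter TLD falls outside it and yields nil.

```ruby
require 'uri'

DOMAIN_PATTERN = /[^.\s\/]+\.([a-z]{3,}|([a-z]{2}|com)\.[a-z]{2})$/

def main_domain(url)
  # The pattern is lowercase-only, so downcase the host first.
  URI.parse(url).host.downcase[DOMAIN_PATTERN]
end

p main_domain('http://www.google.co.uk')    # => "google.co.uk"
p main_domain('http://toolbar.google.com')  # => "google.com"
p main_domain('http://shop.example.com.fr') # => "example.com.fr"
p main_domain('http://www.example.fr')      # => nil
```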
Answered by Daniel Antonio Nuñez Carhuayo
Well, you can write this method:
require 'uri'
def domain_name(url, arg={:with_dot_principal=>false})
  arg[:with_dot_principal] ? URI(url).hostname.split('.').last(2).join('.') : URI(url).hostname.split('.').last(2).first
end
And use it like this:
domain_name("https://www.google.com/?gws_rd=ssl&safe=active&ssui=on")
# => "google"
domain_name("http://google.com", with_dot_principal: true)
# => "google.com"
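The same idea can be sketched with a keyword argument instead of an options hash. The name `domain_name_kw` and the `with_tld` parameter are hypothetical. Because the method keeps only the last two labels of the host, a multi-level suffix such as `co.uk` is returned in place of the registered domain.

```ruby
require 'uri'

# Hypothetical keyword-argument variant of the method above.
def domain_name_kw(url, with_tld: false)
  labels = URI(url).hostname.split('.').last(2)
  with_tld ? labels.join('.') : labels.first
end

p domain_name_kw('https://www.google.com/?gws_rd=ssl')      # => "google"
p domain_name_kw('http://google.com', with_tld: true)       # => "google.com"
p domain_name_kw('http://www.google.co.uk', with_tld: true) # => "co.uk"
```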
Answered by Tudor Constantin
If the URL is in the format http://www.google.com, then you could do something like:
a = 'http://www.google.com'
puts a.split(/\./)[1] + '.' + a.split(/\./)[2]
Or:
a =~ /http:\/\/www\.(.*?)$/
puts $1
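Both snippets assume exactly the `http://www.` prefix. Here is a sketch of a slightly more tolerant pattern, assuming only that the scheme is http or https and that `www.` may be absent. `HOST_PATTERN` and `host_after_www` are hypothetical names.

```ruby
# Captures the host after an optional "www.", stopping at the first
# "/", ":", "?" or "#".
HOST_PATTERN = %r{\Ahttps?://(?:www\.)?([^/:?#]+)}i

def host_after_www(url)
  # String#[] with a capture index returns that group, or nil on no match.
  url[HOST_PATTERN, 1]
end

p host_after_www('http://www.google.com')         # => "google.com"
p host_after_www('https://google.com/search?q=1') # => "google.com"
p host_after_www('ftp://google.com')              # => nil
```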

