Get domain name (not subdomain) in PHP
Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/2679618/
Asked by Cyclone
I have a URL which can be any of the following formats:
http://example.com
https://example.com
http://example.com/foo
http://example.com/foo/bar
www.example.com
example.com
foo.example.com
www.foo.example.com
foo.bar.example.com
http://foo.bar.example.com/foo/bar
example.net/foo/bar
Essentially, I need to be able to match any normal URL. How can I extract example.com (or .net, or whatever the TLD happens to be; I need this to work with any TLD) from all of these via a single regex?
Answered by Tyler Carter
Well, you can use parse_url to get the host:
$info = parse_url($url);
$host = $info['host'];
Then, you can do some fancy stuff to get only the TLD and the Host
$host_names = explode(".", $host);
$bottom_host_name = $host_names[count($host_names)-2] . "." . $host_names[count($host_names)-1];
Not very elegant, but should work.
If you want an explanation, here it goes:
First we grab everything after the scheme (http://, etc.), by using parse_url's capabilities to... well... parse URLs. :)
Then we take the host name and separate it into an array based on where the periods fall, so test.world.hello.myname would become:
array("test", "world", "hello", "myname");
After that, we take the number of elements in the array (4).
Then, we subtract 2 from it to get the second to last string (the hostname, or example, in your example)
Then, we subtract 1 from it to get the last string (because array keys start at 0), also known as the TLD
Then we combine those two parts with a period, and you have your base host name.
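For a quick end-to-end check, the two snippets above can be combined like this (a minimal sketch; the sample URL and the expected output in the comment are my own, and note that this approach misreports two-level suffixes such as co.uk, which is addressed in later answers):
$url = 'http://foo.bar.example.com/foo/bar';
$info = parse_url($url);
$host_names = explode(".", $info['host']);
$bottom_host_name = $host_names[count($host_names)-2] . "." . $host_names[count($host_names)-1];
echo $bottom_host_name; // example.com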
Answered by pocesar
My solution is at https://gist.github.com/pocesar/5366899
and the tests are here http://codepad.viper-7.com/GAh1tP
It works with any TLD, and hideous subdomain patterns (up to 3 subdomains).
There's a test included with many domain names.
I won't paste the function here because of the weird indentation for code on StackOverflow (it could have fenced code blocks like GitHub).
Answered by mgutt
It is not possible to get the domain name without comparing against a TLD list, as there exist many cases with exactly the same structure and length:
- www.db.de (Subdomain) versus bbc.co.uk (Domain)
- big.uk.com (SLD) versus www.uk.com (TLD)
Mozilla's public suffix list should be the best option as it is used by all major browsers:
https://publicsuffix.org/list/public_suffix_list.dat
Feel free to use my function:
function tld_list($cache_dir=null) {
    // we use "/tmp" if $cache_dir is not set
    $cache_dir = isset($cache_dir) ? $cache_dir : sys_get_temp_dir();
    $lock_dir = $cache_dir . '/public_suffix_list_lock/';
    $list_dir = $cache_dir . '/public_suffix_list/';
    // refresh list all 30 days
    if (file_exists($list_dir) && @filemtime($list_dir) + 2592000 > time()) {
        return $list_dir;
    }
    // use exclusive lock to avoid race conditions
    if (!file_exists($lock_dir) && @mkdir($lock_dir)) {
        // read from source
        $list = @fopen('https://publicsuffix.org/list/public_suffix_list.dat', 'r');
        if ($list) {
            // the list is older than 30 days so delete everything first
            if (file_exists($list_dir)) {
                foreach (glob($list_dir . '*') as $filename) {
                    unlink($filename);
                }
                rmdir($list_dir);
            }
            // now set list directory with new timestamp
            mkdir($list_dir);
            // read line-by-line to avoid high memory usage
            while ($line = fgets($list)) {
                // skip comments and empty lines
                if ($line[0] == '/' || !$line) {
                    continue;
                }
                // remove wildcard
                if ($line[0] . $line[1] == '*.') {
                    $line = substr($line, 2);
                }
                // remove exclamation mark
                if ($line[0] == '!') {
                    $line = substr($line, 1);
                }
                // reverse TLD and remove linebreak
                $line = implode('.', array_reverse(explode('.', (trim($line)))));
                // we split the TLD list to reduce memory usage
                touch($list_dir . $line);
            }
            fclose($list);
        }
        @rmdir($lock_dir);
    }
    // repair locks (should never happen)
    if (file_exists($lock_dir) && mt_rand(0, 100) == 0 && @filemtime($lock_dir) + 86400 < time()) {
        @rmdir($lock_dir);
    }
    return $list_dir;
}
function get_domain($url=null) {
    // obtain location of public suffix list
    $tld_dir = tld_list();
    // no url = our own host
    $url = isset($url) ? $url : $_SERVER['SERVER_NAME'];
    // add missing scheme ftp:// http:// ftps:// https://
    $url = !isset($url[5]) || ($url[3] != ':' && $url[4] != ':' && $url[5] != ':') ? 'http://' . $url : $url;
    // remove "/path/file.html", "/:80", etc.
    $url = parse_url($url, PHP_URL_HOST);
    // replace absolute domain name by relative (http://www.dns-sd.org/TrailingDotsInDomainNames.html)
    $url = trim($url, '.');
    // check if TLD exists
    $url = explode('.', $url);
    $parts = array_reverse($url);
    foreach ($parts as $key => $part) {
        $tld = implode('.', $parts);
        if (file_exists($tld_dir . $tld)) {
            return !$key ? '' : implode('.', array_slice($url, $key - 1));
        }
        // remove last part
        array_pop($parts);
    }
    return '';
}
What makes it special:
- it accepts every input like URLs, hostnames or domains, with or without a scheme
- the list is downloaded row-by-row to avoid high memory usage
- it creates a new file per TLD in a cache folder, so get_domain() only needs to check through file_exists() whether it exists; it does not need to include a huge database on every request like TLDExtract does
- the list will be automatically updated every 30 days
Test:
$urls = array(
'http://www.example.com',// example.com
'http://subdomain.example.com',// example.com
'http://www.example.uk.com',// example.uk.com
'http://www.example.co.uk',// example.co.uk
'http://www.example.com.ac',// example.com.ac
'http://example.com.ac',// example.com.ac
'http://www.example.accident-prevention.aero',// example.accident-prevention.aero
'http://www.example.sub.ar',// sub.ar
'http://www.congresodelalengua3.ar',// congresodelalengua3.ar
'http://congresodelalengua3.ar',// congresodelalengua3.ar
'http://www.example.pvt.k12.ma.us',// example.pvt.k12.ma.us
'http://www.example.lib.wy.us',// example.lib.wy.us
'com',// empty
'.com',// empty
'http://big.uk.com',// big.uk.com
'uk.com',// empty
'www.uk.com',// www.uk.com
'.uk.com',// empty
'stackoverflow.com',// stackoverflow.com
'.foobarfoo',// empty
'',// empty
false,// empty
' ',// empty
1,// empty
'a',// empty
);
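A minimal loop to run these inputs through the function above (just a sketch; var_dump is used only so that empty results are visible):
foreach ($urls as $url) {
    var_dump(get_domain($url));
}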
Recent version with explanations (German):
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm
Answered by user2116044
$onlyHostName = implode('.', array_slice(explode('.', parse_url($link, PHP_URL_HOST)), -2));
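For context, this one-liner simply keeps the last two host labels; a quick check (the $link values are illustrations only, and note that it misjudges two-level public suffixes such as co.uk):
$link = 'http://foo.bar.example.com/baz';
echo implode('.', array_slice(explode('.', parse_url($link, PHP_URL_HOST)), -2)), PHP_EOL; // example.com
$link = 'http://www.example.co.uk';
echo implode('.', array_slice(explode('.', parse_url($link, PHP_URL_HOST)), -2)), PHP_EOL; // co.uk, not example.co.uk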
Answered by mmeyer2k
I think the best way to handle this problem is:
$second_level_domains_regex = '/\.asn\.au$|\.com\.au$|\.net\.au$|\.id\.au$|\.org\.au$|\.edu\.au$|\.gov\.au$|\.csiro\.au$|\.act\.au$|\.nsw\.au$|\.nt\.au$|\.qld\.au$|\.sa\.au$|\.tas\.au$|\.vic\.au$|\.wa\.au$|\.co\.at$|\.or\.at$|\.priv\.at$|\.ac\.at$|\.avocat\.fr$|\.aeroport\.fr$|\.veterinaire\.fr$|\.co\.hu$|\.film\.hu$|\.lakas\.hu$|\.ingatlan\.hu$|\.sport\.hu$|\.hotel\.hu$|\.ac\.nz$|\.co\.nz$|\.geek\.nz$|\.gen\.nz$|\.kiwi\.nz$|\.maori\.nz$|\.net\.nz$|\.org\.nz$|\.school\.nz$|\.cri\.nz$|\.govt\.nz$|\.health\.nz$|\.iwi\.nz$|\.mil\.nz$|\.parliament\.nz$|\.ac\.za$|\.gov\.za$|\.law\.za$|\.mil\.za$|\.nom\.za$|\.school\.za$|\.net\.za$|\.co\.uk$|\.org\.uk$|\.me\.uk$|\.ltd\.uk$|\.plc\.uk$|\.net\.uk$|\.sch\.uk$|\.ac\.uk$|\.gov\.uk$|\.mod\.uk$|\.mil\.uk$|\.nhs\.uk$|\.police\.uk$/';
$domain = $_SERVER['HTTP_HOST'];
$domain = explode('.', $domain);
$domain = array_reverse($domain);
if (preg_match($second_level_domains_regex, $_SERVER['HTTP_HOST'])) {
    $domain = "$domain[2].$domain[1].$domain[0]";
} else {
    $domain = "$domain[1].$domain[0]";
}
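A minimal sanity check of the same logic using a hard-coded host instead of $_SERVER['HTTP_HOST'] (the host value here is just an illustration):
$host = 'www.example.co.uk';
$parts = array_reverse(explode('.', $host));
if (preg_match($second_level_domains_regex, $host)) {
    echo "$parts[2].$parts[1].$parts[0]"; // example.co.uk
} else {
    echo "$parts[1].$parts[0]";
}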
Answered by happy_marmoset
I recommend using the TLDExtract library for all operations with domain names.
Answered by Ehsan Chavoshi
There are two ways to extract the subdomain from a host:
The first method, which is more accurate, is to use a database of TLDs (like public_suffix_list.dat) and match the domain against it. This is a little heavy in some cases. There are some PHP classes for using it, like php-domain-parser and TLDExtract.
The second way is not as accurate as the first one, but it is very fast and can give the correct answer in many cases. I wrote this function for it:
function get_domaininfo($url) {
    // regex can be replaced with parse_url
    preg_match("/^(https|http|ftp):\/\/(.*?)\//", "$url/", $matches);
    $parts = explode(".", $matches[2]);
    $tld = array_pop($parts);
    $host = array_pop($parts);
    if (strlen($tld) == 2 && strlen($host) <= 3) {
        $tld = "$host.$tld";
        $host = array_pop($parts);
    }
    return array(
        'protocol' => $matches[1],
        'subdomain' => implode(".", $parts),
        'domain' => "$host.$tld",
        'host' => $host,
        'tld' => $tld
    );
}
Example:
print_r(get_domaininfo('http://mysubdomain.domain.co.uk/index.php'));
Returns:
Array ( [protocol] => http [subdomain] => mysubdomain [domain] => domain.co.uk [host] => domain [tld] => co.uk )
Answered by Greg Z
Here's a function I wrote to grab the domain without subdomain(s), regardless of whether the domain is using a ccTLD or a new style long TLD, etc... There is no lookup or huge array of known TLDs, and there's no regex. It can be a lot shorter using the ternary operator and nesting, but I expanded it for readability.
// Per Wikipedia: "All ASCII ccTLD identifiers are two letters long,
// and all two-letter top-level domains are ccTLDs."
function topDomainFromURL($url) {
    $url_parts = parse_url($url);
    $domain_parts = explode('.', $url_parts['host']);
    if (strlen(end($domain_parts)) == 2) {
        // ccTLD here, get last three parts
        $top_domain_parts = array_slice($domain_parts, -3);
    } else {
        $top_domain_parts = array_slice($domain_parts, -2);
    }
    $top_domain = implode('.', $top_domain_parts);
    return $top_domain;
}
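A couple of example calls (a sketch; the expected results in the comments are mine, derived from the two-letter ccTLD heuristic above):
echo topDomainFromURL('http://foo.bar.example.com/foo/bar'), PHP_EOL; // example.com
echo topDomainFromURL('http://www.example.co.uk'), PHP_EOL;           // example.co.uk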
Answered by Kuldip
echo getDomainOnly("http://example.com/foo/bar");
function getDomainOnly($host){
    $host = strtolower(trim($host));
    $host = str_replace(array("https://", "http://"), "", $host);
    // note: ltrim($host, "www.") would treat "www." as a character set and could
    // strip leading letters from other hosts, so remove the prefix explicitly
    if (strpos($host, "www.") === 0) {
        $host = substr($host, 4);
    }
    $count = substr_count($host, '.');
    if($count === 2){
        if(strlen(explode('.', $host)[1]) > 3) $host = explode('.', $host, 2)[1];
    } else if($count > 2){
        $host = getDomainOnly(explode('.', $host, 2)[1]);
    }
    $host = explode('/', $host);
    return $host[0];
}
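A few more sample calls for reference (these examples and the expected results are mine, not part of the original answer):
echo getDomainOnly("https://www.example.com/foo"), PHP_EOL;       // example.com
echo getDomainOnly("http://foo.bar.example.co.uk/page"), PHP_EOL; // example.co.uk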
Answered by multimediaxp
Very late, but I see that you tagged the question with regex, and my function works like a charm; so far I haven't found a URL that fails:
function get_domain_regex($url){
    $pieces = parse_url($url);
    $domain = isset($pieces['host']) ? $pieces['host'] : '';
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        return $regs['domain'];
    } else {
        return false;
    }
}
If you want one without regex, I have this one, which I'm fairly sure I also took from this post:
function get_domain($url){
    $parseUrl = parse_url($url);
    $host = $parseUrl['host'];
    $host_array = explode(".", $host);
    $domain = $host_array[count($host_array)-2] . "." . $host_array[count($host_array)-1];
    return $domain;
}
They both work amazingly, BUT it took me a while to realize that if the URL doesn't start with http:// or https:// it will fail, so make sure the URL string starts with the protocol.
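One possible guard for that (just a sketch; it prepends http:// when no scheme is present before calling either function above):
$url = 'www.example.com/foo';
if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
    $url = 'http://' . $url;
}
echo get_domain_regex($url); // example.com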

