Mysql 查询以从 url 中提取域

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/9280336/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-31 12:08:07  来源:igfitidea点击:

Mysql query to extract domains from urls

mysql

提问by peppolone

sorry for my english

对不起我的英语不好

i have this query to extract domain from urls

我有这个查询从网址中提取域

SELECT SUBSTRING(LEFT(url, LOCATE('/', url, 8) - 1), 8) AS domain...

it works only when the url is like www.google.com/something

仅当网址类似于 www.google.com/something 时才有效

it doesn't work with urls like

它不适用于像这样的网址

www.google.it (without trailing slash)

www.google.it(不带斜杠)

www.google.it/abc/xzy/ (permalink)

www.google.it/abc/xzy/(永久链接)

do you know how to solve it?

你知道如何解决吗?

回答by LostNomad311

I had to combine some of the previous answers , plus a little more hackery for my data set . This is what works for me , it returns the domain and any sub-domains:

我不得不结合一些以前的答案,再加上我的数据集的一些hackery。这对我有用,它返回域和任何子域:

SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain

Explanation ( cause non-trivial SQL rarely makes sense ):

解释(导致非平凡的 SQL 很少有意义):

SUBSTRING_INDEX(target_url, '/', 3)- strips any path if the url has a protocol
SUBSTRING_INDEX(THAT, '://', -1)- strips any protocol from THAT
SUBSTRING_INDEX(THAT, '/', 1)- strips any path from THAT ( if there was no protocol )
SUBSTRING_INDEX(THAT, '?', 1)- strips the query string from THAT ( if there was no path or trailing / )

SUBSTRING_INDEX(target_url, '/', 3)- 如果 url 有
SUBSTRING_INDEX(THAT, '://', -1)协议,则删除
SUBSTRING_INDEX(THAT, '/', 1)任何路径 -从 THAT 中删除任何协议 - 从 THAT 中删除任何路径(如果没有协议)
SUBSTRING_INDEX(THAT, '?', 1)- 从 THAT 中删除查询字符串(如果没有路径或尾随 / )

Test Cases:

测试用例:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain
FROM ( 
    SELECT       'http://test.com' as target_url 
    UNION SELECT 'https://test.com' 
    UNION SELECT 'http://test.com/one' 
    UNION SELECT 'http://test.com/?huh' 
    UNION SELECT 'http://test.com?http://ouch.foo' 
    UNION SELECT 'test.com' 
    UNION SELECT 'test.com/one'
    UNION SELECT 'test.com/one/two'
    UNION SELECT 'test.com/one/two/three'
    UNION SELECT 'test.com/one/two/three?u=http://maaaaannn'
    UNION SELECT 'http://one.test.com'
    UNION SELECT 'one.test.com/one'
    UNION SELECT 'two.one.test.com/one' ) AS Test; 

Results:

结果:

'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'one.test.com'
'one.test.com'
'two.one.test.com'

回答by eville84

remove www., anysubdomain and everything after /:

删除 www., anysubdomain 和 / 之后的所有内容:

SUBSTRING_INDEX((SUBSTRING_INDEX((SUBSTRING_INDEX(url, 'http://', -1)), '/', 1)), '.', -2) as domain

回答by Al Zziwa

All the answers do not seem to work for me. For example, a url that starts variables with a ? fails for some answers. This works for me for all kinds of URLs:

所有的答案似乎都不适合我。例如,一个以 ? 开头的变量的 url。某些答案失败。这对我适用于所有类型的 URL:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(REPLACE(LOWER(url), 'https://', ''), 'http://', ''), '/', 1), '?', 1) AS domain;

回答by bikram kc

If you want to remove www. along with http://, https:// and /(path) from your domain please do this:

如果你想删除 www. 连同来自您域的 http://、https:// 和 /(路径),请执行以下操作:

SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(document_url, '/', 3), '://', -1), '/', 1), '?', 1),'www.',-1)

回答by JYelton

If you want to find the first occurrence of /why are you passing 8 into LOCATEas the starting position? (I assume to skip slashes after the protocol such as http:// but if you also expect url's without a protocol, consider short ones like cnn.com/page)

如果你想找到第一次出现的/为什么你把 8LOCATE作为起始位置?(我假设在 http:// 等协议后跳过斜杠,但如果您还希望 url 没有协议,请考虑使用像 cnn.com/page 这样的短网址)

IF(LOCATE('/', url) > 0, SUBSTRING(url, 1, LOCATE('/', url) - 1), url) AS domain

Alternatively:

或者:

SUBSTRING_INDEX(url, '/', 1) AS domain

The latter seems easier to me.

后者对我来说似乎更容易。

回答by sudhanshu sirohi

select SUBSTRING_INDEX(SUBSTRING_INDEX(URL, '://', -1),'/',1) as DOMAIN

回答by Den

This one works well on my very messy data:

这个对我非常凌乱的数据很有效:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(url, '//', '.'), '/', 1), '.', -2) AS Domain

回答by user3252809

Best to use it as it will also capture url like "www.google.co.in"

最好使用它,因为它还会捕获像“www.google.co.in”这样的网址

SELECT replace((replace((SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(url, '//', ''), '/', 1), '*', -2)), 'http:','')),'https:','') AS Domain

回答by David Sayre

The query needs to account for no trailing '/'

查询需要考虑没有尾随 '/'

IF( LOCATE('/', replace(url,'http://', '') ) > 0 , SUBSTRING(replace(url,'http://', ''), 1, LOCATE('/', replace(url,'http://', '') ) - 1), replace(url,'http://', '')) AS domain

回答by Brian Leishman

If you're not afraid of installing MySQL extensions (UDFs), then there's a UDF you can use that does exactly this while respecting different top-level domains like "google.com" and "google.co.uk"

如果您不害怕安装 MySQL 扩展 (UDF),那么您可以使用一个 UDF,它可以在尊重“google.com”和“google.co.uk”等不同顶级域的同时做到这一点

https://github.com/StirlingMarketingGroup/mysql-get-etld-p1

https://github.com/StirlingMarketingGroup/mysql-get-etld-p1

select`get_etld_p1`('http://a.very.complex-domain.co.uk:8080/foo/bar');-- 'complex-domain.co.uk'
select`get_etld_p1`('https://www.bbc.co.uk/');-- 'bbc.co.uk'
select`get_etld_p1`('https://github.com/StirlingMarketingGroup/');-- 'github.com'
select`get_etld_p1`('https://localhost:10000/index');-- 'localhost'
select`get_etld_p1`('android-app://com.google.android.gm');-- 'com.google.android.gm'
select`get_etld_p1`('example.test.domain.com');-- 'domain.com'
select`get_etld_p1`('postgres://user:[email protected]:5432/path?k=v#f');-- 'host.com'
select`get_etld_p1`('exzvk.omsk.so-ups.ru');-- 'so-ups.ru'
select`get_etld_p1`('http://10.64.3.5/data_check/index.php?r=index/rawdatacheck');-- '10.64.3.5'
select`get_etld_p1`('not a domain');-- null