Mysql 查询以从 url 中提取域
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/9280336/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Mysql query to extract domains from urls
提问by peppolone
sorry for my english
对不起我的英语不好
i have this query to extract domain from urls
我有这个查询从网址中提取域
SELECT SUBSTRING(LEFT(url, LOCATE('/', url, 8) - 1), 8) AS domain...
it works only when the url is like www.google.com/something
仅当网址类似于 www.google.com/something 时才有效
it doesn't work with urls like
它不适用于像这样的网址
www.google.it (without trailing slash)
www.google.it(不带斜杠)
www.google.it/abc/xzy/ (permalink)
www.google.it/abc/xzy/(永久链接)
do you know how to solve it?
你知道如何解决吗?
回答by LostNomad311
I had to combine some of the previous answers , plus a little more hackery for my data set . This is what works for me , it returns the domain and any sub-domains:
我不得不结合一些以前的答案,再加上我的数据集的一些hackery。这对我有用,它返回域和任何子域:
SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain
Explanation ( cause non-trivial SQL rarely makes sense ):
解释(导致非平凡的 SQL 很少有意义):
SUBSTRING_INDEX(target_url, '/', 3)
- strips any path if the url has a protocolSUBSTRING_INDEX(THAT, '://', -1)
- strips any protocol from THATSUBSTRING_INDEX(THAT, '/', 1)
- strips any path from THAT ( if there was no protocol )SUBSTRING_INDEX(THAT, '?', 1)
- strips the query string from THAT ( if there was no path or trailing / )
SUBSTRING_INDEX(target_url, '/', 3)
- 如果 url 有SUBSTRING_INDEX(THAT, '://', -1)
协议,则删除SUBSTRING_INDEX(THAT, '/', 1)
任何路径 -从 THAT 中删除任何协议 - 从 THAT 中删除任何路径(如果没有协议)SUBSTRING_INDEX(THAT, '?', 1)
- 从 THAT 中删除查询字符串(如果没有路径或尾随 / )
Test Cases:
测试用例:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain
FROM (
SELECT 'http://test.com' as target_url
UNION SELECT 'https://test.com'
UNION SELECT 'http://test.com/one'
UNION SELECT 'http://test.com/?huh'
UNION SELECT 'http://test.com?http://ouch.foo'
UNION SELECT 'test.com'
UNION SELECT 'test.com/one'
UNION SELECT 'test.com/one/two'
UNION SELECT 'test.com/one/two/three'
UNION SELECT 'test.com/one/two/three?u=http://maaaaannn'
UNION SELECT 'http://one.test.com'
UNION SELECT 'one.test.com/one'
UNION SELECT 'two.one.test.com/one' ) AS Test;
Results:
结果:
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'one.test.com'
'one.test.com'
'two.one.test.com'
回答by eville84
remove www., anysubdomain and everything after /:
删除 www., anysubdomain 和 / 之后的所有内容:
SUBSTRING_INDEX((SUBSTRING_INDEX((SUBSTRING_INDEX(url, 'http://', -1)), '/', 1)), '.', -2) as domain
回答by Al Zziwa
All the answers do not seem to work for me. For example, a url that starts variables with a ? fails for some answers. This works for me for all kinds of URLs:
所有的答案似乎都不适合我。例如,一个以 ? 开头的变量的 url。某些答案失败。这对我适用于所有类型的 URL:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(REPLACE(LOWER(url), 'https://', ''), 'http://', ''), '/', 1), '?', 1) AS domain;
回答by bikram kc
If you want to remove www. along with http://, https:// and /(path) from your domain please do this:
如果你想删除 www. 连同来自您域的 http://、https:// 和 /(路径),请执行以下操作:
SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(document_url, '/', 3), '://', -1), '/', 1), '?', 1),'www.',-1)
回答by JYelton
If you want to find the first occurrence of /
why are you passing 8 into LOCATE
as the starting position? (I assume to skip slashes after the protocol such as http:// but if you also expect url's without a protocol, consider short ones like cnn.com/page)
如果你想找到第一次出现的/
为什么你把 8LOCATE
作为起始位置?(我假设在 http:// 等协议后跳过斜杠,但如果您还希望 url 没有协议,请考虑使用像 cnn.com/page 这样的短网址)
IF(LOCATE('/', url) > 0, SUBSTRING(url, 1, LOCATE('/', url) - 1), url) AS domain
Alternatively:
或者:
SUBSTRING_INDEX(url, '/', 1) AS domain
The latter seems easier to me.
后者对我来说似乎更容易。
回答by sudhanshu sirohi
select SUBSTRING_INDEX(SUBSTRING_INDEX(URL, '://', -1),'/',1) as DOMAIN
回答by Den
This one works well on my very messy data:
这个对我非常凌乱的数据很有效:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(url, '//', '.'), '/', 1), '.', -2) AS Domain
回答by user3252809
Best to use it as it will also capture url like "www.google.co.in"
最好使用它,因为它还会捕获像“www.google.co.in”这样的网址
SELECT replace((replace((SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(url, '//', ''), '/', 1), '*', -2)), 'http:','')),'https:','') AS Domain
回答by David Sayre
The query needs to account for no trailing '/'
查询需要考虑没有尾随 '/'
IF( LOCATE('/', replace(url,'http://', '') ) > 0 , SUBSTRING(replace(url,'http://', ''), 1, LOCATE('/', replace(url,'http://', '') ) - 1), replace(url,'http://', '')) AS domain
回答by Brian Leishman
If you're not afraid of installing MySQL extensions (UDFs), then there's a UDF you can use that does exactly this while respecting different top-level domains like "google.com" and "google.co.uk"
如果您不害怕安装 MySQL 扩展 (UDF),那么您可以使用一个 UDF,它可以在尊重“google.com”和“google.co.uk”等不同顶级域的同时做到这一点
https://github.com/StirlingMarketingGroup/mysql-get-etld-p1
https://github.com/StirlingMarketingGroup/mysql-get-etld-p1
select`get_etld_p1`('http://a.very.complex-domain.co.uk:8080/foo/bar');-- 'complex-domain.co.uk'
select`get_etld_p1`('https://www.bbc.co.uk/');-- 'bbc.co.uk'
select`get_etld_p1`('https://github.com/StirlingMarketingGroup/');-- 'github.com'
select`get_etld_p1`('https://localhost:10000/index');-- 'localhost'
select`get_etld_p1`('android-app://com.google.android.gm');-- 'com.google.android.gm'
select`get_etld_p1`('example.test.domain.com');-- 'domain.com'
select`get_etld_p1`('postgres://user:[email protected]:5432/path?k=v#f');-- 'host.com'
select`get_etld_p1`('exzvk.omsk.so-ups.ru');-- 'so-ups.ru'
select`get_etld_p1`('http://10.64.3.5/data_check/index.php?r=index/rawdatacheck');-- '10.64.3.5'
select`get_etld_p1`('not a domain');-- null