Get the Root Domain of a Link in Python
Original URL: http://stackoverflow.com/questions/1521592/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow
Get Root Domain of Link
Asked by Gavin Schulz
I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in Python?
Answered by Ben Blank
Getting the hostname is easy enough using urlparse:
import urlparse  # Python 2 module; on Python 3 it lives in urllib.parse

hostname = urlparse.urlparse("http://www.techcrunch.com/").hostname
Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
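Both of the examples above are syntactically valid hostnames, and nothing in the strings themselves marks where a "root domain" would begin; a stdlib-only illustration:

from urllib.parse import urlparse

print(urlparse("http://www.theregister.co.uk/").hostname)  # www.theregister.co.uk
print(urlparse("http://devbox12/").hostname)               # devbox12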
One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") as well as private domains which are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library:
import publicsuffix
import urlparse

def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself. Make sure you
    # update frequently, though!
    psl = publicsuffix.fetch()

    hostname = urlparse.urlparse(url).hostname

    return publicsuffix.get_public_suffix(hostname, psl)
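A quick usage sketch, reusing get_base_domain from above (the expected output assumes the PSL's ".co.uk" rule is in effect):

print(get_base_domain("http://www.theregister.co.uk/"))  # theregister.co.uk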
Answered by Mohsin
General structure of a URL:
scheme://netloc/path;parameters?query#fragment
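For example, urlparse splits a URL into exactly those six components (Python 3's urllib.parse shown):

from urllib.parse import urlparse

parts = urlparse('http://netloc/path;parameters?query#fragment')
# ParseResult(scheme='http', netloc='netloc', path='/path',
#             params='parameters', query='query', fragment='fragment')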
As per the TIMTOWTDI motto:
Using urlparse,
>>> from urllib.parse import urlparse  # Python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever')  # returns six components
>>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
>>> result = domain.replace('www.', '')  # as per your case
>>> print(result)
stackoverflow.com/
Using tldextract,
>>> import tldextract  # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
In your case:
>>> extracted = tldextract.extract('http://www.techcrunch.com/')
>>> '{}.{}'.format(extracted.domain, extracted.suffix)
'techcrunch.com'
tldextract, on the other hand, knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
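For instance, with a multi-part suffix (same ExtractResult format as above; the field layout may differ slightly across tldextract versions):

>>> tldextract.extract('http://forums.bbc.co.uk/')
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')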
Cheerio! :)
Answered by darklow
The following script is not perfect, but can be used for display/shortening purposes. If you really want/need to avoid any third-party dependencies - especially remotely fetching and caching TLD data - I can suggest the following script, which I use in my projects. It uses the last two parts of the domain for the most common domain extensions, and the last three parts for the rest of the less-known domain extensions. In the worst-case scenario, the domain will have three parts instead of two:
from urlparse import urlparse

def extract_domain(url):
    parsed_domain = urlparse(url)
    domain = parsed_domain.netloc or parsed_domain.path  # Just in case, for urls without scheme
    domain_parts = domain.split('.')
    if len(domain_parts) > 2:
        return '.'.join(domain_parts[-(2 if domain_parts[-1] in {
            'com', 'net', 'org', 'io', 'ly', 'me', 'sh', 'fm', 'us'} else 3):])
    return domain
extract_domain('google.com') # google.com
extract_domain('www.google.com') # google.com
extract_domain('sub.sub2.google.com') # google.com
extract_domain('google.co.uk') # google.co.uk
extract_domain('sub.google.co.uk') # google.co.uk
extract_domain('www.google.com') # google.com
extract_domain('sub.sub2.voila.fr') # sub2.voila.fr
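On Python 3, the only change this script needs is the import, since the urlparse module was merged into urllib.parse:

from urllib.parse import urlparse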
Answered by ospider
from urllib.parse import urlsplit  # Python 3; on Python 2: from urlparse import urlsplit

def get_domain(url):
    u = urlsplit(url)
    return u.netloc

def get_top_domain(url):
    u"""
    >>> get_top_domain('http://www.google.com')
    'google.com'
    >>> get_top_domain('http://www.sina.com.cn')
    'sina.com.cn'
    >>> get_top_domain('http://bbc.co.uk')
    'bbc.co.uk'
    >>> get_top_domain('http://mail.cs.buaa.edu.cn')
    'buaa.edu.cn'
    """
    domain = get_domain(url)
    domain_parts = domain.split('.')
    if len(domain_parts) < 2:
        return domain
    top_domain_parts = 2
    # if a domain's last part is 2 letters long, it must be a country code
    if len(domain_parts[-1]) == 2:
        if domain_parts[-1] in ['uk', 'jp']:
            if domain_parts[-2] in ['co', 'ac', 'me', 'gov', 'org', 'net']:
                top_domain_parts = 3
        else:
            if domain_parts[-2] in ['com', 'org', 'net', 'edu', 'gov']:
                top_domain_parts = 3
    return '.'.join(domain_parts[-top_domain_parts:])
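The docstring above doubles as tests; they can be checked with the standard doctest module:

if __name__ == '__main__':
    import doctest
    doctest.testmod()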
Answered by azam
Using Python 3.3 and not 2.x
I would like to add a small thing to Ben Blank's answer.
from urllib.parse import quote, unquote, urlparse

u = "http://twitter.co.uk/hello/there"  # the input URL
u = unquote(u)
g = urlparse(u)
u = g.netloc
At this point, we have just the domain name from urlparse.
To remove the subdomains, you first of all need to know which are Top-Level Domains and which are not. E.g. in http://twitter.co.uk above, co.uk is a TLD, while in http://sub.twitter.com we have only .com as the TLD and sub is a subdomain.
So, we need to get a file/list which has all the TLDs.
tlds = load_file("tlds.txt")  # tlds holds the list of TLDs
hostname = u.split(".")
if len(hostname) > 2:
    if hostname[-2].upper() in tlds:
        hostname = ".".join(hostname[-3:])
    else:
        hostname = ".".join(hostname[-2:])
else:
    hostname = ".".join(hostname[-2:])
Answered by Joe J
This worked for my purposes. I figured I'd share it.
".".join("www.sun.google.com".split(".")[-2:])