Python: Get the Root Domain of a Link

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/1521592/


Get Root Domain of Link

Tags: python, regex, dns, root

Asked by Gavin Schulz

I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in Python?


Answered by Ben Blank

Getting the hostname is easy enough using urlparse:


import urlparse  # Python 2; in Python 3, use "from urllib.parse import urlparse"

hostname = urlparse.urlparse("http://www.techcrunch.com/").hostname

Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.


One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") as well as private domains which are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library:


import publicsuffix
import urlparse

def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself.  Make sure you
    # update frequently, though!
    psl = publicsuffix.fetch()

    hostname = urlparse.urlparse(url).hostname

    return publicsuffix.get_public_suffix(hostname, psl)
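
For example, assuming the fetch succeeds and the list is current, the function above should behave roughly like this (".co.uk" and ".github.io" are both on the PSL):

print(get_base_domain("http://www.theregister.co.uk/"))  # theregister.co.uk
print(get_base_domain("http://foo.github.io/"))          # foo.github.io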

Answered by Mohsin

General structure of a URL:


scheme://netloc/path;parameters?query#fragment

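In Python, urlparse from the standard library splits a URL into exactly these components; a quick illustration with a made-up URL:

>>> from urllib.parse import urlparse
>>> urlparse('http://user@example.com:8080/path;params?query=1#fragment')
ParseResult(scheme='http', netloc='user@example.com:8080', path='/path', params='params', query='query=1', fragment='fragment')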

As the TIMTOWTDI ("there's more than one way to do it") motto says:


Using urlparse,


>>> from urllib.parse import urlparse  # python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever')  # returns six components
>>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
>>> result = domain.replace('www.', '')  # as per your case
>>> print(result)
stackoverflow.com/

Using tldextract,


>>> import tldextract  # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

in your case:


>>> extracted = tldextract.extract('http://www.techcrunch.com/')
>>> '{}.{}'.format(extracted.domain, extracted.suffix)
'techcrunch.com'
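
Recent versions of tldextract also expose a registered_domain attribute on the result, which joins domain and suffix for you (this depends on your installed version, so treat it as an assumption):

>>> extracted.registered_domain  # if your tldextract version provides it
'techcrunch.com'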

tldextract, on the other hand, knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.


Cheerio! :)


Answered by darklow

The following script is not perfect, but can be used for display/shortening purposes. If you really want/need to avoid any third-party dependencies (especially remotely fetching and caching TLD data), I can suggest the following script, which I use in my projects. It uses the last two parts of the domain for the most common domain extensions, and the last three parts for the less well-known extensions. In the worst-case scenario, the domain will have three parts instead of two:


from urlparse import urlparse  # Python 2; in Python 3: from urllib.parse import urlparse

def extract_domain(url):
    parsed_domain = urlparse(url)
    domain = parsed_domain.netloc or parsed_domain.path # Just in case, for urls without scheme
    domain_parts = domain.split('.')
    if len(domain_parts) > 2:
        return '.'.join(domain_parts[-(2 if domain_parts[-1] in {
            'com', 'net', 'org', 'io', 'ly', 'me', 'sh', 'fm', 'us'} else 3):])
    return domain

extract_domain('google.com')          # google.com
extract_domain('www.google.com')      # google.com
extract_domain('sub.sub2.google.com') # google.com
extract_domain('google.co.uk')        # google.co.uk
extract_domain('sub.google.co.uk')    # google.co.uk
extract_domain('www.google.com')      # google.com
extract_domain('sub.sub2.voila.fr')   # sub2.voila.fr

Answered by ospider

from urllib.parse import urlsplit  # Python 3; in Python 2: from urlparse import urlsplit

def get_domain(url):
    u = urlsplit(url)
    return u.netloc

def get_top_domain(url):
    u"""
    >>> get_top_domain('http://www.google.com')
    'google.com'
    >>> get_top_domain('http://www.sina.com.cn')
    'sina.com.cn'
    >>> get_top_domain('http://bbc.co.uk')
    'bbc.co.uk'
    >>> get_top_domain('http://mail.cs.buaa.edu.cn')
    'buaa.edu.cn'
    """
    domain = get_domain(url)
    domain_parts = domain.split('.')
    if len(domain_parts) < 2:
        return domain
    top_domain_parts = 2
    # if a domain's last part is 2 letters long, it must be a country code
    if len(domain_parts[-1]) == 2:
        if domain_parts[-1] in ['uk', 'jp']:
            if domain_parts[-2] in ['co', 'ac', 'me', 'gov', 'org', 'net']:
                top_domain_parts = 3
        else:
            if domain_parts[-2] in ['com', 'org', 'net', 'edu', 'gov']:
                top_domain_parts = 3
    return '.'.join(domain_parts[-top_domain_parts:])
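
Since the expected values above are written as doctests, they can be checked with the standard doctest module (assuming both functions are defined in the current module):

import doctest
doctest.testmod()  # runs the examples embedded in get_top_domain's docstring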

Answered by azam

(Using Python 3.3, not 2.x)


I would like to add a small thing to Ben Blank's answer.


from urllib.parse import quote, unquote, urlparse

u = "http://twitter.co.uk/hello/there"  # example URL
u = unquote(u)
g = urlparse(u)
u = g.netloc  # e.g. "twitter.co.uk"

At this point, I have just the domain name from urlparse.


To remove the subdomains, you first of all need to know which are top-level domains and which are not. E.g. in the above http://twitter.co.uk, co.uk is a TLD, while in http://sub.twitter.com we have only .com as the TLD and sub is a subdomain.


So, we need to get a file/list which has all the TLDs.


tlds = load_file("tlds.txt") #tlds holds the list of tlds

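load_file is not defined in the answer; a minimal sketch of such a helper, assuming tlds.txt holds one TLD per line:

def load_file(path):
    # Hypothetical helper: read one TLD per line, upper-cased so the
    # "hostname[-2].upper() in tlds" test below can match.
    with open(path) as f:
        return {line.strip().upper() for line in f if line.strip()}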

hostname = u.split(".")
if len(hostname)>2:
    if hostname[-2].upper() in tlds:
        hostname=".".join(hostname[-3:])
    else:
        hostname=".".join(hostname[-2:])
else:
    hostname=".".join(hostname[-2:])

Answered by Joe J

This worked for my purposes. I figured I'd share it.


".".join("www.sun.google.com".split(".")[-2:])