Python urlparse -- extract domain name without subdomain

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/14406300/

Asked by Clay Wardell
Need a way to extract a domain name without the subdomain from a url using Python urlparse.
For example, I would like to extract "google.com" from a full url like "http://www.google.com".
The closest I can seem to come with urlparse is the netloc attribute, but that includes the subdomain, which in this example would be www.google.com.
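For reference, netloc on its own gives exactly that (shown with Python 3's urllib.parse; the Python 2 urlparse module behaves the same way):

>>> from urllib.parse import urlparse
>>> urlparse('http://www.google.com').netloc
'www.google.com'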
I know that it is possible to write some custom string manipulation to turn www.google.com into google.com, but I want to avoid by-hand string transforms or regex in this task. (The reason for this is that I am not familiar enough with url formation rules to feel confident that I could consider every edge case required in writing a custom parsing function.)
Or, if urlparse can't do what I need, does anyone know any other Python url-parsing libraries that would?
Accepted answer by Gareth Latty
You probably want to check out tldextract, a library designed to do this kind of thing.
It uses the Public Suffix List to try and get a decent split based on known gTLDs, but do note that this is just a brute-force list, nothing special, so it can get out of date (although hopefully it's curated so as not to).
>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
So in your case:
>>> extracted = tldextract.extract('http://www.google.com')
>>> "{}.{}".format(extracted.domain, extracted.suffix)
"google.com"
Answer by Has QUIT--Anony-Mousse
This is not a standard decomposition of the URLs.
You cannot rely on the www. to be present or optional. In a lot of cases it will not be.
So if you do want to assume that only the last two components are relevant (which also won't work for the UK, e.g. www.google.co.uk), then you can do a split('.')[-2:].
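For illustration, a minimal sketch of that naive split, and where it breaks:

>>> '.'.join('www.google.com'.split('.')[-2:])
'google.com'
>>> '.'.join('www.google.co.uk'.split('.')[-2:])  # wrong: loses 'google'
'co.uk'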
Or, which is actually less error-prone, strip a www. prefix.
But either way, you cannot assume that the www. is optional, because it will NOT work every time!
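A minimal sketch of that prefix-stripping idea (the netloc value is assumed to come from urlparse):

>>> netloc = 'www.google.co.uk'
>>> netloc[4:] if netloc.startswith('www.') else netloc
'google.co.uk'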
Here is a list of common suffixes for domains. You can try to keep the suffix + one component.
https://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
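As a rough sketch of the "suffix + one component" idea: the SUFFIXES set below is a tiny hand-picked stand-in for the parsed contents of that file, and a real implementation would also have to handle the list's wildcard and exception rules:

# Tiny stand-in for the parsed effective_tld_names.dat contents.
SUFFIXES = {'com', 'co.uk', 'co.it', 'name'}

def registered_domain(hostname, suffixes=SUFFIXES):
    labels = hostname.lower().split('.')
    # Try candidate suffixes from longest to shortest.
    for i in range(len(labels)):
        if '.'.join(labels[i:]) in suffixes:
            # Keep the matched suffix plus one more label, if any.
            return '.'.join(labels[max(i - 1, 0):])
    return hostname  # no known suffix: return the input unchanged

print(registered_domain('www.google.co.uk'))  # google.co.uk
print(registered_domain('www.google.com'))    # google.com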
But how do you plan to handle for example first.last.name domains? Assume that all the users with the same last name are the same company? Initially, you would only be able to get third-level domains there. By now, you apparently can get second level, too. So for .name there is no general rule.
Answer by Andrea Moro
Using tldextract works fine, but it apparently has a problem while parsing the blogspot.com subdomain and creates a mess. If you would like to go ahead with that library, make sure to implement an if condition or something to prevent an empty string being returned for the subdomain.
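For example, a minimal sketch of such a guard (treat it as illustrative; what tldextract returns for blogspot.com addresses depends on the library version and on how the Public Suffix List's private section is handled):

>>> import tldextract
>>> ext = tldextract.extract('http://myblog.blogspot.com')
>>> # Guard against empty fields before joining them back together:
>>> parts = [p for p in (ext.subdomain, ext.domain, ext.suffix) if p]
>>> '.'.join(parts)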
Answer by Andy
This is an update, based on the bounty request for an updated answer.
Start by using the tld package. A description of the package:
Extracts the top level domain (TLD) from the URL given. List of TLD names is taken from Mozilla http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat?raw=1
from tld import get_tld
from tld.utils import update_tld_names
update_tld_names()
print get_tld("http://www.google.co.uk")
print get_tld("http://zap.co.it")
print get_tld("http://google.com")
print get_tld("http://mail.google.com")
print get_tld("http://mail.google.co.uk")
print get_tld("http://google.co.uk")
This outputs
google.co.uk
zap.co.it
google.com
google.com
google.co.uk
google.co.uk
Notice that it correctly handles country-level TLDs by leaving co.uk and co.it, but properly removes the www and mail subdomains for both .com and .co.uk.
The update_tld_names() call at the beginning of the script is used to update/sync the tld names with the most recent version from Mozilla.
Answer by Danial Frs
from tld import get_tld
from tld.utils import update_tld_names
update_tld_names()
result=get_tld('http://www.google.com')
print('https://' + result)
Input: http://www.google.com
Result: google.com
Answer by tripleee
There are multiple Python modules which encapsulate the (once Mozilla) Public Suffix List in a library, several of which don't require the input to be a URL. Even though the question asks about URL normalization specifically, my requirement was to handle just domain names, and so I'm offering a tangential answer for that.
The relative merits of publicsuffix2 over publicsuffixlist or publicsuffix are unclear, but they all seem to offer the basic functionality.
publicsuffix2:
>>> import publicsuffix # sic
>>> publicsuffix.PublicSuffixList().get_public_suffix('www.google.co.uk')
u'google.co.uk'
- Supposedly a more packaging-friendly fork of publicsuffix.
publicsuffixlist:
>>> import publicsuffixlist
>>> publicsuffixlist.PublicSuffixList().privatesuffix('www.google.co.uk')
'google.co.uk'
- Advertises idna support, which, however, I have not tested.
publicsuffix:
>>> import publicsuffix
>>> publicsuffix.PublicSuffixList(publicsuffix.fetch()).get_public_suffix('www.google.co.uk')
'google.co.uk'
- The requirement to handle the updates and caching the downloaded file yourself is a bit of a complication.
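For instance, a minimal sketch of fetching the list once and reusing a local cache, assuming the fetch()/PublicSuffixList() API shown above (the cache path is hypothetical):

import codecs
import os
import publicsuffix

CACHE = 'public_suffix_list.dat'  # hypothetical local cache file

# Download the list only if no cached copy exists yet.
if not os.path.exists(CACHE):
    with codecs.open(CACHE, 'w', 'utf-8') as f:
        f.write(publicsuffix.fetch().read())

# Load the suffix list from the cached copy.
with codecs.open(CACHE, 'r', 'utf-8') as f:
    psl = publicsuffix.PublicSuffixList(f)

print(psl.get_public_suffix('www.google.co.uk'))  # google.co.uk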

