Python: find http:// and/or www. and strip them from a URL, leaving domain.com

Question

Asked by Paul Tricklebank

I'm quite new to Python. I'm trying to parse a file of URLs to leave only the domain name.

Some of the URLs in my log file begin with http://, some begin with www., and some begin with both.

This is the part of my code which strips the http:// part. What do I need to add to it to look for both http and www. and remove both?

line = re.findall(r'(https?://\S+)', line)

Currently, when I run the code, only http:// is stripped. If I change the code to the following:

line = re.findall(r'(https?://www.\S+)', line)

Only domains starting with both are affected. I need the code to be more conditional. TIA

Edit: here is my full code:

import re
import sys
from urlparse import urlparse

f = open(sys.argv[1], "r")

for line in f.readlines():
    line = re.findall(r'(https?://\S+)', line)
    if line:
        parsed = urlparse(line[0])
        print parsed.hostname
f.close()

I mistagged my original post as regex. It is indeed using urlparse.

Answer 1

Accepted answer by sidi

You can do without regexes here.

with open("file_path","r") as f:
    lines = f.read()
    lines = lines.replace("http://","")
    lines = lines.replace("www.", "") # May replace some false positives ('www.com')
    urls = [url.split('/')[0] for url in lines.split()]
    print '\n'.join(urls)

Example file input:

http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com

Output:

foo.com
foobar.com
bar.com
foobar.com

Edit:

There could be a tricky URL like foobarwww.com, and the above approach would strip the www. from it as well. In that case we have to go back to using regexes.

Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'(www\.)(?!com)', r'', lines) (this needs import re). Of course, every possible TLD would have to be listed in the negative lookahead.

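One way to avoid listing TLDs is to anchor the pattern at the start of each URL, so that www. is only removed when it is the very first label. The following is a minimal Python 3 sketch of that idea, not part of the original answer; the names PREFIX and strip_prefix are made up for illustration.

```python
import re

# Optional scheme, then optional "www.", both anchored at the start of the string.
PREFIX = re.compile(r'^(?:https?://)?(?:www\.)?', re.IGNORECASE)

def strip_prefix(url):
    """Remove a leading http://, https:// and/or www., then drop any path."""
    rest = PREFIX.sub('', url.strip())
    # Cut at the first '/', '?' or '#', whichever comes first.
    return re.split(r'[/?#]', rest, maxsplit=1)[0]

for url in ['http://foo.com/index.html', 'http://www.foobar.com',
            'www.bar.com/?q=res', 'foobarwww.com']:
    print(strip_prefix(url))
```

Because the pattern is anchored, foobarwww.com is left alone. A host that is literally www.com would still lose its www., so this sketch shares that false positive with the replace() version.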

Answer 2

Answer by Tom

Check out the urlparse library, which can do these things for you automatically.

>>> import urlparse
>>> urlparse.urlsplit('http://www.google.com.au/q?test')
SplitResult(scheme='http', netloc='www.google.com.au', path='/q', query='test', fragment='')
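
In Python 3 the same function lives in urllib.parse. The short sketch below (an illustration, with variable names chosen here) shows the same split, and also what happens to a URL from the question that has no scheme: urlsplit then puts everything into path and leaves netloc empty, which is why www.-only lines need extra handling.

```python
from urllib.parse import urlsplit

with_scheme = urlsplit('http://www.google.com.au/q?test')
print(with_scheme.netloc)    # www.google.com.au
print(with_scheme.hostname)  # www.google.com.au

# Without a scheme (or a leading //) nothing is recognised as the host.
no_scheme = urlsplit('www.google.com.au/q?test')
print(repr(no_scheme.netloc))  # ''
print(no_scheme.path)          # www.google.com.au/q
```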

Answer 3

Answer by Markus Unterwaditzer

It might be overkill for this specific situation, but I'd generally use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

url = 'www.python.org'

# URLs must have a scheme
# www.python.org is an invalid URL
# http://www.python.org is valid

if not re.match(r'http(s?)\:', url):
    url = 'http://' + url

# url is now 'http://www.python.org'

parsed = urlsplit(url)

# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is '' (an empty string), since no path was given

host = parsed.netloc  # www.python.org

# Removing www.
# This is a bad idea, because www.python.org could 
# resolve to something different than python.org

if host.startswith('www.'):
    host = host[4:]
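
For a whole log file, the steps above can be wrapped in a small helper and applied line by line. This is a Python 3 sketch under stated assumptions: the name bare_host is hypothetical, lines without a scheme are assumed to be http, and the sample list stands in for the lines of the file.

```python
import re
from urllib.parse import urlsplit

def bare_host(url):
    """Return the host of url without a leading 'www.', or None if there is no host."""
    url = url.strip()
    if not re.match(r'https?://', url, re.IGNORECASE):
        url = 'http://' + url  # assumption: treat scheme-less lines as http
    host = urlsplit(url).hostname  # lower-cased, with any port removed
    if host and host.startswith('www.'):
        host = host[4:]
    return host

# Stand-in for the lines read from the log file.
lines = ['http://foo.com/index.html', 'https://www.foobar.com:8080/x', 'www.bar.com/?q=res']
print([bare_host(line) for line in lines])
```

Using hostname rather than netloc also drops a port or user:password part if one is present. The caveat from the comments above still applies: www.python.org and python.org are not guaranteed to be the same site.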

Answer 4

Answer by Muneeb Ali

You can use urlparse. Also, the solution should be generic to remove things other than 'www' before the domain name (i.e., handle cases like server1.domain.com). The following is a quick try that should work:

from urlparse import urlparse

url = 'http://www.muneeb.org/files/alan_turing_thesis.jpg'

o = urlparse(url)

domain = o.hostname

temp = domain.split('.')

# Only handles hosts with exactly three labels, e.g. www.muneeb.org
if len(temp) == 3:
    domain = temp[1] + '.' + temp[2]

print domain

Answer 5

Answer by thet

I came across the same problem. This is a solution based on regular expressions:

>>> import re
>>> rec = re.compile(r"https?://(www\.)?")

>>> rec.sub('', 'https://domain.com/bla/').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'https://domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'http://domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'http://www.domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

Answer 6

Answer by Claudiu

I believe @Muneeb Ali is the nearest to a solution, but the problem appears with something like frontdomain.domain.co.uk, which has more than three parts.

I suppose:

# Drop the first label and rejoin the rest (temp is the split hostname from the answer above)
domain = ""
for i in range(1, len(temp) - 1):
    domain += temp[i] + "."
domain += temp[-1]

Is there a nicer way to do this?

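One possible direction is to decide how many labels to keep by checking the end of the host against a list of known multi-part suffixes, instead of always dropping the first label. The Python 3 sketch below only illustrates the idea: TWO_LABEL_SUFFIXES is a tiny hand-picked set and registered_domain is a made-up name. A complete list would have to come from a maintained source such as the Public Suffix List.

```python
# Illustrative only: a tiny hand-picked set of two-label suffixes.
TWO_LABEL_SUFFIXES = {'co.uk', 'org.uk', 'com.au'}

def registered_domain(host):
    """Keep the suffix plus one label to its left."""
    labels = host.lower().split('.')
    # Keep three labels when the last two form a known suffix, otherwise two.
    keep = 3 if '.'.join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return '.'.join(labels[-keep:])

for host in ['www.muneeb.org', 'frontdomain.domain.co.uk',
             'a.b.server1.domain.com', 'domain.com']:
    print(registered_domain(host))
```

Slicing from the right means any number of leading subdomains is dropped, not just one, so a.b.server1.domain.com and www.muneeb.org are handled by the same code path.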
