bash (grep|awk|sed) - Extract domains from a file

Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL, and attribute the content to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/24011251/

Date: 2020-09-18 10:36:22  Source: igfitidea


Tags: bash, awk, sed, grep

Asked by user3432152

I need to extract domains from a file.


domains.txt:


eofjoejfej fjpejfe http://ejej.dm1.com dêkkde
ojdoed www.dm2.fr doejd eojd oedj eojdeo
http://dm3.org ieodhjied oejd oejdeo jd
ozjpdj eojdoê jdeojde jdejkd http://dm4.nu/
io d oed 234585 http://jehrhr.dm5.net/hjrehr
[2014-05-31 04:05] eohjpeo jdpiehd pe dpeoe www.dm6.uk/jehr

I need to get:


dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk

Answered by Avinash Raj

Try this sed command, which relies on every domain in the sample starting with dm:

$ sed -r 's/.*(dm[^\.]*\.[^/ ]*).*/\1/g' file
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk

Answered by dogbane

This is a bit long, but should work:


grep -oE "http[^ ]*|www[^ ]*" file | sed -e 's|http://||g' -e 's/^www\.//g' -e 's|/.*$||g' -re 's/^.*\.([^\.]+\.[^\.]+$)/\1/g'

Output:


dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk

Answered by dogbane

An answer with gawk:


LC_ALL=C gawk -v RS="[[:space:]]+" -v FS="." '
  {
    # Remove the http prefix if it exists
    sub( /http:[/][/]/, "" )

    # Remove the path
    sub( /[/].*$/, "" )

    # Does it look like a domain?
    if ( /^([[:alnum:]]+[.])+[[:alnum:]]+$/ ) {

      # Print the last 2 components of the domain name
      print $(NF-1) "." $NF

    }

  }' file

Some notes:


  • Using RS="[[:space:]]+" lets us process each whitespace-separated group of characters independently.
  • LC_ALL=C forces [[:alnum:]] to be ASCII-only (this is no longer necessary with gawk 4+).
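
To see what these two notes mean in practice, here is a small sketch that does not need gawk. It assumes tr and grep are available; the helper names tokens and is_ascii_alnum are made up for illustration and are not part of the answer. tr stands in for the RS splitting, and LC_ALL=C grep shows how a word containing a non-ASCII letter fails the [[:alnum:]] test.

```shell
# Hypothetical helper: emit one whitespace-separated token per line,
# the same units gawk sees as records when RS="[[:space:]]+".
tokens() {
  tr -s '[:space:]' '\n'
}

# Hypothetical helper: succeed only if the argument consists solely of
# ASCII letters and digits (LC_ALL=C makes [[:alnum:]] ASCII-only).
is_ascii_alnum() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[[:alnum:]]+$'
}

# One line of the sample file becomes three separate tokens.
printf 'ojdoed www.dm2.fr  doejd\n' | tokens

# A plain ASCII word passes, a word containing "ê" does not.
is_ascii_alnum 'doejd' && echo 'doejd: ASCII alphanumeric'
is_ascii_alnum 'dêkkde' || echo 'dêkkde: rejected under LC_ALL=C'
```

Each token can then be tested on its own, which is why the gawk script never has to worry about where in the line a URL appears.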

Answered by konsolebox

Unrefined method using grep and sed:


grep -oE '[[:alnum:]]+[.][[:alnum:]_.-]+' file | sed 's/^www[.]//'

Outputs:


ejej.dm1.com
dm2.fr
dm3.org
dm4.nu
jehrhr.dm5.net
dm6.uk
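
The output above still carries subdomains (ejej.dm1.com, jehrhr.dm5.net). One possible refinement is sketched below, under the assumption that the wanted domain is always the last two dot-separated labels; that assumption breaks for suffixes such as co.uk. The function name last_two_labels is hypothetical and not part of the answer.

```shell
# Hypothetical filter: read one host name per line and keep only the
# last two dot-separated labels. Assumes two-label domains.
last_two_labels() {
  # 1st expression: drop a leading "www."
  # 2nd expression: if anything precedes the last two labels, discard it
  sed -E -e 's/^www[.]//' -e 's/^.*[.]([^.]+[.][^.]+)$/\1/'
}

printf '%s\n' ejej.dm1.com www.dm2.fr jehrhr.dm5.net | last_two_labels
```

It could be appended to the pipeline in place of the final sed, for example grep -oE '[[:alnum:]]+[.][[:alnum:]_.-]+' file | last_two_labels.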

Answered by user3692237

This can be useful:


grep -Pho "(?<=http://)[^\"'[:space:]]*" file.txt | sed 's/^www\.//' | grep -Eo '[[:alnum:]]{1,}\.[[:alnum:]]{1,}[.]{0,1}[[:alnum:]]{0,}' | sort -u

The first grep picks up URLs such as 'http://www.example.com', including ones enclosed in single or double quotes, and keeps only what follows http://. Next, sed removes the leading 'www.'. The second grep then extracts the host name as two or three alphanumeric blocks separated by '.'. Finally, the output is sorted and de-duplicated so that each domain appears only once. Note that hosts written without http://, such as www.dm2.fr in the sample file, are not matched.
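
The four stages can also be wrapped in a function with one stage per line, which makes each step easier to check. The sketch below is a variation rather than the answer's exact command: it uses only grep -E (no -P), it also accepts bare www. hosts (an extension that is not in the original answer), and the name unique_domains is hypothetical.

```shell
# Hypothetical wrapper: read text on stdin, print each host once.
unique_domains() {
  # Stage 1: grab tokens starting with http:// or www. up to a quote or space
  grep -oE "(http://|www[.])[^\"'[:space:]]*" |
    # Stage 2: strip the scheme, then a leading "www."
    sed -E -e 's|^http://||' -e 's/^www[.]//' |
    # Stage 3: keep the first two or three alphanumeric labels (drops any path)
    grep -oE '^[[:alnum:]]+[.][[:alnum:]]+([.][[:alnum:]]+)?' |
    # Stage 4: sort and remove duplicates in a locale-independent order
    LC_ALL=C sort -u
}

printf '%s\n' \
  'x http://ejej.dm1.com y' \
  'a www.dm2.fr b' \
  'see http://dm4.nu/ and http://dm4.nu/path' | unique_domains
```

As in the original answer, subdomains are kept (ejej.dm1.com stays as it is), so a further trimming step is still needed if only the last two labels are wanted.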
