Original question: http://stackoverflow.com/questions/24011251/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
bash (grep|awk|sed) - Extract domains from a file
Asked by user3432152
I need to extract domains from a file.
domains.txt:
eofjoejfej fjpejfe http://ejej.dm1.com dêkkde
ojdoed www.dm2.fr doejd eojd oedj eojdeo
http://dm3.org ieodhjied oejd oejdeo jd
ozjpdj eojdoê jdeojde jdejkd http://dm4.nu/
io d oed 234585 http://jehrhr.dm5.net/hjrehr
[2014-05-31 04:05] eohjpeo jdpiehd pe dpeoe www.dm6.uk/jehr
I need to get:
dm1.com dm2.fr dm3.org dm4.nu dm5.net dm6.uk
Answer by Avinash Raj
Try this sed command:
$ sed -r 's/.*(dm[^\.]*\.[^/ ]*).*/\1/g' file
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk
Answer by dogbane
This is a bit long, but should work:
grep -oE "http[^ ]*|www[^ ]*" file | sed -e 's|http://||g' -e 's/^www\.//g' -e 's|/.*$||g' -re 's/^.*\.([^\.]+\.[^\.]+$)/\1/g'
Output:
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk
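The pipeline above can be read as three stages. The sketch below spells them out under a hypothetical function name, extract_domains, which is not part of the original answer. It assumes a grep and sed that accept -E and, like the original, simply treats the last two dot-separated labels as the domain.

```shell
# extract_domains is a hypothetical helper name; $1 is the input file.
# Stage 1 (grep): keep only the tokens that start with "http" or "www".
# Stage 2 (sed):  drop the "http://" scheme, a leading "www.", and any path.
# Stage 3 (sed):  keep only the last two dot-separated labels.
extract_domains() {
  grep -oE 'http[^ ]*|www[^ ]*' "$1" |
    sed -E -e 's|^http://||' -e 's/^www\.//' -e 's|/.*$||' |
    sed -E 's/^.*\.([^.]+\.[^.]+)$/\1/'
}
```

Calling extract_domains domains.txt would then print one domain per line. Splitting the last substitution into its own sed call is only a readability choice.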
Answer by dogbane
An answer with gawk:
LC_ALL=C gawk -d -v RS="[[:space:]]+" -v FS="." '
{
# Remove the http prefix if it exists
sub( /http:[/][/]/, "" )
# Remove the path
sub( /[/].*$/, "" )
# Does it look like a domain?
if ( /^([[:alnum:]]+[.])+[[:alnum:]]+$/ ) {
# Print the last 2 components of the domain name
print $(NF-1) "." $NF
}
}' file
Some notes:
- Using RS="[[:space:]]+" allows us to process each group of letters independently.
- LC_ALL=C forces [[:alnum:]] to be ASCII-only (this is no longer necessary with gawk 4+).
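Where an awk without a regular-expression RS is the only one at hand, the same token-by-token idea can be approximated by letting tr put each whitespace-separated token on its own line first. This is a sketch of that substitution, not the original answer's method: last_two_labels is a hypothetical name, and [A-Za-z0-9] stands in for [[:alnum:]] under LC_ALL=C.

```shell
# last_two_labels is a hypothetical helper name; $1 is the input file.
last_two_labels() {
  # tr emits one token per line, standing in for RS="[[:space:]]+".
  tr -s '[:space:]' '\n' < "$1" |
    LC_ALL=C awk -F'[.]' '
      {
        sub(/^http:\/\//, "")   # remove the http prefix if it exists
        sub(/\/.*$/, "")        # remove the path
        # Changing $0 makes awk split the fields again on ".".
        if ($0 ~ /^([A-Za-z0-9]+[.])+[A-Za-z0-9]+$/)
          print $(NF-1) "." $NF # last two components of the name
      }'
}
```

Tokens without a dot, such as plain words or numbers, fail the "looks like a domain" test and are skipped.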
Answer by konsolebox
Unrefined method using grep and sed:
grep -oE '[[:alnum:]]+[.][[:alnum:]_.-]+' file | sed 's/^www\.//'
Outputs:
ejej.dm1.com
dm2.fr
dm3.org
dm4.nu
jehrhr.dm5.net
dm6.uk
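The output above still carries subdomains (ejej.dm1.com, jehrhr.dm5.net). One possible extra stage, which is not part of the original answer, keeps only the last two labels of each name. The name strip_subdomains is hypothetical, and the sketch assumes the domain is always exactly the last two labels, so a name such as example.co.uk would be reduced to co.uk.

```shell
# strip_subdomains is a hypothetical helper name; it reads host names on stdin.
strip_subdomains() {
  # Split each line on "." and print the last two fields.
  awk -F'[.]' 'NF >= 2 { print $(NF-1) "." $NF }'
}

printf '%s\n' ejej.dm1.com dm2.fr jehrhr.dm5.net | strip_subdomains
# prints:
# dm1.com
# dm2.fr
# dm5.net
```

It could be appended to the grep and sed pipeline above as a final "| strip_subdomains".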
Answer by user3692237
This can be useful:
grep -Pho "(?<=http://)[^(\"|'|[:space:])]*" file.txt | sed 's/www\.//g' | grep -Eo '[[:alnum:]]{1,}\.[[:alnum:]]{1,}[.]{0,1}[[:alnum:]]{0,}' | sort | uniq
The first grep finds URLs such as 'http://www.example.com', possibly enclosed in single or double quotes, and extracts only the part after 'http://'. Next, sed removes 'www.'. The second grep then extracts domain names made of two or three alphanumeric blocks separated by '.'. Finally, the output is sorted and deduplicated so that each domain appears only once.
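For systems where grep lacks -P, the same four steps can be sketched with extended regular expressions only. This is a variation on the answer and not the author's command: unique_http_domains is a hypothetical name, the lookbehind is replaced by matching 'http://' and stripping it afterwards, and sort -u stands in for sort | uniq.

```shell
# unique_http_domains is a hypothetical helper name; $1 is the input file.
# 1. grep: take each http:// URL up to the next quote or whitespace.
# 2. sed:  drop the scheme and a leading "www.".
# 3. grep: keep two or three dot-separated alphanumeric blocks.
# 4. sort: order the names and drop duplicates.
unique_http_domains() {
  grep -oE "http://[^\"'[:space:]]*" "$1" |
    sed -E -e 's|^http://||' -e 's/^www\.//' |
    grep -oE '[[:alnum:]]+\.[[:alnum:]]+(\.[[:alnum:]]+)?' |
    sort -u
}
```

Like the original, this only sees URLs that start with 'http://', so a bare 'www.' host name in the input is ignored.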