bash (grep|awk|sed) - Extract domains from a file

Question

Asked by user3432152

I need to extract domains from a file.

domains.txt:

eofjoejfej fjpejfe http://ejej.dm1.com dêkkde
ojdoed www.dm2.fr doejd eojd oedj eojdeo
http://dm3.org ieodhjied oejd oejdeo jd
ozjpdj eojdoê jdeojde jdejkd http://dm4.nu/
io d oed 234585 http://jehrhr.dm5.net/hjrehr
[2014-05-31 04:05] eohjpeo jdpiehd pe dpeoe www.dm6.uk/jehr

I need to get:

dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk

Answer 1

Answered by Avinash Raj

Try this sed command,

$ sed -r 's/.*(dm[^\.]*\.[^/ ]*).*/\1/g' file
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk
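
The command above keys on the literal "dm" prefix of the sample domains, and sed prints lines unchanged when the substitution does not match. A minimal sketch, assuming you want only the lines where a match was found: the -n option plus the p flag print just those. The function name extract_dm and the sample lines are made up for illustration; -E is the portable spelling of GNU sed's -r.

```shell
# Hypothetical sample: two lines from the question plus one line with no domain.
sample="$(mktemp)"
printf '%s\n' \
  'ojdoed www.dm2.fr doejd eojd' \
  'this line has no domain' \
  'io d oed 234585 http://jehrhr.dm5.net/hjrehr' > "$sample"

# -n suppresses automatic printing; the p flag prints a line only when the
# substitution succeeded, so the middle line is dropped from the output.
extract_dm() {
  sed -nE 's/.*(dm[^.]*\.[^/ ]*).*/\1/p' "$1"
}

extract_dm "$sample"
```

Because the leading .* is greedy, the capture group picks up the last "dm" on the line that is followed by a dot, which is the domain in each sample line.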

Answer 2

Answered by dogbane

This is a bit long, but should work:

grep -oE "http[^ ]*|www[^ ]*" file | sed -e 's|http://||g' -e 's/^www\.//g' -e 's|/.*$||g' -re 's/^.*\.([^\.]+\.[^\.]+$)/\1/g'

Output:

dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk

Answer 3

Answered by dogbane

An answer with gawk:

LC_ALL=C gawk -v RS="[[:space:]]+" -v FS="." '
  {
    # Remove the http prefix if it exists
    sub( /http:[/][/]/, "" )

    # Remove the path
    sub( /[/].*$/, "" )

    # Does it look like a domain?
    if ( /^([[:alnum:]]+[.])+[[:alnum:]]+$/ ) {

      # Print the last 2 components of the domain name
      print $(NF-1) "." $NF

    }

  }' file

Some notes:

Using RS="[[:space:]]+" allows us to process each whitespace-separated token independently.
LC_ALL=C forces [[:alnum:]] to be ASCII-only (this is no longer necessary with gawk 4+).
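
A regex record separator is a gawk extension. As a rough sketch of the same idea for other awk implementations, the tokens can be split with tr before awk sees them, so each token still arrives as its own record. The function name last_two_labels and the two input lines are illustrative assumptions, and [A-Za-z0-9] stands in for [[:alnum:]] because some older awk builds lack POSIX character classes.

```shell
# Read text on stdin, print the last two dot-separated labels of every
# token that looks like a host name.
last_two_labels() {
  # Put every whitespace-separated token on its own line.
  tr -s '[:space:]' '\n' |
  awk -F. '
    {
      sub(/^http:\/\//, "")   # remove the http prefix if it exists
      sub(/\/.*$/, "")        # remove the path
    }
    # Only tokens made of alphanumeric labels joined by dots get printed.
    /^([A-Za-z0-9]+[.])+[A-Za-z0-9]+$/ { print $(NF-1) "." $NF }'
}

printf '%s\n' \
  'io d oed 234585 http://jehrhr.dm5.net/hjrehr' \
  'pe dpeoe www.dm6.uk/jehr' | last_two_labels
```

Modifying the record with sub() makes awk split the fields again, which is why $(NF-1) and $NF refer to the cleaned host name.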

Answer 4

Answered by konsolebox

Unrefined method using grep and sed:

grep -oE '[[:alnum:]]+[.][[:alnum:]_.-]+' file | sed 's/www\.//'

Outputs:

ejej.dm1.com
dm2.fr
dm3.org
dm4.nu
jehrhr.dm5.net
dm6.uk

Answer 5

Answered by user3692237

This can be useful:

grep -Pho "(?<=http://)[^(\"|'|[:space:])]*" file.txt | sed 's/www\.//g' | grep -Eo '[[:alnum:]]{1,}\.[[:alnum:]]{1,}[.]{0,1}[[:alnum:]]{0,}' | sort | uniq

The first grep matches URLs such as 'http://www.example.com', including ones enclosed in single or double quotes, and extracts only the part after 'http://'. Next, sed removes 'www.'. The second grep then extracts domain names made of two or three alphanumeric blocks separated by '.'. Finally, sort | uniq orders the output so that each domain appears only once.
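
The last stage of the pipeline can be tried on its own. A small sketch, with a made-up function name (dedupe) and made-up input: sort -u gives the same result as sort | uniq, since uniq only collapses adjacent duplicates and sort is what brings them together.

```shell
# Sort stdin and keep a single instance of each line.
dedupe() {
  sort -u
}

printf '%s\n' dm3.org dm1.com dm3.org dm1.com dm2.fr | dedupe
```

Running uniq without sorting first would leave the repeated entries in place here, because none of the duplicates are next to each other.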
