bash: Extract email addresses from a website using scripts

Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/13858344/


Extract email addresses from a website using scripts

Tags: bash, email, web

Asked by Open the way

Given a website, what is the best procedure, programmatically and/or with scripts, to extract every email address that appears in plain text (in the form [email protected]) on each page of that site and all pages beneath it, recursively or down to some fixed depth?


Answered by roq

Using shell programming you can achieve your goal using 2 programs piped together:


  • wget: will get all pages
  • grep: will filter and give you only the emails

An example:


wget -q -r -l 5 -O - http://somesite.com/ | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b"

wget, in quiet mode (-q), recursively (-r) fetches all pages from somesite.com down to a maximum depth of 5 (-l 5) and prints everything to stdout (-O -).


grep uses an extended regular expression (-E) and prints only (-o) the matched email addresses.


All emails will be printed to standard output; you can write them to a file by appending > somefile.txt to the command.

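Crawled pages usually repeat the same addresses many times, so it is worth deduplicating with sort -u before redirecting to a file. A small sketch of that step, with a sample string standing in for the wget output (no network needed):

```shell
# Sample text plays the role of crawled HTML; sort -u removes duplicates.
printf '<p>Contact a@b.com or c@d.org; a@b.com again.</p>\n' \
  | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
  | sort -u
```

In the real pipeline you would put `| sort -u > somefile.txt` after the grep shown above.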

Read the man pages for more documentation on wget and grep.


This example was tested with GNU bashversion 4.2.37(1)-release, GNU grep 2.12 and GNU Wget 1.13.4.


Answered by dogbane

First use wget to recursively download pages from the URL. The -l option sets the recursion depth, here 1:


$ mkdir site
$ cd site
$ wget -q -r -l1  http://www.foobar.com

Then run a recursive grepto extract the email addresses. (The regex below is not perfect and may need to be tweaked if you find that not all addresses are being picked up.)


$ grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" *
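That BRE can be sanity-checked without crawling anything by running it against a small sample file (filename is arbitrary):

```shell
# Two sample addresses; the second-level TLD case (.co.uk) is covered
# because the greedy match backtracks to leave 2-4 letters after the last dot.
printf 'mail me at jane.doe@example.co.uk or bob@test.org\n' > sample_emails.txt
grep -hio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" sample_emails.txt
rm sample_emails.txt
```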

As an aside, wget does have an option (-O -) to print downloaded content to stdout instead of saving it to disk but, unfortunately, it does not work in recursive (-r) mode.


Answered by Ofir

I would have used wget to fetch the pages recursively, and then located the addresses using regular expressions (I would have used a python script for that, but almost any environment provides the same functionality).


Answered by Rishabh

Point 1). Developers sometimes write email IDs in HTML-entity form (e.g. &#114;&#105;&#115;&#104; decodes to "rish"), so decode entities before matching.


Point 2). Emails appear in href="mailto:[email protected]" attributes, so we can target that with a regular expression.


<?php
    $str = '<div class="call-to-action ">
    <a title="Email" class="contact contact-main contact-email " 
    href="mailto:[email protected]?subject=Enquiry%2C%20sent%20from%20yellowpages.com.au&amp;
    body=%0A%0A%0A%0A%0A------------------------------------------%0AEnquiry%20via%20yellowpages.com.au%0Ahttp%3A%2F%2Fyellowpages.com.au%2Fact%2Fphillip%2Fcanberra-eye-laser-15333167-listing.html%3Fcontext%3DbusinessTypeSearch" 
    rel="nofollow" data-email="[email protected]">
    <span class="glyph icon-email border border-dark-blue with-text"></span><span class="contact-text">Email</span>
    <a href="mailto:&#114;&#105;&#115;&#104;&#97;&#98;&#104;&#100;&#117;&#98;&#101;&#121;&#50;&#48;&#64;&#103;&#109;&#97;&#105;&#108;&#46;&#99;&#111;&#109;">
    </a>
    </div>';

// $str = file_get_contents('http://example.com'); // to get emails from a URL instead of a literal string (I prefer cURL over file_get_contents for this)

     $str = html_entity_decode($str);

    $regex = '/mailto:([^?"]*)/'; // stop at '?' (start of query string) or '"' (end of attribute)
    if (preg_match_all($regex, $str, $matches_out)) {

        echo "Found a match!";
        echo "<pre>";
        var_dump($matches_out[1]); // capture group 1 holds the addresses without the mailto: prefix
    } else {
        echo "The regex pattern does not match. :(";
    }

    ?>