bash: Extract email addresses from a website using scripts

Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/13858344/


Extract email addresses from a website using scripts

Tags: bash, email, web

Asked by Open the way

Given a website, what is the best procedure, programmatically and/or with scripts, to extract every email address that appears in plain text (in the form [email protected]) on each page of that site and all pages beneath it, recursively or down to some fixed depth?


Answered by roq

Using shell programming you can achieve your goal using 2 programs piped together:


  • wget: will get all pages
  • grep: will filter and give you only the emails

An example:


wget -q -r -l 5 -O - http://somesite.com/ | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b"

wget, in quiet mode (-q), recursively (-r) fetches all pages from somesite.com down to a maximum depth of 5 (-l 5) and prints everything to stdout (-O -).


grep uses an extended regular expression (-E) and prints only (-o) the matched email addresses.


All emails will be printed to standard output; you can write them to a file by appending > somefile.txt to the command.

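Crawled pages usually repeat the same addresses many times, so it is worth deduplicating with sort -u before redirecting to a file. A small sketch of that step, with a sample string standing in for the wget output (no network needed):

```shell
# Sample text plays the role of crawled HTML; sort -u removes duplicates.
printf '<p>Contact a@b.com or c@d.org; a@b.com again.</p>\n' \
  | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
  | sort -u
```

In the real pipeline you would put `| sort -u > somefile.txt` after the grep shown above.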

Read the man pages for more documentation on wget and grep.


This example was tested with GNU bashversion 4.2.37(1)-release, GNU grep 2.12 and GNU Wget 1.13.4.


Answered by dogbane

First use wget to recursively download pages from the URL. The -l option sets the recursion depth, here 1:


$ mkdir site
$ cd site
$ wget -q -r -l1  http://www.foobar.com

Then run a recursive grepto extract the email addresses. (The regex below is not perfect and may need to be tweaked if you find that not all addresses are being picked up.)


$ grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" *
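That BRE can be sanity-checked without crawling anything by running it against a small sample file (filename is arbitrary):

```shell
# Two sample addresses; the second-level TLD case (.co.uk) is covered
# because the greedy match backtracks to leave 2-4 letters after the last dot.
printf 'mail me at jane.doe@example.co.uk or bob@test.org\n' > sample_emails.txt
grep -hio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" sample_emails.txt
rm sample_emails.txt
```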

As an aside, wget does have an option (-O -) to print downloaded content to stdout instead of saving it to disk but, unfortunately, it does not work in recursive (-r) mode.


Answered by Ofir

I would have used wget to fetch the pages recursively, and then located the addresses using regular expressions (I would have used a python script for that, but almost any environment provides the same functionality).


Answered by Rishabh

Point 1). Developers sometimes write email IDs in HTML-entity form (e.g. &#114;&#105;&#115;&#104; decodes to "rish"), so decode entities before matching.


Point 2). Emails appear in href="mailto:[email protected]" attributes, so we can target that with a regular expression.


<?php
    $str = '<div class="call-to-action ">
    <a title="Email" class="contact contact-main contact-email " 
    href="mailto:[email protected]?subject=Enquiry%2C%20sent%20from%20yellowpages.com.au&amp;
    body=%0A%0A%0A%0A%0A------------------------------------------%0AEnquiry%20via%20yellowpages.com.au%0Ahttp%3A%2F%2Fyellowpages.com.au%2Fact%2Fphillip%2Fcanberra-eye-laser-15333167-listing.html%3Fcontext%3DbusinessTypeSearch" 
    rel="nofollow" data-email="[email protected]">
    <span class="glyph icon-email border border-dark-blue with-text"></span><span class="contact-text">Email</span>
    <a href="mailto:&#114;&#105;&#115;&#104;&#97;&#98;&#104;&#100;&#117;&#98;&#101;&#121;&#50;&#48;&#64;&#103;&#109;&#97;&#105;&#108;&#46;&#99;&#111;&#109;">
    </a>
    </div>';

// $str = file_get_contents('http://example.com'); // to get emails from a URL instead of a literal string (I prefer cURL over file_get_contents for this)

     $str = html_entity_decode($str);

    $regex = '/mailto:([^?"]*)/'; // stop at '?' (start of query string) or '"' (end of attribute)
    if (preg_match_all($regex, $str, $matches_out)) {

        echo "Found a match!";
        echo "<pre>";
        var_dump($matches_out[1]); // capture group 1 holds the addresses without the mailto: prefix
    } else {
        echo "The regex pattern does not match. :(";
    }

    ?>