Finding phone numbers in 50,000 HTML pages

Time: 2020-03-05 18:49:03  Source: igfitidea

How do we find phone numbers in 50,000 HTML pages?

Jeff Atwood posted 5 Questions for programmers applying for jobs:
  
  In an effort to make life simpler for phone screeners, I've put together
  this list of Five Essential Questions
  that you need to ask during an SDE
  screen. They won't guarantee that your
  candidate will be great, but they will
  help eliminate a huge number of
  candidates who are slipping through
  our process today.
  
  1) Coding The candidate has to write
  some simple code, with correct syntax,
  in C, C++, or Java.
  
  2) OO design The candidate has to
  define basic OO concepts, and come up
  with classes to model a simple
  problem.
  
  3) Scripting and regexes The
  candidate has to describe how to find
  the phone numbers in 50,000 HTML
  pages.
  
  4) Data structures The candidate has
  to demonstrate basic knowledge of the
  most common data structures.
  
  5) Bits and bytes The candidate has
  to answer simple questions about bits,
  bytes, and binary numbers.
  
  Please understand: what I'm looking
  for here is a total vacuum in one of
  these areas. It's OK if they struggle
  a little and then figure it out. It's
  OK if they need some minor hints or
  prompting. I don't mind if they're
  rusty or slow. What you're looking for
  is candidates who are utterly
  clueless, or horribly confused, about
  the area in question.
  
  >>> The Entirety of Jeff's Original Post <<<

Note: Steve Yegge originally posed the question.

Solutions

Answer

Perl solution

Posted by: "MH" via codinghorror.com on September 5, 2008 at 7:29 AM

#!/usr/bin/perl
use strict;
use warnings;

# Matches (555) 123-4567, 555-123-4567, 555 1234567 and similar
my $phone = qr/\(?(\d\d\d)\)?[ -]?(\d\d\d)-?(\d\d\d\d)/;
my @files;

FILE:
for my $filename ( glob '*.html' ) {
    my $fh;
    unless ( open( $fh, '<', $filename ) ) {
        warn "Cannot open $filename: $!\n";
        next FILE;
    }
    my @data = <$fh>;
    close($fh);

    # Loop once through with simple search
    for my $line (@data) {
        if ( $line =~ $phone ) {
            push( @files, $filename );
            next FILE;
        }
    }

    # None found, strip html
    my $text = join( '', @data );
    $text =~ s#<[^>]+>##g;

    # Strip line breaks
    $text =~ s#[\r\n]##g;

    # Check for occurrence.
    push( @files, $filename ) if $text =~ $phone;
}

# Print out result (double quotes so that \n is a real newline)
print join( "\n", @files ), "\n";
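The second pass in the script above (strip tags and line breaks, then search again) matters because markup can split a number, as in `<b>555</b>-123-4567`. Below is a minimal Java sketch of the same two-pass idea, reusing the Perl pattern. The names `TagStripSearch` and `containsPhone` are made up for illustration, and the tag-stripping regex is as naive as the one in the script; it is not a real HTML parser.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: TagStripSearch and containsPhone are hypothetical names.
public class TagStripSearch {
    // Same pattern as the Perl answer: optional parentheses, optional separators.
    private static final Pattern PHONE =
            Pattern.compile("\\(?(\\d\\d\\d)\\)?[ -]?(\\d\\d\\d)-?(\\d\\d\\d\\d)");
    // Naive tag matcher, equivalent to the Perl substitution s#<[^>]+>##g.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static boolean containsPhone(String html) {
        // Pass 1: search the raw markup.
        if (PHONE.matcher(html).find()) {
            return true;
        }
        // Pass 2: remove tags and line breaks, then search again.
        String text = TAG.matcher(html).replaceAll("").replaceAll("[\\r\\n]", "");
        return PHONE.matcher(text).find();
    }

    public static void main(String[] args) {
        // The number is split by <b>...</b>, so only the second pass finds it.
        System.out.println(containsPhone("<p>Call <b>555</b>-123-4567</p>"));
        System.out.println(containsPhone("<p>No numbers here</p>"));
    }
}
```

Removing line breaks can also glue unrelated digit runs together, so the second pass trades some false positives for fewer misses.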

Answer

Done in Java. The regex was borrowed from this forum.

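import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// PhoneFinder is only a wrapper name so the snippet compiles on its own.
public class PhoneFinder {
    public static void main(String[] args) throws IOException {
        // Backslashes must be doubled inside a Java string literal.
        final String regex = "[\\s](\\({0,1}\\d{3}\\){0,1}" +
                "[- \\.]\\d{3}[- \\.]\\d{4})|" +
                "(\\+\\d{2}-\\d{2,4}-\\d{3,4}-\\d{3,4})";
        final Pattern phonePattern = Pattern.compile(regex);

        /* The result set */
        Set<File> files = new HashSet<File>();

        File dir = new File("/initDirPath");
        if (!dir.isDirectory()) return;

        for (File file : dir.listFiles()) {
            if (file.isDirectory()) continue;

            // try-with-resources closes the reader even if reading fails
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (phonePattern.matcher(line).find()) {
                        files.add(file);
                        break; // one match is enough for this file
                    }
                }
            }
        }

        for (File file : files) {
            System.out.println(file.getAbsolutePath());
        }
    }
}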

Ran a few tests and everything went fine! :)
Keep in mind that I'm not going for the best design here. I just implemented the algorithm.
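To see what the borrowed regex accepts, here is a small sketch that runs it against a few sample lines. The pattern is the one from the Java answer, written with doubled backslashes as a Java string literal requires. The class name `RegexCheck`, the helper `matches`, and the sample strings are invented for this illustration.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: RegexCheck and matches are hypothetical names.
public class RegexCheck {
    // The regex from the Java answer, escaped for a Java string literal.
    static final Pattern PHONE = Pattern.compile(
            "[\\s](\\({0,1}\\d{3}\\){0,1}[- \\.]\\d{3}[- \\.]\\d{4})|"
            + "(\\+\\d{2}-\\d{2,4}-\\d{3,4}-\\d{3,4})");

    static boolean matches(String line) {
        return PHONE.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Call (555) 123-4567 today",  // national format preceded by whitespace
            "Fax: +44-20-7946-0958",      // international alternative
            "555-123-4567",               // no whitespace before the number
            "Call 5551234567"             // no separators between digit groups
        };
        for (String s : samples) {
            System.out.println(matches(s) + "  " + s);
        }
    }
}
```

The first two samples match and the last two do not: the first alternative requires a whitespace character before the number and a separator between digit groups, so a number at the very start of a line, or one written as ten bare digits, is missed.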

Answer

egrep '\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}' *.html

Answer

egrep "(([0-9]{1,2}.)?[0-9]{3}.[0-9]{3}.[0-9]{4})" . -R --include='*.html'
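For comparison with the Java answer above, which only looks at one directory level, here is a rough Java sketch of what `-R --include='*.html'` does: walk the tree, keep only `.html` files, and test each line against the pattern. Unlike grep, which prints matching lines, it returns the files that contain a match. The names `HtmlGrep`, `filesWithMatch` and `hasMatch` are made up for illustration, and reading files as ISO-8859-1 is an assumption made so that unusual encodings cannot abort the scan.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch only: HtmlGrep, filesWithMatch and hasMatch are hypothetical names.
public class HtmlGrep {
    // Pattern from the egrep command above; "." accepts any single separator character.
    private static final Pattern PHONE =
            Pattern.compile("(([0-9]{1,2}.)?[0-9]{3}.[0-9]{3}.[0-9]{4})");

    static List<Path> filesWithMatch(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {   // recursive, like grep -R
            return paths
                    .filter(Files::isRegularFile)
                    // same filter as --include='*.html'
                    .filter(p -> p.getFileName().toString().endsWith(".html"))
                    .filter(HtmlGrep::hasMatch)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    private static boolean hasMatch(Path file) {
        // ISO-8859-1 decodes every byte, so no file is rejected as malformed;
        // this assumes the digits and separators of interest are ASCII.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
            return lines.anyMatch(line -> PHONE.matcher(line).find());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        filesWithMatch(root).forEach(System.out::println);
    }
}
```

Because `anyMatch` short-circuits, each file is read only up to its first matching line, which helps when there are 50,000 of them.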

Answer

I love doing these little problems; I can't help myself.

Not sure it was worth doing, since it's very similar to the Java answer.

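// Requires: using System.Collections.Generic; using System.IO; using System.Text.RegularExpressions;
private readonly Regex phoneNumExp = new Regex(@"(\({0,1}\d{3}\){0,1}[- \.]\d{3}[- \.]\d{4})|(\+\d{2}-\d{2,4}-\d{3,4}-\d{3,4})");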

public HashSet<string> Search(string dir)
{
    var numbers = new HashSet<string>();

    string[] files = Directory.GetFiles(dir, "*.html", SearchOption.AllDirectories);

    foreach (string file in files)
    {
        using (var sr = new StreamReader(file))
        {
            string line;

            while ((line = sr.ReadLine()) != null)
            {
                // Matches (not Match) so that several numbers on one line are all collected
                foreach (Match match in phoneNumExp.Matches(line))
                {
                    numbers.Add(match.Value);
                }
            }
        }
    }

    return numbers;
}

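Answer

This is why coding questions in phone interviews don't work:

Phone screener: How do you find the phone numbers in 50,000 HTML pages?

Candidate: Hang on one second (covers the phone) Hey (roommate/friend/etc. who's very good at programming), how do you find the phone numbers in 50,000 HTML pages?

Save the coding questions for early in the in-person interview, and make the phone-screen questions more personal, e.g. "I'd like to hear the details of the last time you solved a problem with code." That is a question that invites follow-ups on their specifics, and it is much harder to get someone else to answer it for you without sounding strange on the phone.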

Answer

Borrowing two things from sieben's C# answer, here's an F# snippet that will do the job. All it lacks is a way to call processDirectory, which was left out on purpose :)

open System
open System.IO
open System.Text.RegularExpressions

let rgx = Regex(@"(\({0,1}\d{3}\){0,1}[- \.]\d{3}[- \.]\d{4})|(\+\d{2}-\d{2,4}-\d{3,4}-\d{3,4})", RegexOptions.Compiled)

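// Seq.cast needs the element type spelled out; MatchCollection is not generic on older frameworks
let processFile (contents: string) = rgx.Matches(contents) |> Seq.cast<Match> |> Seq.map (fun m -> m.Value)

let processDirectory (path: string) = Directory.GetFiles(path, "*.html", SearchOption.AllDirectories) |> Seq.collect (fun file -> File.ReadAllText file |> processFile)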