Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/879807/

Time: 2020-08-11 20:35:01  Source: igfitidea

Java: Search in HashMap keys based on regex?

Tags: java, regex, hashmap

Asked by Dan Burzo

I'm building a thesaurus using a HashMap to store the synonyms.

I'm trying to search through the words based on a regular expression: the method will have to take a string as parameter and return an array of results. Here's my first stab at it:

public ArrayList<String> searchDefinition(String regex) {
    ArrayList<String> results = new ArrayList<String>();

    Pattern p = Pattern.compile(regex);

    Set<String> keys = thesaurus.keySet();
    Iterator<String> ite = keys.iterator();

    while (ite.hasNext()) {
        String candidate = ite.next();
        Matcher m = p.matcher(candidate);
        System.out.println("Attempting to match: " + candidate + " to "  + regex);
        if (m.matches()) {
            System.out.println("it matches");
            results.add(candidate);
        }
    }   

    if (results.isEmpty()) {
        return null;
    }
    else {
        return results;
    }
}

Now, this does not work as I would expect (or maybe I'm using regular expressions incorrectly). If I have the following keys in the hashmap:

cat, car, chopper

then by calling searchDefinition("c") or searchDefinition("c*") I get null.

  1. How do I make this work as expected?
  2. Is there a better data structure than HashMap for holding the graph-like structure a thesaurus needs? (Curiosity only, as for this assignment we're asked to use a Java Collection Map.)
  3. Is there anything else I'm doing inappropriately in the code above?

Thanks, Dan

EDIT: I've corrected the example. It doesn't work even if I use the correct case.

Accepted answer by Clint

You need to specify case insensitivity: Pattern.compile("c", Pattern.CASE_INSENSITIVE). To find a word with a c in it, you need to use matcher.find(). Matcher.matches() tries to match the whole string.
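
As a rough illustration of that advice, here is a minimal sketch of a key search that uses find() with a case-insensitive pattern. The class name KeySearchSketch and the method searchKeys are hypothetical names chosen for this example; they are not from the question's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not the asker's class: collects keys that contain a regex match.
class KeySearchSketch {
    static List<String> searchKeys(Iterable<String> keys, String regex) {
        // CASE_INSENSITIVE lets "c" match both "c" and "C"
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> results = new ArrayList<>();
        for (String key : keys) {
            // find() succeeds on a partial match; matches() would need the whole key to match
            if (p.matcher(key).find()) {
                results.add(key);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Cat", "car", "chopper", "dog");
        System.out.println(searchKeys(keys, "c"));
    }
}
```

Returning an empty list instead of null, as this sketch does, is a design choice of the example: callers can iterate the result without a null check.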

Answer by Kip

Regular expressions are case sensitive. You want:

Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

Answer by Randolpho

It looks like you're using your regexes inappropriately. "c" would only match a lower case c, not upper case.

That said, I'd suggest you look into using an embedded database with full text search capabilities.

Answer by Neal Maloney

Is that the regular expression you're using?

The Matcher.matches() method returns true only if the entire input sequence matches the expression (from the Javadoc), so you would need to use "c.*" in this case, not "c*", as well as matching case-insensitively.
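
To make the difference concrete, the following small sketch checks the three patterns from the question against the key "cat" using matches(). The class name MatchesSketch and the method wholeMatch are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows how matches() treats "c", "c*" and "c.*".
class MatchesSketch {
    // True only if the entire candidate matches the regex, ignoring case.
    static boolean wholeMatch(String regex, String candidate) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(candidate)
                      .matches();
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"c", "c*", "c.*"}) {
            System.out.println(regex + " against cat: " + wholeMatch(regex, "cat"));
        }
    }
}
```

Here "c" and "c*" fail because they cannot consume the trailing "at", while "c.*" matches a c followed by any remaining characters.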

Answer by Jay

But, hmm:

(a) Why would you use a HashMap if you intend to always search it sequentially? That's a lot of wasted overhead to process the hash keys and all when you never use them. Surely a simple ArrayList or LinkedList would be a better idea.

(b) What does this have to do with a thesaurus? Why would you search a thesaurus using regular expressions? If I want to know synonyms for, say, "cat", I would think that I would search for "cat", not "c.*".

My first thought on how to build a thesaurus would be ... well, I guess the first question I'd ask is, "Is synonym an equivalence relationship?", i.e. if A is a synonym for B, does it follow that B is a synonym for A? And if A is a synonym for B and B is a synonym for C, then is A a synonym for C? Assuming the answers to these questions are "yes", then what we want to build is something that divides all the words in the language into sets of synonyms, so we can then map any word in each set to all the other words in that set. So what you need is a way to take any word, map it to some sort of nexus point, and then go from that nexus point to all of the words that map to it.

This would be straightforward on a database: Just create a table with two columns, say "word" and "token", each with its own index. All synonyms map to the same token. The token can be anything as long as it's unique for any given set of synonyms, like a sequence number. Then search for the given word, find the associated token, and then get all the words with that token. For example we might create records with (big,1), (large,1), (gigantic,1), (cat,2), (feline,2), etc. Search for "big" and you get 1, then search for 1 and you get "big", "large", and "gigantic".

I don't know of any class in the built-in Java collections that does this. The easiest way I can think of is to build two co-ordinated hash tables: One that maps words to tokens, and another that maps tokens to an array of words. So table 1 might have big->1, large->1, gigantic->1, cat->2, feline->2, etc. Then table 2 maps 1->[big,large,gigantic], 2->[cat,feline], etc. You look up in the first table to map a word to a token, and in the second to map that token back to a list of words. It's clumsy because all the data is stored redundantly. Maybe there's a better solution, but I'm not getting it off the top of my head. (Well, it would be easy if we assume that we're going to sequentially search the entire list of words every time, but performance would suck as the list got big.)
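
The two-table idea could be sketched roughly as follows. The class TokenThesaurus and its methods addGroup and synonyms are hypothetical names for this illustration, and the sketch assumes each word belongs to exactly one synonym group (a word added to a second group would simply be remapped to the newer token).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two co-ordinated hash tables described above.
class TokenThesaurus {
    // Table 1: word -> token
    private final Map<String, Integer> wordToToken = new HashMap<>();
    // Table 2: token -> all words sharing that token
    private final Map<Integer, List<String>> tokenToWords = new HashMap<>();
    private int nextToken = 1; // tokens are plain sequence numbers

    // Registers one group of synonyms under a fresh token.
    void addGroup(String... words) {
        int token = nextToken++;
        List<String> group = new ArrayList<>();
        for (String word : words) {
            wordToToken.put(word, token);
            group.add(word);
        }
        tokenToWords.put(token, group);
    }

    // Looks up the token for the word, then returns every word with that token.
    List<String> synonyms(String word) {
        Integer token = wordToToken.get(word);
        if (token == null) {
            return Collections.emptyList();
        }
        return tokenToWords.get(token);
    }

    public static void main(String[] args) {
        TokenThesaurus thesaurus = new TokenThesaurus();
        thesaurus.addGroup("big", "large", "gigantic");
        thesaurus.addGroup("cat", "feline");
        System.out.println(thesaurus.synonyms("big"));
    }
}
```

Both lookups are hash lookups, so finding a word's synonyms does not require scanning the whole word list.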

Answer by SomeGuy

Responding to Jay of "But Hmm" above,

(I'd add a comment but don't have the rep.)

Searching it sequentially is doing it the slow way. Doing it with regular expressions is to descend into madness. Doing it with a database is a programming cop-out. Sure, if your data set was massive that might be required, but remember: "for this assignment we're asked to use Java Collection Map". We should be figuring out the proper way to use this Java collection.

The reason it isn't obvious is because it isn't one collection. It's two. But it isn't two maps. It's not an ArrayList. What's missing is a Set. It's a map to sets of synonyms.

Set<String> will let you build your lists of synonyms. You can make as many as you like. Two sets of synonyms would make a good example. It's a Set not an ArrayList because you don't want duplicate words.

Map<String, Set<String>> will let you quickly find your way from any word to its synonym set.

Build your sets. Then build the map. Write a helper method to build the map that takes a map and a set.

addSet(Map<String, Set<String>> map, Set<String> newSet)

This method just loops over newSet and adds each string to the map as a key, with the reference to newSet as the value. You'd call addSet once for every set.
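
A minimal sketch of such a helper might look like this. Only the addSet signature comes from the answer; the enclosing class name SynonymMapSketch and the sample words are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper class around the addSet helper described above.
class SynonymMapSketch {
    // Maps every word in newSet to newSet itself, so all synonyms share one Set object.
    static void addSet(Map<String, Set<String>> map, Set<String> newSet) {
        for (String word : newSet) {
            map.put(word, newSet);
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> thesaurus = new HashMap<>();
        addSet(thesaurus, new HashSet<>(Arrays.asList("big", "large", "gigantic")));
        addSet(thesaurus, new HashSet<>(Arrays.asList("cat", "feline")));
        // Any word in a set leads to the whole set, including the word itself
        System.out.println(thesaurus.get("large"));
    }
}
```

Because the map stores a reference to the same set for every word, the words themselves are not copied once per key.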

Now that your data structure is built, we should be able to find stuff. To make that a little more robust, remember to clean your search key before you search. Use trim() to get rid of meaningless whitespace. Use toLowerCase() to get rid of meaningless capitalization. You should have done both of these on the synonym data before (or while) building the sets. Do that and who needs regular expressions for this? This way is much faster and, more importantly, safer. Regular expressions are very powerful but can be a nightmare to debug when they go wrong. Don't use them just because you think they're cool.
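
The lookup with key cleaning could be sketched like this. The names LookupSketch, normalize, and lookup are hypothetical, and passing Locale.ROOT to toLowerCase() is an assumption of this example (it keeps the lower-casing independent of the machine's default locale); the answer itself only mentions trim() and toLowerCase().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical lookup helper: cleans the search key, then does a plain map lookup.
class LookupSketch {
    // Strips surrounding whitespace and lower-cases the key.
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the synonym set for the cleaned key, or an empty set if the word is unknown.
    static Set<String> lookup(Map<String, Set<String>> map, String rawKey) {
        Set<String> found = map.get(normalize(rawKey));
        return found == null ? Collections.<String>emptySet() : found;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> thesaurus = new HashMap<>();
        Set<String> cats = new HashSet<>(Arrays.asList("cat", "feline"));
        for (String word : cats) {
            thesaurus.put(word, cats); // keys are assumed to be stored already normalized
        }
        System.out.println(lookup(thesaurus, "  Cat "));
    }
}
```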