Java fuzzy String matching with names

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/21057708/
Asked by Durandal
I've got a stand-alone CSV data loading process that I coded in Java that has to use some fuzzy string matching. It's definitely not ideal, but I don't have much choice. I am matching using a first and last name and I cache all the possibilities at the start of a run. After finding a match, I then need that person object in multiple places during the run. I used guava's Objects.hashCode() to create a hash out of the firstname and lastname.
The caching mechanism looks like this:
Map<Integer, PersonDO> personCache = Maps.newHashMap();
for (PersonDO p : dao.getPeople()) {
    personCache.put(Objects.hashCode(p.getFirstName(), p.getLastName()), p);
}
Most of the time I get hits on firstname+lastname, but when it misses I fall back to Apache's StringUtils.getLevenshteinDistance() to try and match it. This is how the matching logic flows:
person = personCache.get(Objects.hashCode(firstNameFromCSV, lastNameFromCSV));
if (person == null) { // fall back to fuzzy matching
    person = findClosetMatch(firstNameFromCSV + lastNameFromCSV);
}
This is the findClosetMatch() method:
private PersonDO findClosetMatch(String name) {
    int min = 15; // initial value
    int testVal = 0;
    PersonDO matchedPerson = null;
    for (PersonDO person : personCache.values()) {
        testVal = StringUtils.getLevenshteinDistance(name,
            person.getFirstName() + person.getLastName());
        if (testVal < min) {
            min = testVal;
            matchedPerson = person;
        }
    }
    if (matchedPerson == null) {
        // unchecked, so the method signature needs no throws clause
        throw new RuntimeException("Unable to find person: " + name);
    }
    return matchedPerson;
}
This works fine with simple spelling errors, typos and shortened names (i.e. Mike -> Michael), but when one of the incoming names is completely missing from the cache I end up returning a false positive match. To help keep this from happening I set the min value in findClosetMatch() to 15 (i.e. no more than 15 chars off); it works most of the time but I've still had a few mismatches occur: Mike Thompson hits on Mike Thomas, etc.
Outside of figuring out a way to get a primary key into the file being loaded, does anyone see a way to improve this process? Any other matching algorithms that could help here?
Accepted answer by Nicole
As I look at this problem I notice a couple key facts to base some improvements on:
Facts and observations
- Max iterations of 1000.
- 15 for Levenshtein distance sounds really high to me.
- You know, by observing the data empirically, what your fuzzy matching should look like (there are many cases for fuzzy matching and each depends on why the data is bad).
- By building this like an API you could plug in many algorithms, including your own and others like Soundex, instead of depending on just one.
Requirements
I've interpreted your problem as requiring the following two things:
- You have PersonDO objects that you want to look up via a key that is based on the name. It sounds like you want to do this because you need a pre-existing PersonDO, of which one exists per unique name, and the same name may come up more than once in your loop/workflow.
- You need "fuzzy matching" because the incoming data is not pure. For the purposes of this algorithm we'll assume that if a name "matches", it should always use the same PersonDO (in other words, a person's unique identifier is their name, which is obviously not the case in real life, but seems to work for you here).
Implementation
Next, let's look at making some improvements on your code:
1. Cleanup: unnecessary hashcode manipulation.
You don't need to generate hash codes yourself. This confuses the issue a bit.
You're simply generating a hashcode for the combination of the firstname + lastname. This is exactly what HashMap
would do if you gave it the concatenated string as a key. So, just do that (and add a space, just in case we want to reverse parse out first/last from the key later).
Map<String, PersonDO> personCache = Maps.newHashMap();

public String getPersonKey(String first, String last) {
    return first + " " + last;
}

...
// Initialization code
for (PersonDO p : dao.getPeople()) {
    personCache.put(getPersonKey(p.getFirstName(), p.getLastName()), p);
}
2. Cleanup: Build a retrieval function to perform the lookup.
Since we've changed the key in the map we need to change the lookup function. We'll build this like a mini-API. If we always knew the key exactly (i.e. unique IDs) we would of course just use Map.get
. So we'll start with that, but since we know we will need to add fuzzy matching we'll add a wrapper where this can happen:
public PersonDO findPersonDO(String searchFirst, String searchLast) {
    return personCache.get(getPersonKey(searchFirst, searchLast));
}
3. Build a fuzzy matching algorithm yourself using scoring.
Note that since you are using Guava, I've used a few conveniences here (Ordering, ImmutableList, Doubles, etc.).
First, we want to preserve the work we do to figure out how close a match is. Do this with a POJO:
class Match {
    private PersonDO candidate;
    private double score; // 0 - definitely not, 1.0 - perfect match
    // Add candidate/score constructor here
    // Add getters for candidate/score here

    public static final Ordering<Match> SCORE_ORDER =
        new Ordering<Match>() {
            @Override
            public int compare(Match left, Match right) {
                return Doubles.compare(left.score, right.score);
            }
        };
}
Next, we create a method for scoring a generic name. We should score first and last names separately, because it reduces noise. For instance, we don't care if the firstname matches any part of the last name — unless your firstname might accidentally be in the lastname field or vice versa, which you should account for intentionally and not accidentally (we'll address this later).
Note that we no longer need a "max Levenshtein distance". This is because we normalize them to length, and we will pick the closest match later. 15 character adds/edits/deletes seems very high, and since we've minimized the blank first/last name problem by scoring names separately, we could probably now pick a max of 3-4 if you wanted (scoring anything else as a 0).
// Typos on first letter are much more rare. Max score 0.3
public static final double MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH = 0.3;

public double scoreName(String searchName, String candidateName) {
    if (searchName.equals(candidateName)) return 1.0;

    int editDistance = StringUtils.getLevenshteinDistance(
        searchName, candidateName);

    // Normalize for length (cast to double to avoid integer division):
    double score =
        (candidateName.length() - editDistance) / (double) candidateName.length();

    // Artificially reduce the score if the first letters don't match
    if (searchName.charAt(0) != candidateName.charAt(0)) {
        score = Math.min(score, MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH);
    }

    // Try Soundex or other matching here. Remember that you don't want
    // to go above 1.0, so you may want to create a second score and
    // return the higher.
    return Math.max(0.0, Math.min(score, 1.0));
}
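A quick sanity check of the normalization step, with the Levenshtein implementation inlined so it runs without Guava or Commons (the class name and sample values are illustrative). One pitfall worth calling out: the length and the edit distance are both ints, so the division must be done in double arithmetic or a score like (7 - 4) / 7 truncates to 0.

```java
class ScoreDemo {
    static final double MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH = 0.3;

    // Plain DP Levenshtein, standing in for StringUtils.getLevenshteinDistance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    static double scoreName(String searchName, String candidateName) {
        if (searchName.equals(candidateName)) return 1.0;
        int editDistance = levenshtein(searchName, candidateName);
        // The (double) cast is essential; int division would give 0 here.
        double score =
            (candidateName.length() - editDistance) / (double) candidateName.length();
        if (searchName.charAt(0) != candidateName.charAt(0)) {
            score = Math.min(score, MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH);
        }
        return Math.max(0.0, Math.min(score, 1.0));
    }

    public static void main(String[] args) {
        System.out.println(scoreName("Mike", "Michael")); // ~0.43: plausible short form
        System.out.println(scoreName("Bob", "Rob"));      // capped at 0.3: first letter differs
    }
}
```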
As noted above, you could plug-in third party or other word-matching algorithms and gain from the shared knowledge of all of them.
Now, we go through the whole list and score every name. Note that I've added a spot for "tweaks". Tweaks may include:
- Reversal: If the PersonDO is "Benjamin Franklin", but the CSV sheet may contain "Franklin, Benjamin", then you will want to correct for reversed names. In this case you will probably want to add a method checkForReversal that would score the name in reverse and take that score if it is significantly higher. If it matched in reverse exactly, you would give it a 1.0 score.
- Abbreviations: You may want to give the score a bonus bump if either first/last name matches identically and the other is fully contained in the candidate (or vice versa). This might indicate an abbreviation, like "Samantha/Sam".
- Common nicknames: You could add a set of known nicknames ("Robert -> Bob, Rob, Bobby, Robby") and then score the search name against all of them and take the highest score. If it matches any of these, you would probably give it a 1.0 score.
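The nickname idea from the list above can be sketched minimally (the table and method names are invented for illustration): keep a map from canonical names to known nicknames, and short-circuit to a 1.0 score when the search name is a known alias of the candidate.

```java
import java.util.*;

class NicknameDemo {
    // Tiny illustrative table; a real one would be loaded from a data file.
    static final Map<String, Set<String>> NICKNAMES = new HashMap<>();
    static {
        NICKNAMES.put("robert", new HashSet<>(Arrays.asList("bob", "rob", "bobby", "robby")));
        NICKNAMES.put("samantha", new HashSet<>(Arrays.asList("sam")));
    }

    // True if either name is a known nickname of the other (case-insensitive).
    static boolean isKnownNickname(String search, String candidate) {
        String s = search.toLowerCase(), c = candidate.toLowerCase();
        Set<String> ofCandidate = NICKNAMES.get(c);
        Set<String> ofSearch = NICKNAMES.get(s);
        return (ofCandidate != null && ofCandidate.contains(s))
            || (ofSearch != null && ofSearch.contains(c));
    }

    public static void main(String[] args) {
        System.out.println(isKnownNickname("Bob", "Robert")); // true
        System.out.println(isKnownNickname("Sam", "Robert")); // false
    }
}
```

In scoreName you would check this first and return 1.0 on a hit, before falling back to edit distance.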
As you can see building this as a series of API's gives us logical locations to easily tweak this to our heart's content.
On with the algorithm:
public static final double MIN_SCORE = 0.3;

public List<Match> findMatches(String searchFirst, String searchLast) {
    List<Match> results = new ArrayList<Match>();

    // Keep in mind that this doesn't scale well.
    // With only 1000 names that's not even a concern a little bit, but
    // thinking ahead, here are two ideas if you need to:
    // - Keep a map of firstnames. Each entry should be a map of last names.
    //   Then, only iterate through last names if the firstname score is high
    //   enough.
    // - Score each unique first or last name only once and cache the score.
    for (PersonDO person : personCache.values()) {
        // Some of my own ideas follow, you can tweak based on your
        // knowledge of the data.

        // No reason to deal with the combined name, that just makes things
        // more fuzzy (like your problem of too-high scores when one name
        // is completely missing).
        // So, score each name individually.
        double scoreFirst = scoreName(searchFirst, person.getFirstName());
        double scoreLast = scoreName(searchLast, person.getLastName());
        double score = (scoreFirst + scoreLast) / 2.0;

        // Add tweaks or alternate scores here. If you do alternates, in most
        // cases you'll probably want to take the highest, but you may want to
        // average them if it makes more sense.
        if (score > MIN_SCORE) {
            results.add(new Match(person, score));
        }
    }

    return ImmutableList.copyOf(results);
}
Now we modify your findClosestMatch to get just the highest one from all matches (throws NoSuchElementException if none in list).
Possible tweaks:
- You may want to check if multiple names scored very close, and either report the runners-up (see below), or skip the row for manual choice later.
- You may want to report how many other matches there were (if you have a very tight scoring algorithm).
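The "multiple names scored very close" check above can be sketched like this (the margin value is an arbitrary assumption): sort the scores descending and treat the match as ambiguous when the top two are within a small margin, flagging it for manual review instead of silently picking a winner.

```java
import java.util.*;

class AmbiguityDemo {
    // A result set is "ambiguous" when the best and second-best scores are
    // within `margin` of each other.
    static boolean isAmbiguous(List<Double> scores, double margin) {
        if (scores.size() < 2) return false;
        List<Double> sorted = new ArrayList<>(scores);
        sorted.sort(Collections.reverseOrder());
        return sorted.get(0) - sorted.get(1) < margin;
    }

    public static void main(String[] args) {
        System.out.println(isAmbiguous(Arrays.asList(0.92, 0.90, 0.40), 0.05)); // true
        System.out.println(isAmbiguous(Arrays.asList(0.92, 0.55), 0.05));       // false
    }
}
```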
Code:
public Match findClosestMatch(String searchFirst, String searchLast) {
    List<Match> matches = findMatches(searchFirst, searchLast);
    // Tweak here
    return Match.SCORE_ORDER.max(matches);
}
.. and then modify our original getter:
public PersonDO findPersonDO(String searchFirst, String searchLast) {
    PersonDO person = personCache.get(getPersonKey(searchFirst, searchLast));
    if (person == null) {
        Match match = findClosestMatch(searchFirst, searchLast);
        // Do something here, based on score.
        person = match.getCandidate();
    }
    return person;
}
4. Report "fuzziness" differently.
Finally, you'll notice that findClosestMatch doesn't just return a person, it returns a Match. This is so that we can modify the program to treat fuzzy matches differently from exact matches.
Some things you probably want to do with this:
- Report guesses: Save all names that matched based on fuzziness into a list so that you can report those and they can be audited later.
- Validate first: You may want to add a control to turn on and off whether it actually uses the fuzzy matches or just reports them so that you can massage the data before it comes in.
- Data defensiveness: You may want to qualify any edits made on a fuzzy match as "uncertain". For example, you could disallow any "major edits" to a Person record if the match was fuzzy.
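The "report guesses" and "validate first" ideas above can be sketched together (the flag, field, and method names are invented for illustration): collect every fuzzy decision in an audit list, and only actually apply it when a dry-run flag is off.

```java
import java.util.*;

class FuzzyAuditDemo {
    // When true, fuzzy matches are only logged, never applied.
    static boolean dryRun = true;
    static final List<String> auditLog = new ArrayList<>();

    // Returns the name to use, recording every fuzzy decision for review.
    static String resolve(String incoming, String fuzzyCandidate) {
        auditLog.add(incoming + " -> " + fuzzyCandidate);
        return dryRun ? incoming : fuzzyCandidate;
    }

    public static void main(String[] args) {
        String used = resolve("Mike Thompson", "Mike Thomas");
        System.out.println(used);     // the original name: dry run doesn't apply the guess
        System.out.println(auditLog); // the guess is still recorded for auditing
    }
}
```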
Conclusion
As you can see it's not too much code to do this yourself. It's doubtful there is ever going to be a library that will predict names as well as you can knowing the data yourself.
Building this in pieces as I've done in the example above will allow you to iterate and tweak easily, and even plug in third-party libraries to improve your scoring instead of depending on them entirely -- faults and all.
Answer by Giorgio Desideri
- Use your db to perform the search? Use regular expressions in your select, or use the LIKE operator.
- Analyze your database and try to build a Huffman tree or several tables to perform a key-value search.
Answer by Alexey Andreev
There is no best solution; in any case you have to deal with some sort of heuristics. But you can look for another Levenshtein distance implementation (or implement it yourself). This implementation should give different scores to different character operations (insertion, deletion) for different characters. For example, you can give lower costs to pairs of characters that are close on the keyboard. Also, you may dynamically compute the maximum distance threshold based on string length.
And I have a performance tip for you. Each time you compute Levenshtein distance, n * m operations are performed, where n and m are the lengths of the strings. There is the Levenshtein automaton, which you build once and then evaluate very quickly for each string. Be careful: as an NFA is very expensive to evaluate, you need to first convert it to a DFA.
Maybe you should take a look at Lucene. I hope it includes all the fuzzy search capabilities you need. You can even use your DBMS's fulltext search, if it is supported. For example, PostgreSQL supports fulltext search.
Answer by Andrey Chaschev
This is what I did with a similar use case:
- Match the first name and the last name separately; this will do a more precise match and eliminate some of the false positives:

  distance("a b", "a c") is 33%
  max(distance("a", "a"), distance("b", "c")) is 100%
- Base your min distance criteria on the input strings' length, i.e. 0 for strings shorter than 2 symbols, 1 for strings shorter than 3 symbols.
int length = Math.min(s1.length(), s2.length());
int min;
if (length <= 2) min = 0; else
if (length <= 4) min = 1; else
if (length <= 6) min = 2; else
...
These two should work for your input.