java 检查字符串是否包含唯一字符的最简单方法?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2452166/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-29 21:13:46  来源:igfitidea点击:

Easiest way of checking if a string consists of unique characters?

javaalgorithmstring

提问by serg

I need to check in Java if a word consists of unique letters (case insensitive). As straight solution is boring, I came up with:

如果一个单词由唯一字母组成(不区分大小写),我需要检查 Java。由于直接解决方案很无聊,我想出了:

  1. For every char in a string check if indexOf(char) == lastIndexOf(char).
  2. Add all chars to HashSetand check if set size == string length.
  3. Convert a string to a char array, sort it alphabetically, loop through array elements and check if c[i] == c[i+1].
  1. 对于字符串中的每个字符,检查是否indexOf(char) == lastIndexOf(char).
  2. 将所有字符添加到HashSet并检查是否设置大小 == 字符串长度。
  3. 将字符串转换为字符数组,按字母顺序排序,遍历数组元素并检查是否c[i] == c[i+1].

Currently I like #2 the most, seems like the easiest way. Any other interesting solutions?

目前我最喜欢#2,这似乎是最简单的方法。还有其他有趣的解决方案吗?

回答by Jerry Coffin

I don't like 1. -- it's an O(N2) algorithm. Your 2. is roughly linear, but always traverses the entire string. Your 3. is O(N lg2N), with (probably) a relatively high constant -- probably almost always slower than 2.

我不喜欢 1。——这是一个 O(N 2) 算法。你的 2. 大致是线性的,但总是遍历整个字符串。你的 3. 是 O(N lg 2N),(可能)一个相对较高的常数——可能几乎总是比 2 慢。

My preference, however, would be when you try to insert a letter into the set, check whether it was already present, and if it was, you can stop immediately. Given random distribution of letters, this should require scanning only half the string on average.

然而,我的偏好是,当您尝试将一个字母插入到集合中时,检查它是否已经存在,如果存在,您可以立即停止。鉴于字母的随机分布,这应该平均只需要扫描一半的字符串。

Edit: both comments are correct that exactly what portion of the string you expect to scan will depend on the distribution and the length -- at some point the string is long enough that a repeat is inevitable, and (for example) one character short of that, the chance is still pretty darned high. In fact, given a flat random distribution (i.e., all characters in the set are equally likely), this should fit closely with the birthday paradox, meaning the chance of a collision is related to the square root of the number of possible characters in the character set. Just for example, if we assumed basic US-ASCII (128 characters) with equal probability, we'd reach a 50% chance of a collision at around 14 characters. Of course, in real strings we could probably expect it sooner than that, since the ASCII characters aren't used with anywhere close to equal frequency in most strings.

编辑:这两条评论都是正确的,您希望扫描的字符串的确切部分将取决于分布和长度——在某些时候,字符串足够长,重复是不可避免的,并且(例如)缺少一个字符那个,机会还是相当高的。事实上,给定一个平坦的随机分布(即集合中的所有字符的可能性相等),这应该与生日悖论非常吻合,这意味着碰撞的可能性与集合中可能的字符数的平方根有关。字符集。举个例子,如果我们假设基本的 US-ASCII(128 个字符)的概率相等,我们将在 14 个字符左右时达到 50% 的碰撞机会。当然,在真正的字符串中,我们可能会比这更早,因为 ASCII 字符是'

回答by Kache

Option 2 is the best of the three - Hashing is faster than searching.

选项 2 是三者中最好的 - 散列比搜索更快。

However, there's an even faster method, if you have enough memory for it.

但是,如果您有足够的内存,还有一种更快的方法。

Take advantage of the fact that a character set is limited and already enumerated, and keep track of what's appeared and what hasn't as you check each character.

利用字符集有限且已经枚举的事实,并在检查每个字符时跟踪已出现和未出现的内容。

For example, if you're using one-byte chars, there are only 256 possibilities. You would only need 256 bits to keep track as you read through the string. If the character 0x00 occurs, flip the first bit. If the character 0x05 occurs, flip the sixth bit, and so on. When an already-flipped bit is encountered, the string isn't unique.

例如,如果您使用一字节字符,则只有 256 种可能性。当您阅读字符串时,您只需要 256 位来跟踪。如果出现字符 0x00,则翻转第一位。如果出现字符 0x05,则翻转第六位,依此类推。当遇到一个已经翻转的位时,该字符串不是唯一的。

It's worst case O(min(n, m)) where n is the length of the string, and m is the size of the character set.

最坏的情况是 O(min(n, m)),其中 n 是字符串的长度,而 m 是字符集的大小。

And of course, as I saw in another person's comment, if n > m (i.e. length of string > size of character set), then by pigeon-hole principle, there is a repeated character, determinable in O(1) time.

当然,正如我在另一个人的评论中看到的,如果 n > m(即字符串的长度 > 字符集的大小),那么根据鸽巢原理,存在重复字符,可在 O(1) 时间内确定。

回答by Matthew Flaschen

I like the HashSet idea. It's conceptually simple, and only does one pass through the string. For a simple performance improvement, check the addreturn value. One thing you should be aware of is that this works by case-folding. in one direction. You could create a wrapper class around Character with different equals semantics to really be case-insensitive.

我喜欢 HashSet 的想法。它在概念上很简单,并且只有一个通过字符串。对于简单的性能改进,请检查add返回值。您应该注意的一件事是,这是通过 case-folding 来工作的。在一个方向。您可以在 Character 周围创建一个具有不同 equals 语义的包装类,以真正不区分大小写。

Interestingly, Apache Commons has a CaseInsensitiveMap(src) which works by upper-casing then lower-casing the key. As you probably know, Java's HashSet is backed by a HashMap.

有趣的是,Apache Commons 有一个CaseInsensitiveMap( src),它通过大写然后小写键来工作。您可能知道,Java 的 HashSet 由 HashMap 支持。

public static boolean allUnique(String s)
{
  // This initial capacity can be tuned.
  HashSet<Character> hs = new HashSet<Character>(s.length());
  for(int i = 0; i < s.length(); i++)
  {
    if(!hs.add(s.charAt(i).toUpperCase())
      return false;
  }
  return true;
}

回答by Brooks Moses

By "unique letters" do you mean merely the standard English set of 26, or are you allowing interesting Unicode? What result do you expect if the string contains a non-letter?

您所说的“唯一字母”是指标准的 26 个英文字母,还是您允许使用有趣的 Unicode?如果字符串包含非字母,您期望什么结果?

If you're only considering 26 possible letters, and you want to either ignore any non-letter or consider it an automatic fail, the best algorithm is likely this pseudocode:

如果您只考虑 26 个可能的字母,并且您想忽略任何非字母或将其视为自动失败,则最佳算法可能是以下伪代码:

create present[26] as an array of booleans.
set all elements of present[] to false.
loop over characters of your string
  if character is a letter
    if corresponding element of present[] is true
      return false.
    else
      set corresponding element of present[] to true.
    end if
  else
    handle non-letters
  end if
end loop

The only remaining question is whether your array should actually be an array (requiring 26 operations to zero), or a bitfield (possibly requiring more work to check/set, but can be zeroed in a single operation). I think that the bitfield access will be pretty much comparable to the array lookup, if not faster, so I expect a bitfield is the right answer.

剩下的唯一问题是您的数组实际上应该是一个数组(需要 26 次操作归零)还是位域(可能需要更多的工作来检查/设置,但可以在单个操作中归零)。我认为位域访问将与数组查找非常相似,如果不是更快,所以我希望位域是正确的答案。

回答by Blessed Geek

public boolean hasUniqChars(String s){
  Hashset h = new Hashset();
  HashSet<Character> h = new HashSet<Character>();
  for (char c : s.toCharArray()) {
  if (!h.add(Character.toUpperCase(c))) // break if already present
    return false;
  }
  return true;
}

You should use hashset technique if you are performing char sets like utf-8 and for internationalization's sake.

如果您正在执行像 utf-8 这样的字符集并且为了国际化,您应该使用 hashset 技术。

Javadoc on Character.toUpperCase for cases of utf: This method (toUpperCase(char) ) cannot handle supplementary characters. To support all Unicode characters, including supplementary characters, use the toUpperCase(int) method.

针对 utf 情况的 Character.toUpperCase 上的 Javadoc:此方法 (toUpperCase(char) ) 无法处理补充字符。要支持所有 Unicode 字符,包括增补字符,请使用 toUpperCase(int) 方法。

回答by Steve314

I would suggest a variant of (2) - use an array of "character already seen" flags instead of a hashset. As you loop through the string, exit immediately if the current character was already seen.

我会建议 (2) 的变体 - 使用“已经看到的字符”标志的数组而不是哈希集。当您遍历字符串时,如果已经看到当前字符,请立即退出。

If you have a bitvector class available (I forget whether Java provides one), you could use that, though the memory saving will not necessarily result in any speed improvement and could easily slow things down.

如果您有一个可用的 bitvector 类(我忘记 Java 是否提供了一个),您可以使用它,尽管节省内存不一定会导致任何速度提高并且很容易减慢速度。

It's O(n) worst case, though, and could have far better average performance depending on your strings - you may well find that most have a repeat near the start. In fact, strictly speaking, it's O(1) worst case anyway since a string longer than the size of the character set musthave repeated characters so you have a constant bound to the number of characters you need to check in each string.

不过,这是 O(n) 最坏的情况,并且根据您的字符串可能有更好的平均性能 - 您可能会发现大多数在开始附近都有重复。事实上,严格来说,无论如何都是 O(1) 最坏的情况,因为长度超过字符集大小的字符串必须有重复的字符,因此您有一个常量绑定到您需要在每个字符串中检查的字符数。

回答by zellio

An improvement on option 2 is to check the boolean flag that the HashSet add method returns. It's true if the object wasn't already there. Though, for this method to be at all useful you'd first have to set the string to all caps or lowercase.

选项 2 的改进是检查 HashSet add 方法返回的布尔标志。如果对象不存在,则为 true。但是,要使此方法有用,您首先必须将字符串设置为全部大写或小写。

回答by Hyman

What about using an int to store the bits corresponding to the index of the letter of the alpabhet? or maybe a long to be able to reach 64 distinct symbols.

使用 int 来存储与 alpabhet 字母索引相对应的位怎么样?或者可能需要很长时间才能达到 64 个不同的符号。

long mask;
// already lower case
string = string.toLowerCase();
for (int i = 0; i < string.length(); ++i)
{
  int index = 1 << string.charAt(i) - 'a';
  if (mask & index == index)
    return false;

  mask |= index;
}
return true;

This should be < O(n) on average case, O(n) on worst. But I'm not sure how much performant bitwise operations are in Java..

这在平均情况下应该 < O(n),在最坏情况下应该是 O(n)。但我不确定 Java 中有多少高性能的按位运算..

回答by Deven Kalra

First check if the size of string is <=26. If not , String has duplicates. return Else try adding into HashSet, if it fails, string has duplicates return. if size of HashSet is = size of string string has unique characters. If we are not allowed to use any other data structure, and string's internal methods and have to still do it in O(n), then loop thru the String.if i!=myLastIndexof(i), return Duplicates exist.

首先检查字符串的大小是否<=26。如果不是,则 String 有重复项。return 否则尝试添加到 HashSet 中,如果失败,则字符串有重复项返回。如果 HashSet 的大小是 = 字符串字符串的大小具有唯一字符。如果我们不允许使用任何其他数据结构和字符串的内部方法并且仍然必须在 O(n) 中执行,则循环遍历 String.if i!=myLastIndexof(i),返回重复项存在。

回答by codewarrior

Here's the code I wrote for Kache's answer(referred from cracking the code and modified):

这是我为 Kache 的答案编写的代码(从破解代码中引用并修改):

public boolean check() {
    int[] checker = new int[8];
    String inp = "!a~AbBC#~";
    boolean flag = true;
    if (inp.length() > 256)
        flag = false;
    else {
        for(int i=0;i<inp.length();i++) {
            int x = inp.charAt(i);
            int index = x/32;
            x = x%32;
            if((checker[index] & (1<<x)) > 0) { 
                flag = false;
                break;
            }
            else
                checker[index] = checker[index] | 1<<x;
        }
    }
    return flag;
}