Java 检测字符串是否具有唯一字符:将我的解决方案与“破解编码面试”进行比较?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/19484406/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-12 17:38:20  来源:igfitidea点击:

Detecting if a string has unique characters: comparing my solution to "Cracking the Coding Interview?"

javastringalgorithmbig-otime-complexity

提问by Seephor

I am working through the book "Cracking the Coding Interview" and I have come across questions here asking for answers, but I need help comparing my answer to the solution. My algorithm works, but I am having difficulty understand the solution in the book. Mainly because I don't understand what some of the operators are really doing.

我正在阅读“破解编码面试”这本书,在这里我遇到了一些需要答案的问题,但我需要帮助将我的答案与解决方案进行比较。我的算法有效,但我很难理解书中的解决方案。主要是因为我不明白一些运营商真正在做什么。

The task is: "Implement an algorithm to determine if a string has all unique characters. What if you cannot use additional data structures?"

任务是:“实现一个算法来确定一个字符串是否包含所有唯一的字符。如果你不能使用额外的数据结构怎么办?”

This is my solution:

这是我的解决方案:

public static boolean checkForUnique(String str){
    boolean containsUnique = false;

    for(char c : str.toCharArray()){
        if(str.indexOf(c) == str.lastIndexOf(c)){
            containsUnique = true;
        } else {
            containsUnique = false;
        }
    }

    return containsUnique;
}

It works, but how efficient is this? I saw that the complexity of the index functions for String in Java are O(n*m)

它有效,但它的效率如何?我看到Java中String的索引函数的复杂度是O(n*m)

Here is the solution from the book:

这是书中的解决方案:

public static boolean isUniqueChars(String str) {
    if (str.length() > 256) {
        return false;
    }
    int checker = 0;
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i) - 'a';
        if ((checker & (1 << val)) > 0) return false;
        checker |= (1 << val);
    }
    return true;
}

A couple things I am not quite understanding with the solution. First, what does the "|=" operator do? Why is 'a' subtracted from the current character in the string for the value of "val"? I know "<<" is a bitwise left shift, but what does (checker & (1<<val))do? I know it is bitwise and, but I am not understanding it since I am not understanding the line where checker gets a value.

我对解决方案不太了解的几件事。首先,“|=”运算符有什么作用?为什么从字符串中的当前字符中减去“a”以获取“val”的值?我知道“<<”是按位左移,但有什么作用(checker & (1<<val))呢?我知道它是按位的,但我不理解它,因为我不理解 checker 获取值的行。

I am just not familiar with these operations and unfortunately the book does not give an explanation of the solutions, probably because it assumes you already understand these operations.

我只是不熟悉这些操作,不幸的是这本书没有给出解决方案的解释,可能是因为它假设您已经了解这些操作。

采纳答案by templatetypedef

There are two separate questions here: what's the efficiency of your solution, and what is the reference solution doing? Let's treat each independently.

这里有两个单独的问题:您的解决方案的效率如何,参考解决方案在做什么?让我们独立对待每个人。

First, your solution:

首先,您的解决方案:

public static boolean checkForUnique(String str){
    boolean containsUnique = false;

    for(char c : str.toCharArray()){
        if(str.indexOf(c) == str.lastIndexOf(c)){
            containsUnique = true;
        } else {
            containsUnique = false;
        }
    }

    return containsUnique;
}

Your solution essentially consists of a loop over all characters in the string (let's say there are n of them), checking on each iteration whether the first and last index of the characters are the same. The indexOfand lastIndexOfmethods each take time O(n), because they have to scan across all the characters of the string to determine if any of them match the one you're looking for. Therefore, since your loop runs O(n) times and does O(n) work per iteration, its runtime is O(n2).

您的解决方案基本上包括对字符串中所有字符的循环(假设有 n 个),检查每次迭代中字符的第一个和最后一个索引是否相同。该indexOflastIndexOf方法都需要时间为O(n),因为他们有所有扫描字符串的字符,以确定其中是否匹配你要找的人。因此,由于您的循环运行 O(n) 次并且每次迭代执行 O(n) 工作,因此其运行时间为 O(n 2)。

However, there's something iffy about your code. Try running it on the string aab. Does it work correctly on this input? As a hint, as soon as you determine that there are two or more duplicated characters, you're guaranteed that there are duplicates and you can return that not all characters are unique.

但是,您的代码有些不确定。尝试在 string 上运行它aab。它在这个输入上工作正常吗?作为提示,一旦您确定存在两个或更多重复字符,就可以保证存在重复字符,并且您可以返回并非所有字符都是唯一的。

Now, let's look at the reference:

现在,让我们看看参考:

public static boolean isUniqueChars(String str) {
    if (str.length() > 256) { // NOTE: Are you sure this isn't 26?
        return false;
    }
    int checker = 0;
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i) - 'a';
        if ((checker & (1 << val)) > 0) return false;
        checker |= (1 << val);
    }
    return true;
}

This solution is cute. The basic idea is the following: imagine that you have an array of 26 booleans, each one tracking whether a particular character has appeared in the string already. You start with all of them false. You then iterate across the characters of the string, and each time you see a character you look into the array slot for that character. If it's false, this is the first time you've seen the character and you can set the slot to true. If it's true, you've already seen this character and you can immediately report that there's a duplicate.

这个解决方案很可爱。基本思想如下:假设您有一个包含 26 个布尔值的数组,每个布尔值都跟踪某个特定字符是否已经出现在字符串中。你一开始所有的都是假的。然后遍历字符串的字符,每次看到一个字符时,都会查看该字符的数组槽。如果是false,这是您第一次看到该角色,您可以将插槽设置​​为true。如果是true,则您已经看到了此字符,并且可以立即报告存在重复项。

Notice that this method doesn't allocate an array of booleans. Instead, it opts for a clever trick. Since there are only 26 different characters possible and there are 32 bits in an int, the solution creates an intvariable where each bit of the variable corresponds to one of the characters in the string. Instead of reading and writing an array, the solution reads and writes the bits of the number.

请注意,此方法不分配布尔数组。相反,它选择了一个聪明的技巧。由于可能只有 26 个不同的字符,并且 a 中有 32 位int,因此该解决方案创建了一个int变量,其中变量的每一位对应于字符串中的一个字符。该解决方案不是读取和写入数组,而是读取和写入数字的位。

For example, look at this line:

例如,看看这一行:

if ((checker & (1 << val)) > 0) return false;

What does checker & (1 << val)do? Well, 1 << valcreates an intvalue that has all bits zero except for the valth bit. It then uses bitwise AND to AND this value with checker. If the bit at position valin checkeris already set, then this evaluates to a nonzero value (meaning we've already seen the number) and we can return false. Otherwise, it evaluates to 0, and we haven't seen the number.

有什么作用checker & (1 << val)?好吧,1 << val创建一个int值,除了val第 th 位之外,所有位都为零。然后它使用按位 AND 将该值与checker. 如果位置valin的位checker已经设置,那么它的计算结果为非零值(意味着我们已经看到了数字)并且我们可以返回 false。否则,它的计算结果为 0,我们还没有看到这个数字。

The next line is this:

下一行是这样的:

checker |= (1 << val);

This uses the "bitwise OR with assignment" operator, which is equivalent to

这使用了“按位或赋值”运算符,相当于

checker = checker | (1 << val);

This ORs checkerwith a value that has a 1 bit set only at position val, which turns the bit on. It's equivalent to setting the valth bit of the number to 1.

此 ORchecker与仅在位置 设置 1 位的值进行OR 运算val,从而打开该位。相当于将数字的val第 th 位设置为 1。

This approach is much faster than yours. First, since the function starts off by checking if the string has length greater than 26 (I'm assuming the 256 is a typo), the function never has to test any string of length 27 or greater. Therefore, the inner loop runs at most 26 times. Each iteration does O(1) work in bitwise operations, so the overall work done is O(1) (O(1) iterations times O(1) work per iteration), which is significantlyfaster than your implementation.

这种方法比你的快得多。首先,由于函数首先检查字符串的长度是否大于 26(我假设 256 是一个错字),因此函数永远不必测试任何长度为 27 或更大的字符串。因此,内循环最多运行 26 次。每次迭代在按位运算中执行 O(1) 工作,因此完成的总体工作为 O(1)(每次迭代 O(1) 迭代乘以 O(1) 工作),这明显快于您的实现。

If you haven't seen bitwise operations used this way, I'd recommend searching for "bitwise operators" on Google to learn more.

如果您还没有见过以这种方式使用的按位运算,我建议您在 Google 上搜索“按位运算符”以了解更多信息。

Hope this helps!

希望这可以帮助!

回答by Adam Liss

Since a charvalue can hold one of only 256 different values, any string that's longer than 256 characters mustcontain at least one duplicate.

由于一个char值只能包含 256 个不同值之一,因此任何长度超过 256 个字符的字符串必须至少包含一个重复项。

The remainder of the code uses checkeras a sequence of bits, with each bit representing one character. It seems to convert each character to an integer, starting with a= 1. It then checks the corresponding bit in checker. If it's set, it means that character has already been seen, and we therefore know that the string contains at least one duplicate character. If the character hasn't yet been seen, the code sets the corresponding bit in checkerand continues.

代码的其余部分checker用作位序列,每一位代表一个字符。它似乎将每个字符转换为整数,从a= 1开始。然后检查checker. 如果设置了,则表示该字符已被看到,因此我们知道该字符串至少包含一个重复字符。如果尚未看到该字符,则代码将相应的位置入checker并继续。

Specifically, (1<<val)generates an integer with a single 1bit in position val. For example, (1<<3)would be binary 1000, or 8. The expression checker & (1<<val)will return zero if the bit in position valis not set (that is, has value 0) in checker, and (1<<val), which is always non-zero, if that bit is set. The expression checker |= (1<<val)will set that bit in checker.

具体来说,(1<<val)生成一个整数,1位置为一位val。例如,(1<<3)将二进制的1000,或8中的表达checker & (1<<val)将返回零,如果在位置的位val没有被设置(即,具有值0) checker,并且(1<<val),这始终是非零值,如果该位被置位。该表达式checker |= (1<<val)将在checker.

However, the algorithm seems to be flawed: it doesn't seem to account for the uppercase characters and punctuation (which generally come before the lowercase ones lexicographically). It would also seem to require a 256-bit integer, which is not standard.

但是,该算法似乎有缺陷:它似乎没有考虑大写字符和标点符号(按字典顺序通常在小写字符之前)。它似乎也需要一个 256 位整数,这不是标准的。

As rolflmentions in the comment below, I prefer your solution because it works. You can optimize it by returning falseas soon as you identify a non-unique character.

正如rolfl在下面的评论中提到的,我更喜欢你的解决方案,因为它有效。您可以通过false在确定非唯一字符后立即返回来优化它。

回答by rolfl

The book solution is one I don't like and I believe is dysfunctional..... templatetypedef has posted a comprehensive answer which indicates that the solution is a good one. I disagree, since the book's answer assumes that the string only has lower-case characters, (ascii) and does no validation to ensure that.

这本书的解决方案是我不喜欢的,我认为它是功能失调的..... templatetypedef 发布了一个全面的答案,表明该解决方案是一个很好的解决方案。我不同意,因为这本书的答案假定字符串只有小写字符(ascii)并且没有验证来确保这一点。

public static boolean isUniqueChars(String str) {
    // short circuit - supposed to imply that
    // there are no more than 256 different characters.
    // this is broken, because in Java, char's are Unicode,
    // and 2-byte values so there are 32768 values
    // (or so - technically not all 32768 are valid chars)
    if (str.length() > 256) {
        return false;
    }
    // checker is used as a bitmap to indicate which characters
    // have been seen already
    int checker = 0;
    for (int i = 0; i < str.length(); i++) {
        // set val to be the difference between the char at i and 'a'
        // unicode 'a' is 97
        // if you have an upper-case letter e.g. 'A' you will get a
        // negative 'val' which is illegal
        int val = str.charAt(i) - 'a';
        // if this lowercase letter has been seen before, then
        // the corresponding bit in checker will have been set and
        // we can exit immediately.
        if ((checker & (1 << val)) > 0) return false;
        // set the bit to indicate we have now seen the letter.
        checker |= (1 << val);
    }
    // none of the characters has been seen more than once.
    return true;
}

The bottom line, given templatedef's answer too, is that there's not actually enough information to determine whether the book's answer is right.

最重要的是,给出了 templatedef 的答案,实际上没有足够的信息来确定这本书的答案是否正确。

I distrust it though.

我不相信它。

templatedef's answer on the complexity is one I agree with though.... ;-)

templatedef 对复杂性的回答是我同意的…… ;-)

EDIT: As an exercise, I converted the book's answer to one that will work (albeit slower than the book's answer - BigInteger is slow).... This version does the same logic as the book's, but does not have the same validation and assumption problems (but it is slower). It is useful to show the logic too.

编辑:作为练习,我将这本书的答案转换为一个可行的答案(尽管比这本书的答案慢 - BigInteger 很慢)......这个版本与本书的逻辑相同,但没有相同的验证和假设问题(但速度较慢)。显示逻辑也很有用。

public static boolean isUniqueChars(String str) {
    if (str.length() > 32768) {
        return false;
    }
    BigInteger checker = new BigInteger(0);
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (checker.testBit(val)) return false;
        checker = checker.setBit(val);
    }
    // none of the characters has been seen more than once.
    return true;
}

回答by wolfie

This is the necessary correction to the book's code:

这是对本书代码的必要更正:

public static boolean checkForUnique(String str){
    boolean containsUnique = false;

    for(char c : str.toCharArray()){
        if(str.indexOf(c) == str.lastIndexOf(c)){
            containsUnique = true;
        } else {
            return false;
        }
    }

    return containsUnique;
}

回答by allen joseph

The solution from the book is case insensitive. 'A' and 'a' is considered duplicate as per the implementation.

书中的解决方案不区分大小写。根据实现,“A”和“a”被认为是重复的。

Explanation: for input string with char 'A', 'A' - 'a' is -32 so '1 << val' will be evaluated as 1 << -32. shift on any negative number will shift the bits in opposite direction. thus 1 << -32 will be 1 >> 32. Which will set the first bit to 1. This is also the case with char 'a'. Thus 'A' and 'a' are considered duplicate characters. Similarly for 'B' and 'b' second bit is set to 1 and so on.

说明:对于带有字符 'A' 的输入字符串,'A' - 'a' 是 -32,因此 '1 << val' 将被评估为 1 << -32。shift 任何负数都会向相反的方向移动位。因此 1 << -32 将是 1 >> 32。这会将第一位设置为 1。这也是 char 'a' 的情况。因此,'A' 和 'a' 被视为重复字符。同样,对于 'B' 和 'b' 第二位设置为 1,依此类推。

回答by Jim West

6th edition update

第六版更新

    public static void main(String[] args) {
        System.out.println(isUniqueChars("abcdmc")); // false
        System.out.println(isUniqueChars("abcdm")); // true
        System.out.println(isUniqueChars("abcdm\u0061")); // false because \u0061 is unicode a
    }


    public static boolean isUniqueChars(String str) {
        /*
         You should first ask your interviewer if the string is an ASCII string or a Unicode string.
         Asking this question will show an eye for detail and a solid foundation in computer science.
         We'll assume for simplicity the character set is ASCII.
         If this assumption is not valid, we would need to increase the storage size.
         */
        // at 6th edition of the book, there is no pre condition on string's length
        /*
         We can reduce our space usage by a factor of eight by using a bit vector.
         We will assume, in the below code, that the string only uses the lowercase letters a through z.
         This will allow us to use just a single int.
          */
        // printing header to provide nice csv format log, you may uncomment
//        System.out.println("char,val,valBinaryString,leftShift,leftShiftBinaryString,checker");
        int checker = 0;
        for (int i = 0; i < str.length(); i++) {
            /*
                Dec Binary Character
                97  01100001    a
                98  01100010    b
                99  01100011    c
                100 01100100    d
                101 01100101    e
                102 01100110    f
                103 01100111    g
                104 01101000    h
                105 01101001    i
                106 01101010    j
                107 01101011    k
                108 01101100    l
                109 01101101    m
                110 01101110    n
                111 01101111    o
                112 01110000    p
                113 01110001    q
                114 01110010    r
                115 01110011    s
                116 01110100    t
                117 01110101    u
                118 01110110    v
                119 01110111    w
                120 01111000    x
                121 01111001    y
                122 01111010    z
             */
            // a = 97 as you can see in ASCII table above
            // set val to be the difference between the char at i and 'a'
            // b = 1, d = 3.. z = 25
            char c = str.charAt(i);
            int val = c - 'a';
            // means "shift 1 val numbers places to the left"
            // for example; if str.charAt(i) is "m", which is the 13th letter, 109 (g in ASCII) minus 97 equals 12
            // it returns 1 and 12 zeros = 1000000000000 (which is also the number 4096)
            int leftShift = 1 << val;
            /*
                An integer is represented as a sequence of bits in memory.
                For interaction with humans, the computer has to display it as decimal digits, but all the calculations
                are carried out as binary.
                123 in decimal is stored as 1111011 in memory.

                The & operator is a bitwise "And".
                The result is the bits that are turned on in both numbers.

                1001 & 1100 = 1000, since only the first bit is turned on in both.

                It will be nicer to look like this

                1001 &
                1100
                =
                1000

                Note that ones only appear in a place when both arguments have a one in that place.

             */
            int bitWiseAND = checker & leftShift;
            String leftShiftBinaryString = Integer.toBinaryString(leftShift);
            String checkerBinaryString = leftPad(Integer.toBinaryString(checker), leftShiftBinaryString.length());
            String leftShiftBinaryStringWithPad = leftPad(leftShiftBinaryString, checkerBinaryString.length());
//            System.out.printf("%s &\n%s\n=\n%s\n\n", checkerBinaryString, leftShiftBinaryStringWithPad, Integer.toBinaryString(bitWiseAND));
            /*
            in our example with string "abcdmc"
            0 &
            1
            =
            0

            01 &
            10
            =
            0

            011 &
            100
            =
            0

            0111 &
            1000
            =
            0

            0000000001111 &
            1000000000000
            =
            0

            1000000001111 &
            0000000000100
            =
            100
             */
//            System.out.println(c + "," + val + "," + Integer.toBinaryString(val) + "," + leftShift + "," + Integer.toBinaryString(leftShift) + "," + checker);
            /*
            char val valBinaryString leftShift leftShiftBinaryString checker
            a   0       0               1       1                       0
            b   1       1               2       10                      1
            c   2       10              4       100                     3
            d   3       11              8       1000                    7
            m   12      1100            4096    1000000000000           15
            c   2       10              4       100                     4111
             */
            if (bitWiseAND > 0) {
                return false;
            }
            // setting 1 on on the left shift
            /*
            0000000001111 |
            1000000000000
            =
            1000000001111
             */
            checker = checker | leftShift;
        }
        return true;
        /*
        If we can't use additional data structures, we can do the following:
        1. Compare every character of the string to every other character of the string.
            This will take 0( n 2 ) time and 0(1) space
        2. If we are allowed to modify the input string, we could sort the string in O(n log(n)) time and then linearly
            check the string for neighboring characters that are identical.
            Careful, though: many sorting algorithms take up extra space.

        These solutions are not as optimal in some respects, but might be better depending on the constraints of the problem.
         */
    }

    private static String leftPad(String s, int i) {
        StringBuilder sb = new StringBuilder(s);
        int charsToGo = i - sb.length();
        while (charsToGo > 0) {
            sb.insert(0, '0');
            charsToGo--;
        }
        return sb.toString();
    }

回答by asus

As referenced from 'Cracking the Coding Interview', an alternative solution exists:

正如“破解编码面试”中所引用的,存在一种替代解决方案:

boolean isUniqueChars(String str) {
  if(str.length() > 128) return false;

  boolean[] char_set = new boolean[128];
  for(int i = 0; i < str.length(); i++) {
    int val = str.charAt(i);

    if(char_set[val]) {
      return false;
    }
    char_set[val] = true;
  }
  return true;
}

Of course, to achieve better space complexity, please refer to the above example by @templatetypedef

当然,为了获得更好的空间复杂度,请参考@templatetypedef上面的例子

回答by Kshitij G

What if the string has special characters? For eg: what if the input is "**&^^&><:"?" Is there a way to do this in the main loop while parsing the string? of course I could put a check like below

如果字符串有特殊字符怎么办?例如:如果输入是“**&^^&><:”怎么办?在解析字符串的同时,有没有办法在主循环中做到这一点?当然我可以进行如下检查

for character in string_array:
    val = ord(character) - ord('A')
    if val<ord('A') and val>ord('z')
        return False

However, this will fail for character '^' as its ascii is 93. Just trying to figure out if a better solution exists to handle this

但是,这对于字符 '^' 将失败,因为它的 ascii 是 93。只是想弄清楚是否存在更好的解决方案来处理这个问题