Find duplicate lines in a large file in Java

Question

Asked by walterudoing

So, I have a large file containing 3 million lines of words, and I need to see if there are any duplicates.

I put the lines in a TreeMap so that they are sorted, using each line as the key and giving it the value 1. When a line is a duplicate, its value is incremented. Afterwards I have to check whether any value is not 1.

Here is my code:

    BufferedReader list = new BufferedReader( new FileReader( args[0] ) );
    String line;
    TreeMap<String,Integer> map  = new TreeMap<String,Integer>();

    while ( (line = list.readLine()) != null )
    {
        if (!map.containsKey(line)) 
        {
            map.put(line, 0);
        }
        map.put(line, map.get(line) + 1);   
    }

    if ( !map.containsKey(1)  )
    {
        System.out.print("NOT UNIQUE");
    }
    else
    {
        System.out.print("UNIQUE");
    }
    list.close();
}

Question:

Will using a TreeMap speed up the process, or would a HashMap be just as fast or faster?

The output:

    Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
        at java.lang.Integer.compareTo(Integer.java:52)
        at java.util.TreeMap.getEntry(TreeMap.java:346)
        at java.util.TreeMap.containsKey(TreeMap.java:227)
        at Lab10.main(Lab10.java:22)

Line 22 is if ( !map.containsKey(1) ), but I don't know what went wrong.

Answer 1

Accepted answer by Norbert Radyk

The most efficient implementation really depends on your requirements.

From what you've written ("I have a large file containing 3 million lines of words, and I need to see if there are any duplicates"), I assume you only want to check whether a duplicate line exists.

In that case you don't need to count how many duplicates there are. A HashSet, which relies on the good old String hashing function, might be good enough (or even better).

Here's the example:

    // Assumes list is the BufferedReader opened in the question.
    boolean hasDuplicate = false;
    String line;
    Set<String> lines = new HashSet<String>();
    while ( !hasDuplicate && (line = list.readLine()) != null )
    {
        // add() returns false when the line is already in the set
        if (!lines.add(line)) {
            hasDuplicate = true;
        }
    }
    list.close();

    if (hasDuplicate) {
        System.out.print("NOT UNIQUE");
    } else {
        System.out.print("UNIQUE");
    }

Answer 2

Answer by yoyosir

The keys in your map are Strings, so you cannot look up an Integer key; that is what triggers the ClassCastException. To avoid the exception, try:

if ( !map.containsKey("" + 1)  )

Note that this only tests whether the text "1" occurs as a line; it does not inspect the counts. If you are just trying to find a duplicate, maybe you can do this instead:

    boolean flag = false;
    while ( (line = list.readLine()) != null )
    {
        if (!map.containsKey(line))
        {
            map.put(line, 0);
        }
        else
        {
            // line was seen before, so stop reading
            flag = true;
            break;
        }
    }
    list.close();

    if (flag)
    {
        System.out.print("NOT UNIQUE");
    }
    else
    {
        System.out.print("UNIQUE");
    }

Also, since you are using only the keys and not the values, you can use a HashSet instead.

Answer 3

Answer by kriyeta

Since you are simply inserting each line with its occurrence count and later retrieving the entries one by one, you don't need a sorted map; you can use a HashMap.

And since the key type is String, an Integer cannot be passed as the key.

I think you want to know whether a line occurs exactly once, so you can try:

    if (map.get(line) != 1)
    {
        System.out.print("NOT UNIQUE");
    }
    else
    {
        System.out.print("UNIQUE");
    }
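
To make the counting approach concrete, here is a minimal sketch, assuming the lines are already available as an Iterable<String>. The class and method names (LineCounter, countLines, hasDuplicates) are hypothetical and not from the original post. It counts occurrences in a HashMap and then inspects the values, rather than looking up a key of the wrong type.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names are illustrative assumptions.
class LineCounter {

    // Counts how often each line occurs. A HashMap is enough here
    // because the lines never need to be visited in sorted order.
    static Map<String, Integer> countLines(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Stores 1 for a new line, otherwise adds 1 to the existing count.
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    // Returns true if at least one line occurs more than once.
    static boolean hasDuplicates(Map<String, Integer> counts) {
        for (int count : counts.values()) {
            if (count != 1) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("apple", "pear", "apple");
        Map<String, Integer> counts = countLines(sample);
        System.out.print(hasDuplicates(counts) ? "NOT UNIQUE" : "UNIQUE");
    }
}
```

Checking the values after the loop is what the question's containsKey(1) call was presumably meant to do. If only a yes/no answer is needed, stopping at the first repeated line avoids reading the rest of the file.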

Answer 4

Answer by craftsmannadeem

This is a well-known problem called the count-distinct problem. There are various algorithms for it.

In Java, you can use BitSet.
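
The answer does not say how BitSet would be applied, so the following is only one possible reading, sketched under stated assumptions: use two bit arrays indexed by each line's hash to find hash slots that are hit more than once, then store in a HashSet only the lines that fall into those slots. The class name, method name, and the SLOTS size are hypothetical. The input is assumed to be iterable twice; for a file that would mean reading it twice.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the answer author's method.
class BitSetPrefilter {

    // Number of hash slots; an arbitrary size chosen for illustration.
    private static final int SLOTS = 1 << 24;

    static boolean hasDuplicate(Iterable<String> lines) {
        BitSet seenOnce = new BitSet(SLOTS);
        BitSet seenTwice = new BitSet(SLOTS);

        // Pass 1: mark every slot, and remember slots that are hit again.
        for (String line : lines) {
            int slot = Math.floorMod(line.hashCode(), SLOTS);
            if (seenOnce.get(slot)) {
                seenTwice.set(slot);
            } else {
                seenOnce.set(slot);
            }
        }
        if (seenTwice.isEmpty()) {
            return false; // every line had its own slot, so all lines differ
        }

        // Pass 2: only lines in repeated slots can be duplicates.
        // A repeated slot may also be a hash collision, so compare the text.
        Set<String> candidates = new HashSet<>();
        for (String line : lines) {
            int slot = Math.floorMod(line.hashCode(), SLOTS);
            if (seenTwice.get(slot) && !candidates.add(line)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean result = hasDuplicate(Arrays.asList("apple", "pear", "apple"));
        System.out.print(result ? "NOT UNIQUE" : "UNIQUE");
    }
}
```

The trade-off is a second pass over the data in exchange for keeping fewer strings in memory. Whether that pays off depends on how many lines share a hash slot.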

Answer 5

Answer by Shaini Sinha

All you need to know is that a Set in Java doesn't allow duplicates. If you have already added an element to a Set and try to insert an equal element again, the Set is left unchanged. In Java, you can use the HashSet class to solve this problem: loop over the array elements, insert each one into the HashSet with add(), and check the return value. If add() returns false, the element was already in the Set, and that is your duplicate. Here is a code sample that does this:

    for (String name : names) {
        if (!set.add(name)) {
            // name is a duplicate
        }
    }

The time complexity of this solution is O(n) on average, because you go through the array only once. Its space complexity is also O(n), because the HashSet holds every unique element. So if an array contains 1 million elements, in the worst case you need a HashSet that stores all 1 million of them.
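
As a self-contained version of the snippet above, here is a minimal sketch that stops at the first repeated element. The class name FirstDuplicate and the method find are hypothetical, and the elements are assumed to be non-null.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper class; the names are illustrative assumptions.
class FirstDuplicate {

    // Returns the first element that was already seen, or an empty
    // Optional if every element is unique. Assumes non-null elements.
    static Optional<String> find(Iterable<String> names) {
        Set<String> seen = new HashSet<>();
        for (String name : names) {
            // add() returns false when the set already contains name.
            if (!seen.add(name)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> duplicate = find(Arrays.asList("ann", "bob", "ann"));
        System.out.print(duplicate.isPresent() ? "NOT UNIQUE" : "UNIQUE");
    }
}
```

Returning early means the set only grows until the first duplicate is found, so the O(n) space bound is reached only when the input has no duplicates at all.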

Answer 6

Answer by Anu

The ClassCastException occurs because the data types differ: the map's keys are Strings, but the lookup passes an Integer. A TreeMap has to compare keys with one another, so it does not support keys of mixed types.
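
The difference can be reproduced with a small sketch. The class name KeyTypeDemo and the helper throwsOnIntegerKey are hypothetical. It calls containsKey(1) on a non-empty TreeMap<String, Integer> and on a HashMap<String, Integer> and reports whether a ClassCastException was thrown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; the names are illustrative assumptions.
class KeyTypeDemo {

    // Looks up an Integer key in a map whose keys are Strings and
    // reports whether the lookup threw a ClassCastException.
    static boolean throwsOnIntegerKey(Map<String, Integer> map) {
        try {
            map.containsKey(1); // compiles because containsKey takes Object
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> tree = new TreeMap<>();
        tree.put("word", 1);
        Map<String, Integer> hash = new HashMap<>();
        hash.put("word", 1);

        // TreeMap compares the Integer with its String keys, which fails.
        System.out.println("TreeMap throws: " + throwsOnIntegerKey(tree));
        // HashMap only uses hashCode() and equals(), so it finds no match.
        System.out.println("HashMap throws: " + throwsOnIntegerKey(hash));
    }
}
```

Switching to a HashMap therefore makes the exception disappear, but the lookup still does not answer the original question, because it searches the keys (the lines) rather than the counts.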
