Windows C# - Remove duplicate lines within a text file

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/6387675/

Date: 2020-09-15 17:05:14  Source: igfitidea

C# - Remove duplicate lines within a text file

c# .net windows

Asked by Michael

Could someone demonstrate how to check a file for duplicate lines, and then remove those duplicates, either by overwriting the existing file or by creating a new file with the duplicate lines removed?


Answered by LukeH

If you're using .NET 4 then you could use a combination of File.ReadLines and File.WriteAllLines:


var previousLines = new HashSet<string>();

File.WriteAllLines(destinationPath, File.ReadLines(sourcePath)
                                        .Where(line => previousLines.Add(line)));

This functions in pretty much the same way as LINQ's Distinct method, with one important difference: the output of Distinct isn't guaranteed to be in the same order as the input sequence. Using a HashSet<T> explicitly does provide this guarantee.

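To make the ordering guarantee concrete, here is a small in-memory sketch (with made-up sample values, not from the original answer) showing that the HashSet-based filter keeps the first occurrence of each item in its original position:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class OrderDemo
{
    static void Main()
    {
        var lines = new[] { "b", "a", "b", "c", "a" };

        // HashSet<string>.Add returns false for items already seen,
        // so Where keeps only the first occurrence of each line.
        var seen = new HashSet<string>();
        var result = lines.Where(line => seen.Add(line)).ToArray();

        Console.WriteLine(string.Join(",", result)); // b,a,c
    }
}
```

The same Where/Add pattern is what the file-based snippet above streams through File.ReadLines, so it never loads the whole file into memory at once.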

Answered by Blindy

File.WriteAllLines(topath, File.ReadAllLines(frompath).Distinct().ToArray());

Edit: modified to work in .NET 3.5


Answered by Factor Mystic

How big of a file are we talking?


One strategy could be to read the lines one at a time and load them into a data structure that you can easily check for an existing item, such as a HashSet<int>. I know that I can hash each string line of the file using GetHashCode() (used internally to check string equality, which is what we want to determine duplicates) and just check for known hashes. (Strictly speaking, two different strings can share a hash code, so in a rare collision a non-duplicate line would be dropped; store the strings themselves in a HashSet<string> if that matters.) So, something like


var known = new HashSet<int>();
using (var dupe_free = new StreamWriter(@"c:\path\to\dupe_free.txt"))
{
    foreach (var line in File.ReadLines(@"c:\path\to\has_dupes.txt"))
    {
        var hash = line.GetHashCode();
        if (!known.Contains(hash))
        {
            known.Add(hash);
            dupe_free.WriteLine(line);
        }
    }
}

Alternately, you can take advantage of LINQ's Distinct() method and do it in one line, as Blindy suggested:


File.WriteAllLines(@"c:\path\to\dupe_free.txt", File.ReadAllLines(@"c:\path\to\has_dupes.txt").Distinct().ToArray());

Answered by Devendra D. Chavan

// Requires .NET 3.5
private void RemoveDuplicate(string sourceFilePath, string destinationFilePath)
{
    var readLines = File.ReadAllLines(sourceFilePath, Encoding.Default);

    File.WriteAllLines(destinationFilePath, readLines.Distinct().ToArray(), Encoding.Default);
}

Answered by mrK

Pseudocode:


open file reading only

List<string> list = new List<string>();

for each line in the file:
    if(!list.contains(line)):
        list.append(line)

close file
open file for writing

for each string in list:
    file.write(string);
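The pseudocode above maps to C# roughly as follows. This is a sketch, not part of the original answer; the file path and its contents are made up for the demo, and the List-based membership test is O(n) per line (O(n²) overall), so the HashSet approaches in the other answers scale better for large files:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class DedupeDemo
{
    static void Main()
    {
        // Set up a sample file with duplicate lines (hypothetical path).
        var path = Path.Combine(Path.GetTempPath(), "dedupe_demo.txt");
        File.WriteAllLines(path, new[] { "one", "two", "one", "three", "two" });

        // "open file reading only": collect each line not already in the list.
        var list = new List<string>();
        foreach (var line in File.ReadAllLines(path))
        {
            if (!list.Contains(line))
                list.Add(line);
        }

        // "open file for writing": overwrite with the de-duplicated lines.
        File.WriteAllLines(path, list.ToArray());

        Console.WriteLine(string.Join(",", File.ReadAllLines(path))); // one,two,three
    }
}
```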