C#: compare all rows in a DataTable - identify duplicate records

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/664021/

Date: 2020-08-04 12:24:21  Source: igfitidea

compare all rows in DataTable - identify duplicate records

Tags: c#, .net, asp.net, linq, normalization

Asked by kiev

I would like to normalize data in a DataTable, insertRows, that has no key. To do that I need to identify and mark duplicate records by finding their ID (import_id). Afterwards I will select only the distinct ones. The approach I am thinking of is to compare each row against all other rows in that DataTable.

The columns in the DataTable are not known at design time, and there is no key. Performance-wise, the table would have as many as 10k to 20k records and about 40 columns.

How do I accomplish this without sacrificing performance too much?

I attempted to use LINQ, but I did not know how to specify the where criteria dynamically. Here I am comparing first and last names in a loop for each row:

foreach (System.Data.DataRow lrows in importDataTable.Rows)
{
    IEnumerable<System.Data.DataRow> insertRows = importDataTable.Rows.Cast<System.Data.DataRow>();

    var col_matches =
    from irows in insertRows
    where
    String.Compare(irows["fname"].ToString(), lrows["fname"].ToString(), true).Equals(0)
    &&
    String.Compare(irows["last_name"].ToString(), lrows["last_name"].ToString(),true).Equals(0)

    select new { import_id = irows["import_id"].ToString() };
}

Any ideas are welcome. See also my similar question: How do I find similar column names using linq?

Accepted answer by Waylon Flinn

The easiest way to get this done without O(n²) complexity is to use a data structure that efficiently implements set operations, specifically a Contains operation. Fortunately .NET (as of 3.5) contains the HashSet object, which does this for you. To make use of it you will need a single object that encapsulates a row in your DataTable.

If DataRow won't work, I recommend converting the relevant fields of each record into strings, concatenating them, and placing the result in the HashSet. Before you insert a row, check whether the HashSet already contains it (using Contains). If it does, you've found a duplicate.
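
The string-key idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: the DuplicateFinder class, the FindDuplicates helper, the excludedColumns parameter, and the unit-separator character are all assumptions made for the example. It compares every column except the excluded ones, ignores case as the question's String.Compare calls do, and relies on HashSet.Add returning false for a key that is already present.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DuplicateFinder
{
    // Hypothetical helper: returns every row whose compared columns repeat an earlier row.
    public static List<DataRow> FindDuplicates(DataTable table, params string[] excludedColumns)
    {
        // Columns are discovered at run time; skip the excluded ones (e.g. the row id).
        var keyColumns = table.Columns.Cast<DataColumn>()
            .Where(c => !excludedColumns.Contains(c.ColumnName, StringComparer.OrdinalIgnoreCase))
            .ToList();

        // Case-insensitive keys, mirroring the ignoreCase comparison in the question.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var duplicates = new List<DataRow>();

        foreach (DataRow row in table.Rows)
        {
            // Join with a separator assumed not to occur in the data, so that
            // ("ab", "c") and ("a", "bc") do not collapse into the same key.
            string key = string.Join("\u001F",
                keyColumns.Select(c => Convert.ToString(row[c])));

            // Add returns false when the key was already in the set.
            if (!seen.Add(key))
            {
                duplicates.Add(row);
            }
        }

        return duplicates;
    }
}
```

Calling DuplicateFinder.FindDuplicates(importDataTable, "import_id") would return the repeated rows, from which the import_id values can be read. Note that DBNull and empty strings produce the same key in this sketch.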

Edit:

This method is O(n) in the number of rows, since each HashSet lookup and insert takes expected constant time.

Answer by Daniel Brückner

I am not sure if I understand the question correctly, but when dealing with System.Data.DataTable the following should work.

// Compare each pair of rows exactly once (O(n^2) in the number of rows).
for (Int32 r0 = 0; r0 < dataTable.Rows.Count; r0++)
{
   for (Int32 r1 = r0 + 1; r1 < dataTable.Rows.Count; r1++)
   {
      Boolean rowsEqual = true;

      // Two rows are equal only if every column value matches.
      for (Int32 c = 0; c < dataTable.Columns.Count; c++)
      {
         if (!Object.Equals(dataTable.Rows[r0][c], dataTable.Rows[r1][c]))
         {
            rowsEqual = false;
            break;
         }
      }

      if (rowsEqual)
      {
         Console.WriteLine(
            String.Format("Row {0} is a duplicate of row {1}.", r0, r1));
      }
   }
}

Answer by SqlRyan

I'm not too knowledgeable about LINQ, but can you use the .Distinct() operator?

http://blogs.msdn.com/charlie/archive/2006/11/19/linq-farm-group-and-distinct.aspx

Your question doesn't make clear whether you need to specifically identify duplicate rows, or whether you're just looking to remove them from your query. Adding "Distinct" would remove the extra instances, though it wouldn't necessarily tell you what they were.
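
Both readings can be sketched with LINQ to DataSet, as below. This is a hedged illustration rather than anything from the answer itself: the DistinctRows class and its two method names are made up for the example. It uses DataRowComparer.Default, which compares rows by the values of all their columns, so it assumes that whole-row equality is what counts as a duplicate. If a column such as import_id differs on every row, no duplicates will be found this way and a key built from the remaining columns is needed instead. String values are compared case-sensitively here.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DistinctRows
{
    // Hypothetical helper: one representative per set of value-equal rows.
    public static List<DataRow> WithoutDuplicates(DataTable table)
    {
        return table.AsEnumerable()
            .Distinct(DataRowComparer.Default)
            .ToList();
    }

    // Hypothetical helper: the groups of value-equal rows that have more than
    // one member, which shows which rows are the duplicates.
    public static List<List<DataRow>> DuplicateGroups(DataTable table)
    {
        return table.AsEnumerable()
            .GroupBy(row => row, DataRowComparer.Default)
            .Where(group => group.Count() > 1)
            .Select(group => group.ToList())
            .ToList();
    }
}
```

On older .NET Framework versions, AsEnumerable and DataRowComparer require a reference to System.Data.DataSetExtensions.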
