C#: What is the best way to remove duplicates from a DataTable?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/340223/

Time: 2020-08-03 23:49:26  Source: igfitidea

What is the best way to remove duplicates from a datatable?

Tags: c#, datatable, duplicates

Asked by Khaja Minhajuddin

I have checked the whole site and googled on the net but was unable to find a simple solution to this problem.

I have a DataTable with about 20 columns and 10K rows. I need to remove the duplicate rows in this DataTable based on 4 key columns. Doesn't .NET have a function that does this? The closest function I found was datatable.DefaultView.ToTable(true, array of columns to display), but this function does a distinct on all the columns.

It would be great if someone could help me with this.

EDIT: I am sorry for not being clear on this. This datatable is being created by reading a CSV file and not from a DB. So using an SQL query is not an option.

Accepted answer by Eduardo Campañó

You can use LINQ to DataSet. Something like this:

// Fill the DataSet.
DataSet ds = new DataSet();
ds.Locale = CultureInfo.InvariantCulture;
FillDataSet(ds);

List<DataRow> rows = new List<DataRow>();

DataTable contact = ds.Tables["Contact"];

// Get 100 rows from the Contact table.
IEnumerable<DataRow> query = (from c in contact.AsEnumerable()
                              select c).Take(100);

DataTable contactsTableWith100Rows = query.CopyToDataTable();

// Add 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

// Create duplicate rows by adding the same 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

DataTable table =
    System.Data.DataTableExtensions.CopyToDataTable<DataRow>(rows);

// Find the unique contacts in the table.
IEnumerable<DataRow> uniqueContacts =
    table.AsEnumerable().Distinct(DataRowComparer.Default);

Console.WriteLine("Unique contacts:");
foreach (DataRow uniqueContact in uniqueContacts)
{
    Console.WriteLine(uniqueContact.Field<Int32>("ContactID"));
}

Answer by Samiksha

Use a query instead of functions:

DELETE FROM table1 AS tb1 INNER JOIN 
(SELECT id, COUNT(id) AS cntr FROM table1 GROUP BY id) AS tb2
ON tb1.id = tb2.id WHERE tb2.cntr > 1

Answer by liggett78

See "How can I remove duplicate rows?" (adjust the query there to join on your 4 key columns).

EDIT: with your new information I believe the easiest way would be to implement IEqualityComparer<T> and use Distinct on your data rows. Otherwise if you're working with IEnumerable/IList instead of DataTable/DataRow, it is certainly possible with some LINQ-to-objects kung-fu.

EDIT: example IEqualityComparer

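using System;
using System.Collections.Generic;
using System.Data;

public class MyRowComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        // Extend this to include all 4 of your key columns.
        return x.Field<int>("ID") == y.Field<int>("ID") &&
            string.Equals(x.Field<string>("Name"), y.Field<string>("Name"),
                          StringComparison.OrdinalIgnoreCase);
    }

    public int GetHashCode(DataRow obj)
    {
        // Must agree with Equals: hash the same columns, and hash strings
        // case-insensitively because Equals ignores case.
        string name = obj.Field<string>("Name") ?? string.Empty;
        return obj.Field<int>("ID").GetHashCode() ^
            StringComparer.OrdinalIgnoreCase.GetHashCode(name);
    }
}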

You can use it like this:

var uniqueRows = myTable.AsEnumerable().Distinct(new MyRowComparer());
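
The same idea can be generalised so that the key columns are passed in rather than hard-coded. The following is only a sketch: KeyColumnComparer and DedupeDemo.RemoveDuplicates are hypothetical names, not part of .NET, and values are compared with object.Equals, so string keys are compared case-sensitively here. In practice Distinct yields the first row seen for each key, but its documentation describes the result as unordered.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical helper: treats two rows as equal when all listed key columns match.
public sealed class KeyColumnComparer : IEqualityComparer<DataRow>
{
    private readonly string[] _keys;

    public KeyColumnComparer(params string[] keyColumns)
    {
        _keys = keyColumns;
    }

    public bool Equals(DataRow x, DataRow y)
    {
        // Cell values are never null (missing values are DBNull.Value),
        // so object.Equals is safe for every column type.
        return _keys.All(k => object.Equals(x[k], y[k]));
    }

    public int GetHashCode(DataRow row)
    {
        unchecked
        {
            int hash = 17;
            foreach (string k in _keys)
                hash = hash * 31 + row[k].GetHashCode();
            return hash;
        }
    }
}

public static class DedupeDemo
{
    // Returns a new table with one row per distinct key combination.
    public static DataTable RemoveDuplicates(DataTable source, params string[] keyColumns)
    {
        List<DataRow> unique = source.AsEnumerable()
            .Distinct(new KeyColumnComparer(keyColumns))
            .ToList();

        // CopyToDataTable throws on an empty sequence; Clone() copies only the schema.
        return unique.Count == 0 ? source.Clone() : unique.CopyToDataTable();
    }
}
```

Usage would look like DedupeDemo.RemoveDuplicates(myTable, "Key1", "Key2", "Key3", "Key4"), where the column names are placeholders for your own 4 key columns.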

Answer by terjetyl

If you have access to LINQ, I think you should be able to use the built-in grouping functionality on the in-memory collection and pick out the duplicate rows.

Search Google for "LINQ group by" for examples.
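
A minimal sketch of the grouping approach, assuming the 4 key columns are string columns named Key1 to Key4 (placeholder names, as are GroupByDedupe and KeepFirstPerKey). It groups rows by an anonymous type built from the keys and keeps the first row of each group; which row to keep is a design choice you may want to change.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class GroupByDedupe
{
    // Keeps the first row for each distinct (Key1, Key2, Key3, Key4) combination.
    public static DataTable KeepFirstPerKey(DataTable source)
    {
        List<DataRow> firstRows = source.AsEnumerable()
            // Anonymous types compare by value, so they work as composite keys.
            .GroupBy(r => new
            {
                K1 = r.Field<string>("Key1"),
                K2 = r.Field<string>("Key2"),
                K3 = r.Field<string>("Key3"),
                K4 = r.Field<string>("Key4")
            })
            .Select(g => g.First())
            .ToList();

        // CopyToDataTable throws on an empty sequence; Clone() copies only the schema.
        return firstRows.Count == 0 ? source.Clone() : firstRows.CopyToDataTable();
    }
}
```

To list only the duplicated keys instead, the Select could be replaced with a Where(g => g.Count() > 1).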

Answer by Eduardo Campañó

Liggett78's answer is much better - esp. as mine had an error! Correction as follows...

DELETE TableWithDuplicates
    FROM TableWithDuplicates
        LEFT OUTER JOIN (
            SELECT PK_ID = Min(PK_ID), --Decide your method for deciding which rows to keep
                KeyColumn1,
                KeyColumn2,
                KeyColumn3,
                KeyColumn4
                FROM TableWithDuplicates
                GROUP BY KeyColumn1,
                    KeyColumn2,
                    KeyColumn3,
                    KeyColumn4
            ) AS RowsToKeep
            ON TableWithDuplicates.PK_ID = RowsToKeep.PK_ID
    WHERE RowsToKeep.PK_ID IS NULL

Answer by Treb

Found this on bytes.com:

You can use the JET 4.0 OLE DB provider with the classes in the System.Data.OleDb namespace to access the comma delimited text file (using a DataSet/DataTable).

Or you could use Microsoft Text Driver for ODBC with the classes in the System.Data.Odbc namespace to access the file using ODBC drivers.

That would allow you to access your data via SQL queries, as others proposed.

Answer by JeeBee

"This datatable is being created by reading a CSV file and not from a DB."

So put a unique constraint on the four columns in the database, and rows that count as duplicates under your design won't be inserted. The import may fail instead of continuing when this happens, but that should be configurable in your CSV import script.
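
Since the question says there is no database, the closest in-memory analogue is a UniqueConstraint on the DataTable itself. This is a swapped-in technique rather than what the answer describes, shown as a sketch: UniqueConstraintDemo and LoadSkippingDuplicates are hypothetical names, and records stands in for whatever your CSV reader produces. Catching an exception per duplicate is slow if duplicates are common.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class UniqueConstraintDemo
{
    // Adds each record to the table and skips those that violate the unique key.
    // Returns the number of skipped (duplicate) records.
    public static int LoadSkippingDuplicates(
        DataTable table, IEnumerable<object[]> records, params string[] keyColumns)
    {
        DataColumn[] keys = keyColumns.Select(name => table.Columns[name]).ToArray();
        table.Constraints.Add(new UniqueConstraint(keys));

        int skipped = 0;
        foreach (object[] record in records)
        {
            try
            {
                table.Rows.Add(record);
            }
            catch (ConstraintException)
            {
                // The row duplicates an existing key combination and was not added.
                skipped++;
            }
        }
        return skipped;
    }
}
```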

Answer by Srikanth V M

This is very simple code that requires neither LINQ nor individual column names to do the filtering. If all the column values in a row are null, the row is deleted.



public DataSet duplicateRemoval(DataSet dSet)
{
    // Note: this copies every row that has at least one non-empty value into a
    // new table whose columns are all strings; rows that are entirely empty are dropped.
    bool flag;
    int ccount = dSet.Tables[0].Columns.Count;
    string[] colst = new string[ccount];
    int p = 0;

    DataSet dsTemp = new DataSet();
    DataTable Tables = new DataTable();
    dsTemp.Tables.Add(Tables);

    // Recreate the source columns by name, typed as string.
    for (int i = 0; i < ccount; i++)
    {
        dsTemp.Tables[0].Columns.Add(dSet.Tables[0].Columns[i].ColumnName, typeof(string));
    }

    foreach (System.Data.DataRow row in dSet.Tables[0].Rows)
    {
        flag = false;
        p = 0;
        foreach (System.Data.DataColumn col in dSet.Tables[0].Columns)
        {
            colst[p++] = row[col].ToString();
            if (!string.IsNullOrEmpty(row[col].ToString()))
            {
                // Keep the row only if at least one column contains data.
                flag = true;
            }
        }
        if (flag)
        {
            DataRow myRow = dsTemp.Tables[0].NewRow();
            for (int kk = 0; kk < ccount; kk++)
            {
                myRow[kk] = colst[kk];
            }
            dsTemp.Tables[0].Rows.Add(myRow);
        }
    }

    return dsTemp;
}


This can even be used to remove null data from an Excel sheet.

Answer by Alexey

Note that Table.AcceptChanges() must be called to complete the deletion. Otherwise the deleted row is still present in the DataTable with its RowState set to Deleted, and Table.Rows.Count does not change after the deletion.
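
A small sketch of this behaviour (AcceptChangesDemo and Run are hypothetical names). One caveat worth knowing: it applies to rows whose changes were already accepted. A row that is still in the Added state, as rows freshly loaded from a CSV usually are, is removed from the table immediately when Delete() is called.

```csharp
using System;
using System.Data;

public static class AcceptChangesDemo
{
    // Returns the row state after Delete(), and the row counts before and after AcceptChanges().
    public static (DataRowState State, int CountBefore, int CountAfter) Run()
    {
        var table = new DataTable();
        table.Columns.Add("ID", typeof(int));
        table.Rows.Add(1);
        table.Rows.Add(2);
        table.AcceptChanges();               // both rows are now Unchanged

        table.Rows[1].Delete();              // only marks the row as Deleted
        DataRowState state = table.Rows[1].RowState;
        int countBefore = table.Rows.Count;  // the deleted row is still counted

        table.AcceptChanges();               // physically removes the deleted row
        int countAfter = table.Rows.Count;

        return (state, countBefore, countAfter);
    }
}
```

If the pending-change tracking is not needed, table.Rows.Remove(row) removes the row from the collection right away.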

Answer by Suhas Patil

Try this

Suppose dtInput is your DataTable with duplicate records.

dtFinal is a new DataTable that will hold the rows with the duplicates filtered out.

So my code will be something like below.

DataTable dtFinal = dtInput.DefaultView.ToTable(true,
                           new string[] { "Col1Name", "Col2Name", "Col3Name", /* ... */ "ColnName" });
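
One thing to keep in mind with this approach: DataView.ToTable(true, columns) returns a table containing only the listed columns, with distinct applied across them. Passing just the 4 key columns therefore drops the other columns, and passing all columns makes the distinct apply to every column. A sketch that makes this visible (ToTableDemo and DistinctKeys are hypothetical names):

```csharp
using System.Data;

public static class ToTableDemo
{
    // Returns the distinct combinations of the given columns.
    // The result contains ONLY those columns, not the rest of the row.
    public static DataTable DistinctKeys(DataTable dtInput, params string[] keyColumns)
    {
        return dtInput.DefaultView.ToTable(true, keyColumns);
    }
}
```

For a table with columns K1, K2 and Other, DistinctKeys(table, "K1", "K2") yields a two-column table with one row per distinct (K1, K2) pair.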