Original question: http://stackoverflow.com/questions/164144/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Compare two DataTables to determine rows in one but not the other
Asked by Jon
I have two DataTables, A and B, produced from CSV files. I need to be able to check which rows exist in B that do not exist in A.
Is there a way to do some sort of query to show the different rows or would I have to iterate through each row on each DataTable to check if they are the same? The latter option seems to be very intensive if the tables become large.
Accepted answer by Orion Edwards
would I have to iterate through each row on each DataTable to check if they are the same.
Seeing as you've loaded the data from a CSV file, you're not going to have any indexes or anything, so at some point, something is going to have to iterate through every row, whether it be your code, or a library, or whatever.
Anyway, this is an algorithms question, which is not my specialty, but my naive approach would be as follows:
1: Can you exploit any properties of the data? Are all the rows in each table unique, and can you sort them both by the same criteria? If so, you can do this:
- Sort both tables by their ID (using something like a quicksort). If they're already sorted then you win big.
- Step through both tables at once, skipping over any gaps in IDs in either table. Matching IDs mean the record exists in both tables.
This allows you to do it in (sort time * 2) + one pass, so if my big-O notation is correct, it'd be (whatever-sort-time) + O(m+n), which is pretty good.
(Revision: this is the approach that ΤΖΩΤΖΙΟΥ describes)
2: An alternative approach, which may be more or less efficient depending on how big your data is:
- Run through table 1, and for each row, stick its ID (or computed hashcode, or some other unique ID for that row) into a dictionary (or hashtable if you prefer to call it that).
- Run through table 2, and for each row, see if the ID (or hashcode etc.) is present in the dictionary. You're exploiting the fact that dictionaries have really fast lookup (O(1), I think). This step will be really fast, but you'll have paid the price of doing all those dictionary inserts.
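The two passes above could be sketched in C# roughly as follows. This is a minimal sketch, not the answerer's code: the class name IdSetDiff, the method RowsOnlyInB and the keyColumn parameter are hypothetical, and it assumes both tables share a key column whose string form identifies a row.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class IdSetDiff
{
    // Returns the rows of tableB whose key does not occur in tableA.
    // keyColumn is a hypothetical column name that both tables are assumed to have.
    public static List<DataRow> RowsOnlyInB(DataTable tableA, DataTable tableB, string keyColumn)
    {
        // Pass 1: remember every key seen in A.
        var keysInA = new HashSet<string>();
        foreach (DataRow row in tableA.Rows)
        {
            keysInA.Add(Convert.ToString(row[keyColumn]));
        }

        // Pass 2: keep the B rows whose key was never seen in A.
        var onlyInB = new List<DataRow>();
        foreach (DataRow row in tableB.Rows)
        {
            if (!keysInA.Contains(Convert.ToString(row[keyColumn])))
            {
                onlyInB.Add(row);
            }
        }
        return onlyInB;
    }
}
```

A HashSet is used instead of a Dictionary here because only membership matters; no value needs to be stored per key.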
I'd be really interested to see what people with better knowledge of algorithms than myself come up with for this one :-)
Answer by MusiGenesis
You can use the Merge and GetChanges methods on the DataTable to do this:
A.Merge(B); // this will add to A any records that are in B but not A
return A.GetChanges(); // returns records originally only in B
Answer by tzot
Just FYI:
Generally speaking about algorithms, comparing two sets of sortable items (as ids typically are) is not an O(M*N/2) operation, but O(M+N) if the two sets are ordered. So you scan one table with a pointer to the start of the other, and:
other_item = A.first()
only_in_B = empty_list()
for item in B:
    # advance through A until it catches up with the current B item
    while other_item < item:
        other_item = A.next()
        if A.eof():
            only_in_B.add(item and all the remaining B items)
            return only_in_B
    if item < other_item:
        only_in_B.append(item)
return only_in_B
The code above is obviously pseudocode, but should give you the general gist if you decide to code it yourself.
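One way the ordered scan might look in C# is sketched below. The names SortedDiff and OnlyInB are made up for illustration, and the sketch assumes both inputs are already sorted ascending by the same ordering that CompareTo uses.

```csharp
using System;
using System.Collections.Generic;

public static class SortedDiff
{
    // Returns the items of sortedB that do not occur in sortedA.
    // Assumption: both lists are sorted ascending, consistent with CompareTo.
    public static List<T> OnlyInB<T>(IReadOnlyList<T> sortedA, IReadOnlyList<T> sortedB)
        where T : IComparable<T>
    {
        var onlyInB = new List<T>();
        int a = 0; // position in sortedA; it only ever moves forward
        foreach (T item in sortedB)
        {
            // Advance through A while its current item is smaller than the B item.
            while (a < sortedA.Count && sortedA[a].CompareTo(item) < 0)
            {
                a++;
            }
            // A is exhausted, or its current item is larger: the B item is not in A.
            if (a == sortedA.Count || sortedA[a].CompareTo(item) > 0)
            {
                onlyInB.Add(item);
            }
        }
        return onlyInB;
    }
}
```

Because the position in A never moves backwards, the scan touches each element of both lists at most once, which is where the O(M+N) bound comes from.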
Answer by Jon Skeet
Assuming you have an ID column which is of an appropriate type (i.e. gives a hashcode and implements equality), string in this example. The code is slightly pseudocode because I'm not that familiar with DataTables and don't have time to look it all up just now :)
IEnumerable<string> idsInA = tableA.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> idsInB = tableB.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> bNotA = idsInB.Except(idsInA);
Answer by Robert Rossney
The answers so far assume that you're simply looking for duplicate primary keys. That's a pretty easy problem - you can use the Merge() method, for instance.
But I understand your question to mean that you're looking for duplicate DataRows. (From your description of the problem, with both tables being imported from CSV files, I'd even assume that the original rows didn't have primary key values, and that any primary keys are being assigned via AutoNumber during the import.)
The naive implementation (for each row in A, compare its ItemArray with that of each row in B) is indeed going to be computationally expensive.
A much less expensive way to do this is with a hashing algorithm. For each DataRow, concatenate the string values of its columns into a single string, and then call GetHashCode() on that string to get an int value. Create a Dictionary<int, DataRow> that contains an entry, keyed on the hash code, for each DataRow in DataTable B. Then, for each DataRow in DataTable A, calculate the hash code, and see if it's contained in the dictionary. If it's not, you know that the DataRow doesn't exist in DataTable B.
This approach has two weaknesses that both emerge from the fact that two strings can be unequal but produce the same hash code. If you find a row in A whose hash is in the dictionary, you then need to check the DataRow in the dictionary to verify that the two rows are really equal.
The second weakness is more serious: it's unlikely, but possible, that two different DataRows in B could hash to the same key value. For this reason, the dictionary should really be a Dictionary<int, List<DataRow>>, and you should perform the check described in the previous paragraph against each DataRow in the list.
It takes a fair amount of work to get this working, but it's an O(m+n) algorithm, which I think is going to be as good as it gets.
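A rough sketch of that bucketed approach follows. The names RowHashDiff, RowKey and RowsMissingFrom are invented for this example. It assumes both tables have the same columns in the same order, and that the unit-separator character used to join values never appears in the data.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class RowHashDiff
{
    // Concatenates the string form of every column value.
    // Assumption: the separator "\u001F" does not occur in the data itself.
    private static string RowKey(DataRow row)
    {
        return string.Join("\u001F", row.ItemArray.Select(v => Convert.ToString(v)));
    }

    // Returns the rows of "source" that have no equal row in "other".
    public static List<DataRow> RowsMissingFrom(DataTable source, DataTable other)
    {
        // Bucket the rows of "other" by hash code; one bucket may hold colliding rows.
        var buckets = new Dictionary<int, List<DataRow>>();
        foreach (DataRow row in other.Rows)
        {
            int hash = RowKey(row).GetHashCode();
            if (!buckets.TryGetValue(hash, out List<DataRow> bucket))
            {
                bucket = new List<DataRow>();
                buckets[hash] = bucket;
            }
            bucket.Add(row);
        }

        var missing = new List<DataRow>();
        foreach (DataRow row in source.Rows)
        {
            string key = RowKey(row);
            // A matching hash is only a hint; confirm against every row in the bucket.
            bool found = buckets.TryGetValue(key.GetHashCode(), out List<DataRow> candidates)
                && candidates.Any(c => RowKey(c) == key);
            if (!found)
            {
                missing.Add(row);
            }
        }
        return missing;
    }
}
```

Calling RowHashDiff.RowsMissingFrom(tableB, tableA) would give the rows that are in B but not in A. In practice a HashSet<string> keyed on the concatenated string would handle the collisions internally; the explicit buckets are kept here only to mirror the description above.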
Answer by Jon
Thanks for all the feedback.
Unfortunately, I do not have any indexes. I will give a little more information about my situation.
We have a reporting program (it replaced Crystal Reports) that is installed on 7 servers across the EU. These servers have many reports on them (not the same set for each country). The reports are invoked by a command-line application that uses XML files for its configuration, so one XML file can call multiple reports.
The command-line application is scheduled and controlled by our overnight process, so the XML file could be called from multiple places.
The goal of the CSV is to produce a list of all the reports that are being used and where they are being called from.
I am going through the XML files for all references, querying the scheduling program and producing a list of all the reports (this is not too bad).
The problem I have is that I have to keep a list of all the reports that might have been removed from production, so I need to compare the old CSV with the new data. For this I thought it best to put it into DataTables and compare the information. (This could be the wrong approach. I suppose I could create an object that holds each record, compare the differences, and then iterate through them.)
The data I have about each report is as follows:
- String: Task Name
- String: Action Name
- Int: ActionID (the Action ID can be in multiple records, as a single action can call many reports, i.e. an XML file)
- String: XML File called
- String: Report Name
I will try the Merge idea given by MusiGenesis (thanks). (Rereading some of the posts, I am not sure whether the Merge will work, but it is worth trying, as I had not heard of it before, so it is something new to learn.)
The HashCode idea sounds interesting as well.
Thanks for all the advice.
Answer by Jon
// Returns the rows of First that have no identical row in Second.
public DataTable compareDataTables(DataTable First, DataTable Second)
{
    First.TableName = "FirstTable";
    Second.TableName = "SecondTable";
    //Create Empty Table
    DataTable table = new DataTable("Difference");
    try
    {
        //Must use a Dataset to make use of a DataRelation object
        using (DataSet ds4 = new DataSet())
        {
            //Add tables
            ds4.Tables.AddRange(new DataTable[] { First.Copy(), Second.Copy() });
            //Get Columns for DataRelation
            DataColumn[] firstcolumns = new DataColumn[ds4.Tables[0].Columns.Count];
            for (int i = 0; i < firstcolumns.Length; i++)
            {
                firstcolumns[i] = ds4.Tables[0].Columns[i];
            }
            DataColumn[] secondcolumns = new DataColumn[ds4.Tables[1].Columns.Count];
            for (int i = 0; i < secondcolumns.Length; i++)
            {
                secondcolumns[i] = ds4.Tables[1].Columns[i];
            }
            //Create DataRelation over all columns, without constraints
            DataRelation r = new DataRelation(string.Empty, firstcolumns, secondcolumns, false);
            ds4.Relations.Add(r);
            //Create columns for return table
            for (int i = 0; i < First.Columns.Count; i++)
            {
                table.Columns.Add(First.Columns[i].ColumnName, First.Columns[i].DataType);
            }
            //If First Row not in Second, Add to return table.
            table.BeginLoadData();
            foreach (DataRow parentrow in ds4.Tables[0].Rows)
            {
                DataRow[] childrows = parentrow.GetChildRows(r);
                if (childrows == null || childrows.Length == 0)
                    table.LoadDataRow(parentrow.ItemArray, true);
            }
            table.EndLoadData();
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
    return table;
}
Answer by Jon
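// Compares the two tables cell by cell, position by position,
// and reports the row and column index of every cell that differs.
try
{
    if (ds.Tables[0].Columns.Count == ds1.Tables[0].Columns.Count)
    {
        // Only compare rows that exist in both tables.
        int rowCount = Math.Min(ds.Tables[0].Rows.Count, ds1.Tables[0].Rows.Count);
        for (int i = 0; i < rowCount; i++)
        {
            for (int j = 0; j < ds.Tables[0].Columns.Count; j++)
            {
                if (ds.Tables[0].Rows[i][j].ToString() != ds1.Tables[0].Rows[i][j].ToString())
                {
                    MessageBox.Show(i.ToString() + "," + j.ToString());
                }
            }
        }
    }
    else
    {
        MessageBox.Show("Table has different columns ");
    }
}
catch (Exception)
{
    MessageBox.Show("Please select The Table");
}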
Answer by Ying
I'm continuing tzot's idea ...
If you have two sortable sets, then you can just use:
List<string> diffList = new List<string>(sortedListA.Except(sortedListB));
If you need more complicated objects, you can define your own equality comparer (an IEqualityComparer<T>) and pass it to Except.
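As an illustration, here is a minimal sketch of Except with a custom comparer. The ReportEntry type, its two properties and the ReportDiff helper are hypothetical, loosely modelled on the report records described earlier in the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; the property names are assumptions for this example.
public sealed class ReportEntry
{
    public string XmlFile { get; set; }
    public string ReportName { get; set; }
}

// Treats two entries as equal when both fields match.
public sealed class ReportEntryComparer : IEqualityComparer<ReportEntry>
{
    public bool Equals(ReportEntry x, ReportEntry y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.XmlFile == y.XmlFile && x.ReportName == y.ReportName;
    }

    public int GetHashCode(ReportEntry entry)
    {
        // Equal entries must produce equal hash codes.
        unchecked
        {
            return ((entry.XmlFile ?? "").GetHashCode() * 397)
                ^ (entry.ReportName ?? "").GetHashCode();
        }
    }
}

public static class ReportDiff
{
    // Entries present in oldEntries but absent from newEntries.
    public static List<ReportEntry> Removed(
        IEnumerable<ReportEntry> oldEntries, IEnumerable<ReportEntry> newEntries)
    {
        return oldEntries.Except(newEntries, new ReportEntryComparer()).ToList();
    }
}
```

Note that Except is a set operation: it does not need sorted input, and it returns each distinct element only once.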
Answer by NewCsharper
I found an easy way to solve this. Unlike the previous answers that use Except, I call Except twice. This tells you not only which rows were deleted but also which rows were added. If you call Except only once, it will tell you only one of those differences, not both. This code is tested and works. See below.
//Pass in your two datatables into your method
//build the queries based on id.
var qry1 = datatable1.AsEnumerable().Select(a => new { ID = a["ID"].ToString() });
var qry2 = datatable2.AsEnumerable().Select(b => new { ID = b["ID"].ToString() });
//detect row deletes - a row is in datatable1 except missing from datatable2
var exceptAB = qry1.Except(qry2);
//detect row inserts - a row is in datatable2 except missing from datatable1
var exceptAB2 = qry2.Except(qry1);
Then execute your code against the results:
if (exceptAB.Any())
{
    foreach (var id in exceptAB)
    {
        //execute code here
    }
}
if (exceptAB2.Any())
{
    foreach (var id in exceptAB2)
    {
        //execute code here
    }
}