Python: Compare two CSV files and print out differences

Notice: this page is derived from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/38996033/

Time: 2020-08-19 21:44:44  Source: igfitidea

python csv

Asked by Nick Yellow

I need to compare two CSV files and write the differences to a third CSV file. In my case, the first CSV, old.csv, is an old list of hashes, and the second CSV is the new list, which contains both the old and the new hashes.

Here is my code:

import csv
t1 = open('old.csv', 'r')
t2 = open('new.csv', 'r')
fileone = t1.readlines()
filetwo = t2.readlines()
t1.close()
t2.close()

outFile = open('update.csv', 'w')
x = 0
for i in fileone:
    if i != filetwo[x]:
        outFile.write(filetwo[x])
    x += 1
outFile.close()

The third file ends up as a copy of the old one instead of containing only the update. What's wrong? I hope you can help me, many thanks!

PS: I don't want to use diff.

Answered by Chris Mueller

The problem is that you are comparing each line in fileone to the line at the same position in filetwo. As soon as one file contains an extra line, the two files are out of step and the lines at matching positions never agree again. Try this:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
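
To see why the positional comparison breaks while the membership test does not, here is a minimal sketch on in-memory lists instead of files. The function names positional_diff and membership_diff and the sample hashes are made up for illustration. The set is an optional speed-up that the answer above does not use; it assumes the output should keep the line order of new.csv.

```python
def positional_diff(old_lines, new_lines):
    """Roughly mimic the question's loop: compare line i of old with line i of new."""
    return [new for old, new in zip(old_lines, new_lines) if old != new]


def membership_diff(old_lines, new_lines):
    """Keep every line of new that appears nowhere in old."""
    seen = set(old_lines)  # set lookup is O(1) on average; a list scan is O(n)
    return [line for line in new_lines if line not in seen]


old = ["aaa\n", "bbb\n", "ccc\n"]
new = ["aaa\n", "xxx\n", "bbb\n", "ccc\n"]  # "xxx" inserted in the middle

print(positional_diff(old, new))   # ['xxx\n', 'bbb\n']
print(membership_diff(old, new))   # ['xxx\n']
```

After the inserted line, the positional version also reports bbb, an old line, and misses nothing only by accident; the membership version reports just the new line.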

Answered by seler

Detecting differences with sets feels natural.

#!/usr/bin/env python3

import sys
import argparse
import csv


def get_dataset(f):
    return set(map(tuple, csv.reader(f)))


def main(f1, f2, outfile, sorting_column):
    set1 = get_dataset(f1)
    set2 = get_dataset(f2)
    different = set1 ^ set2

    output = csv.writer(outfile)

    for row in sorted(different, key=lambda x: x[sorting_column], reverse=True):
        output.writerow(row)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('infile', nargs=2, type=argparse.FileType('r'))
    parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'), default=sys.stdout)
    parser.add_argument('-sc', '--sorting-column', nargs='?', type=int, default=0)

    args = parser.parse_args()

    main(*args.infile, args.outfile, args.sorting_column)
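
Note that the ^ operator is the symmetric difference, so the script above reports rows that exist in exactly one of the two files. If only the rows added in the new file are wanted, as in the question, plain set difference may be the closer fit. A small sketch using io.StringIO in place of real files (the sample data and the helper name rows are assumptions for illustration):

```python
import csv
import io


def rows(text):
    """Parse CSV text into a set of row tuples, like get_dataset above."""
    return set(map(tuple, csv.reader(io.StringIO(text))))


old = rows("h1,alice\nh2,bob\n")
new = rows("h1,alice\nh3,carol\n")

only_new = new - old   # rows added in the new file
either = old ^ new     # rows present in exactly one of the two files

print(sorted(only_new))  # [('h3', 'carol')]
print(sorted(either))    # [('h2', 'bob'), ('h3', 'carol')]
```

Because sets are unordered and drop duplicate rows, any output order has to be imposed by sorting, which is what the sorting-column option in the script is for.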

Answered by AdrienW

I assumed your new file was just like your old one, except that some lines were added in between the old ones. The old lines in both files are stored in the same order.

Try this:

with open('old.csv', 'r') as t1:
    old_csv = t1.readlines()
with open('new.csv', 'r') as t2:
    new_csv = t2.readlines()

with open('update.csv', 'w') as out_file:
    line_in_new = 0
    line_in_old = 0
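    while line_in_new < len(new_csv) and line_in_old < len(old_csv):
        if old_csv[line_in_old] != new_csv[line_in_new]:
            out_file.write(new_csv[line_in_new])
        else:
            line_in_old += 1
        line_in_new += 1
    # Any lines left in the new file after the last old line are also new.
    out_file.writelines(new_csv[line_in_new:])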

  • Note that I used the with context manager and some meaningful variable names, which makes the code easier to understand at a glance. You also don't need the csv package, since none of its functionality is used here.
  • Your code was almost doing the right thing, except that you must not advance to the next line of the old CSV unless both CSVs currently show the same line. That is, when you find a new line, keep reading the new file until you reach an old line again; only then move on in the old file.

UPDATE: This solution is not as pretty as Chris Mueller's, which is perfect and very Pythonic for small files, but it only walks through each file once (keeping the idea of your original algorithm), so it can be the better choice for larger files.
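
The single-pass idea can also be written as a small function over lists, which makes it easy to try on sample data. This is a sketch under the same assumption as the answer (every old line survives in the new file, in the same order); the name added_lines and the sample data are illustrative. It iterates over the new lines to the very end, so lines appended after the last old line are reported as well.

```python
def added_lines(old_lines, new_lines):
    """Return lines of new_lines that were inserted relative to old_lines.

    Assumes every old line still appears in new_lines, in the same order.
    """
    added = []
    i_old = 0
    for line in new_lines:
        if i_old < len(old_lines) and line == old_lines[i_old]:
            i_old += 1          # matched an old line: advance in the old file
        else:
            added.append(line)  # not the expected old line: it is new
    return added


old = ["aaa\n", "bbb\n"]
new = ["xxx\n", "aaa\n", "yyy\n", "bbb\n", "zzz\n"]

print(added_lines(old, new))  # ['xxx\n', 'yyy\n', 'zzz\n']
```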

Answered by Milton

You may find the csv-diff package useful:

pip install csv-diff

Once installed, you can run it from the command line:

csv-diff one.csv two.csv --key=id
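
The --key=id option tells the tool which column identifies a row, so that rows can be matched even when their order changes. As a rough illustration of that idea using only the standard library (this is not csv-diff's implementation or API; the function diff_by_key and the sample data are assumptions):

```python
import csv
import io


def diff_by_key(old_text, new_text, key):
    """Compare two CSV texts, matching rows on the given key column."""
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_text))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_text))}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


one = "id,name\n1,alice\n2,bob\n3,carol\n"
two = "id,name\n1,alice\n2,robert\n4,dave\n"

print(diff_by_key(one, two, key="id"))
# {'added': ['4'], 'removed': ['3'], 'changed': ['2']}
```

Matching on a key distinguishes a changed row from an unrelated removed and added pair, which whole-line comparison cannot do.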

Answered by Aaksh Kumar

with open('first_test_pipe.csv', 'r') as t1, open('validation.csv', 'r') as t2:
    filecoming = t1.readlines()
    filevalidation = t2.readlines()

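# Compare the two files line by line; zip() stops at the shorter file,
# which avoids an IndexError when the line counts differ.
for coming_line, validation_line in zip(filecoming, filevalidation):
    coming_set = set(coming_line.replace("\n", "").split(","))
    validation_set = set(validation_line.replace("\n", "").split(","))
    # Fields present on both lines.
    ReceivedDataList = list(validation_set & coming_set)
    # Fields present on only one of the two lines (symmetric difference).
    NotReceivedDataList = list(coming_set ^ validation_set)
    print(NotReceivedDataList)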