Python : Compare two csv files and print out differences
Original question: http://stackoverflow.com/questions/38996033/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Nick Yellow
I need to compare two CSV files and print out the differences in a third CSV file. In my case, the first CSV, old.csv, is an old list of hashes, and the second CSV, new.csv, is the new list, which contains both old and new hashes.
Here is my code:
import csv

t1 = open('old.csv', 'r')
t2 = open('new.csv', 'r')
fileone = t1.readlines()
filetwo = t2.readlines()
t1.close()
t2.close()

outFile = open('update.csv', 'w')
x = 0
for i in fileone:
    if i != filetwo[x]:
        outFile.write(filetwo[x])
    x += 1
outFile.close()
The third file is a copy of the old one and not the update. What's wrong? I hope you can help me, many thanks!
PS: I don't want to use diff.
Answered by Chris Mueller
The problem is that you are comparing each line in fileone to the line at the same position in filetwo. As soon as there is an extra line in one file, the positions no longer line up and the lines will never compare equal again. Try this:
with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

# Write every line of the new file that does not appear anywhere in the old file.
with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
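The test "line not in fileone" scans a list, so the total work grows with the product of the two file sizes. A minimal sketch of the same idea with a set lookup, assuming exact line-for-line matching (including the trailing newline) is what you want; the function name lines_only_in_new is hypothetical and not part of the answer above:

```python
def lines_only_in_new(old_lines, new_lines):
    """Return the lines of new_lines that never occur in old_lines, in file order."""
    seen = set(old_lines)  # set membership is a hash lookup, not a list scan
    return [line for line in new_lines if line not in seen]


# Small in-memory demonstration; with real files you would pass
# the results of t1.readlines() and t2.readlines() instead.
old = ["hash1\n", "hash2\n"]
new = ["hash1\n", "hash3\n", "hash2\n", "hash4\n"]
added = lines_only_in_new(old, new)
```

The result could then be written out with outFile.writelines(added). Note that a last line without a trailing newline will not match the same text followed by a newline.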
Answered by seler
Detecting differences feels natural with sets.
#!/usr/bin/env python3
import sys
import argparse
import csv


def get_dataset(f):
    # Each CSV row becomes a hashable tuple so that rows can be stored in a set.
    return set(map(tuple, csv.reader(f)))


def main(f1, f2, outfile, sorting_column):
    set1 = get_dataset(f1)
    set2 = get_dataset(f2)
    # Symmetric difference: rows that appear in exactly one of the two files.
    different = set1 ^ set2

    output = csv.writer(outfile)

    for row in sorted(different, key=lambda x: x[sorting_column], reverse=True):
        output.writerow(row)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('infile', nargs=2, type=argparse.FileType('r'))
    parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'), default=sys.stdout)
    parser.add_argument('-sc', '--sorting-column', nargs='?', type=int, default=0)

    args = parser.parse_args()

    main(*args.infile, args.outfile, args.sorting_column)
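Note that the ^ operator is a symmetric difference, so rows that exist only in the old file are reported too. If only the rows added in the new file are wanted, as in the question, a plain set difference may be closer to the goal. A minimal sketch with made-up in-memory data; the helper name rows and the sample values are assumptions for illustration:

```python
import csv
import io


def rows(text):
    """Parse CSV text into a set of row tuples."""
    return set(map(tuple, csv.reader(io.StringIO(text))))


old = rows("h1,a\nh2,b\n")
new = rows("h2,b\nh3,c\n")

only_in_new = new - old   # rows added in the new file
either_side = old ^ new   # rows present in exactly one of the files
```

Sets drop duplicate rows and do not preserve file order, which is why the script above sorts the result before writing it.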
Answered by AdrienW
I assumed your new file was just like your old one, except that some lines were added in between the old ones. The old lines in both files are stored in the same order.
Try this:
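with open('old.csv', 'r') as t1:
    old_csv = t1.readlines()
with open('new.csv', 'r') as t2:
    new_csv = t2.readlines()

with open('update.csv', 'w') as out_file:
    line_in_new = 0
    line_in_old = 0
    while line_in_new < len(new_csv) and line_in_old < len(old_csv):
        if old_csv[line_in_old] != new_csv[line_in_new]:
            # A line that is not the next old line must be new: write it out.
            out_file.write(new_csv[line_in_new])
        else:
            # Same line in both files: advance in the old file as well.
            line_in_old += 1
        line_in_new += 1
    # Lines left in the new file after the last old line are new as well.
    out_file.writelines(new_csv[line_in_new:])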
- Note that I used the with context manager and some meaningful variable names, which makes it instantly easier to understand. And you don't need the csv package since you're not using any of its functionalities here.
- About your code, you were almost doing the right thing, except that you must not go to the next line in your old CSV unless you are reading the same thing in both CSVs. That is to say, if you find a new line, keep reading the new file until you stumble upon an old one and then you'll be able to continue reading.
UPDATE: This solution is not as pretty as Chris Mueller's one, which is perfect and very Pythonic for small files, but it only passes through each file once (keeping the idea of your original algorithm), so it can be better if you have larger files.
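For large files, the same single-pass idea can be sketched as a generator that consumes both inputs lazily instead of loading them with readlines(). This is only a sketch under the same assumption as above (the old lines appear in the new file in the same order); the name added_lines is hypothetical:

```python
def added_lines(old_lines, new_lines):
    """Yield lines of new_lines that are not matched, in order, by old_lines."""
    old_iter = iter(old_lines)
    current_old = next(old_iter, None)  # None once the old input is exhausted
    for line in new_lines:
        if line == current_old:
            # Same line on both sides: move on to the next old line.
            current_old = next(old_iter, None)
        else:
            # Not the next expected old line, so treat it as new.
            yield line


# In-memory demonstration; open file objects can be passed directly,
# since iterating over a file yields its lines one at a time.
old = ["a\n", "b\n"]
new = ["a\n", "x\n", "b\n", "y\n"]
result = list(added_lines(old, new))
```

With real files this might be used as out_file.writelines(added_lines(t1, t2)) inside the with blocks.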
Answered by Milton
Answered by Aaksh Kumar
with open('first_test_pipe.csv', 'r') as t1, open('validation.csv', 'r') as t2:
    filecoming = t1.readlines()
    filevalidation = t2.readlines()

# Compare the files row by row, treating each row as a set of fields.
for i in range(0, len(filevalidation)):
    coming_set = set(filecoming[i].replace("\n", "").split(","))
    validation_set = set(filevalidation[i].replace("\n", "").split(","))
    # Fields present in both rows.
    ReceivedDataList = list(validation_set.intersection(coming_set))
    # Fields present in only one of the two rows.
    NotReceivedDataList = list(coming_set.union(validation_set) -
                               coming_set.intersection(validation_set))
    print(NotReceivedDataList)
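The union minus the intersection computed above is the symmetric difference of the two sets, and indexing filecoming[i] raises an IndexError when the first file is shorter than the second. A compact sketch of the same row-by-row comparison, assuming simple comma-separated rows with no quoted fields; the function name field_differences is hypothetical:

```python
def field_differences(coming_lines, validation_lines):
    """For each pair of rows, list the fields that occur in only one of them."""
    diffs = []
    # zip stops at the shorter input, so unequal lengths do not raise an error.
    for coming, validation in zip(coming_lines, validation_lines):
        coming_set = set(coming.rstrip("\n").split(","))
        validation_set = set(validation.rstrip("\n").split(","))
        # ^ is the symmetric difference: union minus intersection.
        diffs.append(sorted(coming_set ^ validation_set))
    return diffs


diffs = field_differences(["a,b,c\n", "x,y\n"], ["a,b,d\n", "x,y\n"])
```

Because each row is turned into a set, field positions and repeated values within a row are ignored.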