Return the lines that differ between two files (Python)

Question

Asked by user2597879

I have two files with tens of thousands of lines each, output1.txt and output2.txt. I want to iterate through both files and return the line number (and content) of the lines that differ between the two. They're mostly the same, which is why I can't find the differences (filecmp.cmp returns False).

Answer 1

Accepted answer by dawg

You can do something like this:

import difflib, sys

tl = 100000    # large number of lines

# create two test files (the /tmp paths assume a Unix-like system)

with open('/tmp/f1.txt', 'w') as f:
    for x in range(tl):
        f.write('line {}\n'.format(x))

with open('/tmp/f2.txt', 'w') as f:
    for x in range(tl + 10):   # add 10 lines
        if x in (500, 505, 1000, tl - 2):
            continue           # skip these lines
        f.write('line {}\n'.format(x))

# lines only in f1 start with '-', lines only in f2 start with '+'
with open('/tmp/f1.txt', 'r') as f1, open('/tmp/f2.txt', 'r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for line in diff:
        if line.startswith('-'):
            sys.stdout.write(line)
        elif line.startswith('+'):
            sys.stdout.write('\t\t' + line)

Prints (in 400 ms):

- line 500
- line 505
- line 1000
- line 99998
        + line 100000
        + line 100001
        + line 100002
        + line 100003
        + line 100004
        + line 100005
        + line 100006
        + line 100007
        + line 100008
        + line 100009

If you want a running count, use enumerate. Note that this counts positions in the diff output, not line numbers in either file:

with open('/tmp/f1.txt','r') as f1, open('/tmp/f2.txt','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())    
    for i,line in enumerate(diff):
        if line.startswith(' '):
            continue
        sys.stdout.write('My count: {}, text: {}'.format(i,line))
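
If you need the actual line number in each file rather than a position in the diff output, one option is difflib.SequenceMatcher and its get_opcodes() method. A minimal sketch, assuming both files fit in memory; changed_lines is a hypothetical helper name, not something from the answer above:

```python
import difflib

def changed_lines(a_lines, b_lines):
    """Yield (tag, line_no_in_a, line_no_in_b, text) for differing lines.

    Line numbers are 1-based. The number for the file that lacks the
    line is None.
    """
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            continue
        # 'delete' and 'replace' opcodes cover lines present only in a
        for i in range(i1, i2):
            yield ('-', i + 1, None, a_lines[i])
        # 'insert' and 'replace' opcodes cover lines present only in b
        for j in range(j1, j2):
            yield ('+', None, j + 1, b_lines[j])

# small in-memory example; for real files, pass f1.readlines() and f2.readlines()
a = ['line 0', 'line 1', 'line 2', 'line 3']
b = ['line 0', 'line 2', 'line 3', 'line 4']
for tag, no_a, no_b, text in changed_lines(a, b):
    print(tag, no_a if no_a is not None else no_b, text)
```

Each '-' entry carries the line number in the first file and each '+' entry the line number in the second, so the two can be reported separately.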

Answer 2

Answer by John La Rooy

7.4. difflib— Helpers for computing deltas

New in version 2.1.

This module provides classes and functions for comparing sequences. It can be used, for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.
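
As an illustration of the unified diff format mentioned above, here is a small sketch using difflib.unified_diff. The helper name unified and the sample data are made up for the example; the file names are the ones from the question:

```python
import difflib

def unified(a_lines, b_lines, a_name='output1.txt', b_name='output2.txt'):
    """Return a unified diff of two lists of lines as a single string."""
    diff = difflib.unified_diff(
        a_lines, b_lines,
        fromfile=a_name, tofile=b_name,
        n=0,          # no context lines: show only what changed
        lineterm='',  # the sample lines carry no trailing newlines
    )
    return '\n'.join(diff)

old = ['alpha', 'beta', 'gamma']
new = ['alpha', 'BETA', 'gamma']
print(unified(old, new))
```

The hunk headers (the lines starting with @@) give the line ranges in each file, which is one way to get line numbers without tracking them yourself.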

Answer 3

Answer by korylprince

As long as you don't care about order, you could use:

with open('file1') as f:
    t1 = f.read().splitlines()
    t1s = set(t1)

with open('file2') as f:
    t2 = f.read().splitlines()
    t2s = set(t2)

# in file1 but not file2
print("Only in file1")
for diff in t1s - t2s:
    print(t1.index(diff), diff)   # 0-based index of the first occurrence

# in file2 but not file1
print("Only in file2")
for diff in t2s - t1s:
    print(t2.index(diff), diff)
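
One caveat with sets is that duplicate lines collapse, so a line that appears three times in one file and once in the other is not reported. If that matters, collections.Counter can stand in for set. A rough sketch, where unordered_diff is a hypothetical helper name:

```python
from collections import Counter

def unordered_diff(a_lines, b_lines):
    """Return (only_in_a, only_in_b) as Counters, ignoring line order."""
    count_a = Counter(a_lines)
    count_b = Counter(b_lines)
    # Counter subtraction keeps only positive counts, so a line that occurs
    # twice in a and once in b is reported once as surplus in a.
    return count_a - count_b, count_b - count_a

only_a, only_b = unordered_diff(['x', 'x', 'y'], ['x', 'z'])
print(dict(only_a))
print(dict(only_b))
```

Like the set version, this ignores ordering entirely, so it cannot tell you where in the file the extra lines sit.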

Edit: If you do care about order and they're mostly the same, then why not just use the diff command?
