Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/4366533/

How to remove the lines which appear on file B from another file A?

linux, shell, sed, diff, grep

Asked by slhck

I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.

Which command would I use to remove all the addresses that appear in file B from file A?

So, if file A contained:

A
B
C

and file B contained:

B    
D
E

Then file A should be left with:

A
C

Now I know this is a question that has probably been asked before, but I only found one command online, and it gave me an error about a bad delimiter.

Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not a shell expert.

Accepted answer by The Archetypal Paul

If the files are sorted (they are in your example):

comm -23 file1 file2

-23 suppresses the lines that are in both files or only in file 2, leaving the lines that appear only in file 1. If the files are not sorted, pipe them through sort first...
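
For unsorted input, the sorting step can be spelled out with temporary copies, which avoids relying on any shell-specific syntax. This is a minimal sketch, assuming the question's sample data in scratch files named fileA and fileB; the scratch directory and the .sorted names are only for illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf '%s\n' C A B > "$workdir/fileA"   # unsorted on purpose
printf '%s\n' E B D > "$workdir/fileB"

# comm needs sorted input, so sort each file into a temporary copy first.
sort "$workdir/fileA" > "$workdir/fileA.sorted"
sort "$workdir/fileB" > "$workdir/fileB.sorted"

# -2 drops lines unique to the second file, -3 drops lines common to both,
# leaving only the lines unique to the first file.
result=$(comm -23 "$workdir/fileA.sorted" "$workdir/fileB.sorted")
printf '%s\n' "$result"

rm -r "$workdir"
```

Note that the output comes back in sorted order, not in the original order of fileA.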

See the man page here

Answer by Paused until further notice.

Another way to do the same thing (also requires sorted input):

join -v 1 fileA fileB

In Bash, if the files are not pre-sorted:

join -v 1 <(sort fileA) <(sort fileB)

Answer by aec

You can do this with diff, provided the lines common to both files appear in the same order (for example, when both files are sorted):

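# Write to a new file first: redirecting straight to file-a would truncate it before diff reads it.
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > file-a.new
mv file-a.new file-a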

--new-line-format is for lines that are in file b but not in a.
--old-line-format is for lines that are in file a but not in b.
--unchanged-line-format is for lines that are in both.
%L makes it so the line is printed exactly.

See man diff for more details.

Answer by karakfa

awk to the rescue!

This solution doesn't require sorted inputs. You have to provide fileB first.

awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA

returns

A
C

How does it work?

The NR==FNR{a[$0];next} idiom stores the first file in an associative array as keys for a later "contains" test.

NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the current file's line counter (FNR).

a[$0] adds the current line to the associative array as a key. Note that this behaves like a set, so there won't be any duplicate values (keys).

!($0 in a): we're now in the next file(s). in is a "contains" test; here it checks whether the current line is in the set we populated from the first file in the first step, and ! negates the condition. What is missing here is the action, which by default is {print} and is usually not written explicitly.
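
The same one-liner reads more easily when laid out as a multi-line program with one comment per rule. This is only an expanded sketch of the idiom described above, run on the question's sample data; the array name seen and the scratch files are illustrative.

```shell
workdir=$(mktemp -d)
printf '%s\n' A B C > "$workdir/fileA"
printf '%s\n' B D E > "$workdir/fileB"

result=$(awk '
    # First file only: NR and FNR are equal while awk reads fileB.
    NR == FNR {
        seen[$0]    # referencing the key is enough to create it
        next        # skip the filter rule below for fileB lines
    }
    # Remaining file(s): print lines that are not keys in the set.
    !($0 in seen) { print }
' "$workdir/fileB" "$workdir/fileA")
printf '%s\n' "$result"

rm -r "$workdir"
```

One caveat worth checking before relying on this: if the first file is empty, NR==FNR also holds while the second file is read, so nothing is printed.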

Note that this can now be used to remove blacklisted words.

$ awk '...' badwords allwords > goodwords

With a slight change it can clean multiple lists and create cleaned versions.

$ awk 'NR==FNR{a[$0];next} !($0 in a){print > (FILENAME".clean")}' bad file1 file2 file3 ...

Answer by peak

This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.

This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.

# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.

awk -v N="$N" -v lookup="$LOOKUP" '
  BEGIN { while ( (getline < lookup) > 0 ) { dictionary[$0]=$0 } }
  !($N in dictionary) {print}'

(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
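
To illustrate the field-based comparison, here is a hedged sketch that filters on the second column only. The input layout (a name followed by an address), the example.com addresses and the file names are all made up for illustration, and the getline loop is written with an explicit variable and a "> 0" test.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'alice a@example.com' 'bob b@example.com' 'carol c@example.com' > "$workdir/input"
printf '%s\n' 'b@example.com' > "$workdir/lookup"

result=$(awk -v N=2 -v lookup="$workdir/lookup" '
    BEGIN {
        # Load every lookup line as a key; "> 0" stops on EOF or read error.
        while ((getline line < lookup) > 0) { dictionary[line] }
        close(lookup)
    }
    # $N is field N of the current line ($0, the whole line, when N is 0).
    !($N in dictionary) { print }
' "$workdir/input")
printf '%s\n' "$result"

rm -r "$workdir"
```

Only the line for bob is dropped, because only its second field appears in the lookup file.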

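Answer by HelloGoodbye

You can use Python:

python -c '
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())

with open("file A", "r") as f:
    for line in [line.strip() for line in f.readlines()]:
        if line not in lines_to_remove:
            print(line)
'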

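Answer by Darpan

You can use:

diff fileA fileB | grep "^<" | cut -c3- > fileA.new

In diff's default output, lines that exist only in fileA are prefixed with "< " and lines that exist only in fileB are prefixed with "> ". The result goes to a new file because redirecting straight to fileA would truncate it before diff reads it.

Because diff compares lines in order, this gives the expected result only when the lines common to both files appear in the same order, for example when both files are sorted.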

Answer by Aakarsh Gupta

To remove common lines between two files you can use the grep, comm or join command.

grep only works for small files. Use -v along with -f.

grep -vf file2 file1

This displays lines from file1 that do not match any line in file2. Each line of file2 is used as a pattern.
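
Because grep treats each line of file2 as a regular expression that may match anywhere in a line, plain -vf can remove more than intended. The sketch below contrasts it with -F (fixed strings) and -x (whole-line match); the sample addresses are invented for illustration.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'a.b@example.com' 'aXb@example.com' 'xa.b@example.com' > "$workdir/file1"
printf '%s\n' 'a.b@example.com' > "$workdir/file2"

# Plain -vf: "." matches any character and patterns match anywhere in a line,
# so all three addresses are dropped. grep exits non-zero when it prints
# nothing, hence the "|| true".
loose=$(grep -vf "$workdir/file2" "$workdir/file1" || true)

# -F treats patterns as fixed strings, -x requires a whole-line match.
strict=$(grep -Fvxf "$workdir/file2" "$workdir/file1" || true)

printf 'plain -vf kept:\n%s\n' "$loose"
printf '-Fvxf kept:\n%s\n' "$strict"

rm -r "$workdir"
```

For exact line removal, as in the question, the -Fvxf form is usually the closer fit.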

comm is a utility command that works on lexically sorted files. It takes two files as input and produces three text columns as output: lines only in the first file; lines only in the second file; and lines in both files. You can suppress printing of any column by using the -1, -2 or -3 option accordingly.

comm -1 -3 file2 file1

This displays the lines of file1 that do not appear in file2.

Finally, there is join, a utility command that performs an equality join on the specified files. Its -v option prints unpairable lines, which lets you drop the lines common to both files.

join -v1 -v2 file1 file2
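
The number given to -v selects which file's unpairable lines are printed, and that changes the result. A small sketch on the question's sample data (the scratch files and variable names are illustrative) shows the difference between -v 1 alone and -v 1 -v 2 together.

```shell
workdir=$(mktemp -d)
printf '%s\n' A B C > "$workdir/file1"   # already sorted, as join requires
printf '%s\n' B D E > "$workdir/file2"

# -v 1: only the unpairable lines of file1 (what the question asks for).
only_first=$(join -v 1 "$workdir/file1" "$workdir/file2")

# -v 1 -v 2: unpairable lines of both files, i.e. everything except the
# common lines.
either=$(join -v 1 -v 2 "$workdir/file1" "$workdir/file2")

printf 'only in file1:\n%s\n' "$only_first"
printf 'in exactly one file:\n%s\n' "$either"

rm -r "$workdir"
```

Keep in mind that join compares only the join field, which by default is the first blank-separated field, so it matches whole lines only when the lines contain no blanks.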