bash: Deleting lines from one file which are in another file

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/4780203/


Deleting lines from one file which are in another file

Tags: bash, scripting, sh

Asked by lalli

I have a file f1:


line1
line2
line3
line4
..
..

I want to delete all the lines which are in another file f2:


line2
line8
..
..

I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?


Answered by gabuzo

grep -v -x -f f2 f1 should do the trick.


Explanation:


  • -v to select non-matching lines
  • -x to match whole lines only
  • -f f2 to get patterns from f2

One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).

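For example, a quick sketch with the sample lines from the question (file contents abbreviated; -F added for fixed-string matching as described above):

$ printf 'line1\nline2\nline3\nline4\n' > f1
$ printf 'line2\nline8\n' > f2
$ grep -v -x -F -f f2 f1
line1
line3
line4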

Answered by Ignacio Vazquez-Abrams

Try comm instead (assuming f1 and f2 are "already sorted")


comm -2 -3 f1 f2
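
If f1 and f2 are not sorted yet, a common workaround (it also appears in the timing comparison further down) is to sort them on the fly with process substitution:

comm -23 <(sort f1) <(sort f2)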

Answered by Paused until further notice.

For exclude files that aren't too huge, you can use AWK's associative arrays.


awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt

The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.


The algorithmic complexity will probably be O(n) (exclude-these.txt size) + O(n) (from-this.txt size)

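As a rough sanity check against the question's sample files (a sketch; here f2 plays the role of exclude-these.txt and f1 of from-this.txt):

$ awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' f2 f1
line1
line3
line4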

Answered by jcsahnwaldt says GoFundMonica

Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):


awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt


Accessing r[$0] creates the entry for that line, no need to set a value.


Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.

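For instance, applied to the question's sample files (a sketch; f1 and f2 as above):

$ awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 f2 f=2 f1
line1
line3
line4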

Answered by kurumi

if you have Ruby (1.9+)


#!/usr/bin/env ruby
b=File.read("file2").split
open("file1").each do |x|
  x.chomp!
  puts x if !b.include?(x)
end

Which has O(N^2) complexity. If you care about performance, here's another version:


b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}

which uses a hash to effect the subtraction, so the complexity is O(n) (size of a) + O(n) (size of b).

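The same hash-based subtraction can be written as a one-liner (a sketch, assuming the inputs are named f1 and f2; the same command shows up in the timing comparison further down):

ruby -e 'puts File.readlines("f1") - File.readlines("f2")'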

here's a little benchmark of the above, courtesy of user576875, but with 100K lines:


$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test

real    0m0.639s
user    0m0.554s
sys     0m0.021s

$ time sort file1 file2|uniq -u > sort.test

real    0m2.311s
user    0m1.959s
sys     0m0.040s

$ diff <(sort -n ruby.test) <(sort -n sort.test)
$

diff was used to show there are no differences between the 2 files generated.


Answered by Lri

Some timing comparisons between various other answers:


$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null

real    0m0.019s
user    0m0.023s
sys     0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null

real    0m0.026s
user    0m0.018s
sys     0m0.007s
$ time grep -xvf f2 f1 > /dev/null

real    0m43.197s
user    0m43.155s
sys     0m0.040s

sort f1 f2 | uniq -u isn't even a symmetrical difference, because it removes lines that appear multiple times in either file.

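A quick illustration of that caveat (a throwaway example): 'a' appears twice in f1 and never in f2, so it should survive the subtraction, but uniq -u drops it because it is repeated:

$ printf 'a\na\nb\n' > f1
$ printf 'b\n' > f2
$ sort f1 f2 | uniq -u
$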

comm can also be used with stdin and here strings:


echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a

Answered by Benoit

Seems to be a job suitable for the SQLite shell:


create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify “ .separator ××any_improbable_string×× ”
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q
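
A sketch of running this non-interactively (assuming the commands above are saved as diff.sql; the database name is arbitrary and is created on the fly):

sqlite3 scratch.db < diff.sql
cat result.txt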

Answered by Ruan

Did you try this with sed?


sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh
sed -i 's#$#%%g'"'"' f1#g' f2.sh
sed -i '1i#!/bin/bash' f2.sh
sh f2.sh

Answered by youngrrrr

Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.


Obviously won't work for huge files but it did the trick for me. A few notes:


  • I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
  • The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data