bash - Deleting lines from one file which are in another file
Disclaimer: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/4780203/
Deleting lines from one file which are in another file
Asked by lalli
I have a file f1:
line1
line2
line3
line4
..
..
I want to delete all the lines which are in another file f2:
line2
line8
..
..
I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?
Answered by gabuzo
grep -v -x -f f2 f1
should do the trick.
Explanation:
- -v to select non-matching lines
- -x to match whole lines only
- -f f2 to get patterns from f2
One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
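For instance, a hypothetical session with the question's files recreated by hand (the printf lines are illustrative, not part of the original answer):

$ printf 'line1\nline2\nline3\nline4\n' > f1
$ printf 'line2\nline8\n' > f2
$ grep -F -x -v -f f2 f1
line1
line3
line4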
Answered by Ignacio Vazquez-Abrams
Try comm instead (assuming f1 and f2 are "already sorted")
comm -2 -3 f1 f2
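If f1 and f2 aren't sorted yet, a common workaround is bash process substitution (a sketch; note the result then comes out in sorted order rather than f1's original order):

# -2 suppresses lines unique to f2, -3 suppresses lines common to both,
# leaving only the lines unique to f1; comm requires sorted input.
comm -2 -3 <(sort f1) <(sort f2)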
Answered by Paused until further notice.
For exclude files that aren't too huge, you can use AWK's associative arrays.
awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt
The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.
The algorithmic complexity will probably be O(n) (exclude-these.txt size) + O(n) (from-this.txt size)
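As a quick illustration, a hypothetical run against the question's f1 and f2 (the ".." lines omitted; this session is not part of the original answer):

$ awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' f2 f1
line1
line3
line4

NR == FNR is only true while awk reads the first file given (here f2, the exclude list), so the first block builds the array and the second block filters f1.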
Answered by jcsahnwaldt says GoFundMonica
Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):
awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt
Accessing r[$0] creates the entry for that line; there is no need to set a value.
Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
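A matching hypothetical run (the f=1 and f=2 assignments are processed when awk reaches them in the argument list, so they label which file is being read):

$ awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 f2 f=2 f1
line1
line3
line4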
Answered by kurumi
If you have Ruby (1.9+):
#!/usr/bin/env ruby
b=File.read("file2").split
open("file1").each do |x|
  x.chomp!
  puts x if !b.include?(x)
end

Which has O(N^2) complexity. If you care about performance, here's another version

b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}

which uses a hash to effect the subtraction, so is complexity O(n) (size of a) + O(n) (size of b)

here's a little benchmark, courtesy of user576875, but with 100K lines, of the above:

$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test

real    0m0.639s
user    0m0.554s
sys     0m0.021s

$ time sort file1 file2|uniq -u > sort.test

real    0m2.311s
user    0m1.959s
sys     0m0.040s

$ diff <(sort -n ruby.test) <(sort -n sort.test)
$

diff was used to show there are no differences between the 2 files generated.
Answered by Lri
Some timing comparisons between various other answers:
$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null

real    0m0.019s
user    0m0.023s
sys     0m0.012s

$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null

real    0m0.026s
user    0m0.018s
sys     0m0.007s

$ time grep -xvf f2 f1 > /dev/null

real    0m43.197s
user    0m43.155s
sys     0m0.040s

sort f1 f2 | uniq -u isn't even a symmetrical difference, because it removes lines that appear multiple times in either file (a sketch at the end of this answer illustrates this).

comm can also be used with stdin and here strings:

echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a
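A sketch of that uniq -u pitfall with two invented files (not from the original answer):

$ printf 'dup\ndup\n' > f1
$ printf 'other\n' > f2
$ sort f1 f2 | uniq -u
other

The repeated line "dup" is dropped even though it isn't in f2, and "other" is printed even though it only exists in f2.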
Answered by Benoit

Seems to be a job suitable for the SQLite shell:

create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify ".separator ××any_improbable_string××"
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q
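A hedged sketch of driving this from bash (assuming the statements above are saved as diff.sql and the sqlite3 command-line shell is installed; the file names are placeholders):

$ sqlite3 :memory: < diff.sql    # dot-commands and SQL run in one pass
$ cat result.txt                 # lines of file2.txt not found in file1.txt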
Answered by Ruan
Did you try this with sed?

sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh
sed -i 's#$#%%g'"'"' f1#g' f2.sh
sed -i '1i#!/bin/bash' f2.sh
sh f2.sh
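To see what this builds: with the question's f2, the generated f2.sh would look roughly like the following (a reconstruction, not from the original answer). Note that it blanks out matching text in place rather than deleting whole lines, and it treats each line of f2 as a sed pattern:

#!/bin/bash
sed -i 's%line2%%g' f1
sed -i 's%line8%%g' f1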
Answered by youngrrrr
Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.
Obviously won't work for huge files but it did the trick for me. A few notes:
- I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
- The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data