bash - Deleting lines from one file which are in another file
Disclaimer: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/4780203/
Deleting lines from one file which are in another file
Asked by lalli
I have a file f1:
line1
line2
line3
line4
..
..
I want to delete all the lines which are in another file f2:
line2
line8
..
..
I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?
Answered by gabuzo
grep -v -x -f f2 f1
should do the trick.
Explanation:
- -v to select non-matching lines
- -x to match whole lines only
- -f f2 to get patterns from f2
One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
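For instance, a hypothetical session with the question's files recreated by hand (the printf lines are illustrative, not part of the original answer):

$ printf 'line1\nline2\nline3\nline4\n' > f1
$ printf 'line2\nline8\n' > f2
$ grep -F -x -v -f f2 f1
line1
line3
line4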
Answered by Ignacio Vazquez-Abrams
Try comm instead (assuming f1 and f2 are "already sorted")
comm -2 -3 f1 f2
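If f1 and f2 aren't sorted yet, a common workaround is bash process substitution (a sketch; note the result then comes out in sorted order rather than f1's original order):

# -2 suppresses lines unique to f2, -3 suppresses lines common to both,
# leaving only the lines unique to f1; comm requires sorted input.
comm -2 -3 <(sort f1) <(sort f2)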
Answered by Paused until further notice.
For exclude files that aren't too huge, you can use AWK's associative arrays.
awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt
The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.
The algorithmic complexity will probably be O(n) (exclude-these.txt size) + O(n) (from-this.txt size)
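As a quick illustration, a hypothetical run against the question's f1 and f2 (the ".." lines omitted; this session is not part of the original answer):

$ awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' f2 f1
line1
line3
line4

NR == FNR is only true while awk reads the first file given (here f2, the exclude list), so the first block builds the array and the second block filters f1.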
Answered by jcsahnwaldt says GoFundMonica
Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):
awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt
Accessing r[$0] creates the entry for that line; there is no need to set a value.
Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
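A matching hypothetical run (the f=1 and f=2 assignments are processed when awk reaches them in the argument list, so they label which file is being read):

$ awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 f2 f=2 f1
line1
line3
line4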
Answered by kurumi
If you have Ruby (1.9+):
#!/usr/bin/env ruby
b=File.read("file2").split
open("file1").each do |x|
  x.chomp!
  puts x if !b.include?(x)
end

Which has O(N^2) complexity. If you care about performance, here's another version

b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}

which uses a hash to effect the subtraction, so is complexity O(n) (size of a) + O(n) (size of b)

here's a little benchmark, courtesy of user576875, but with 100K lines, of the above:

$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test

real    0m0.639s
user    0m0.554s
sys     0m0.021s

$ time sort file1 file2|uniq -u > sort.test

real    0m2.311s
user    0m1.959s
sys     0m0.040s

$ diff <(sort -n ruby.test) <(sort -n sort.test)
$

diff was used to show there are no differences between the 2 files generated.
Answered by Lri
Some timing comparisons between various other answers:
$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null

real    0m0.019s
user    0m0.023s
sys     0m0.012s

$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null

real    0m0.026s
user    0m0.018s
sys     0m0.007s

$ time grep -xvf f2 f1 > /dev/null

real    0m43.197s
user    0m43.155s
sys     0m0.040s

sort f1 f2 | uniq -u isn't even a symmetrical difference, because it removes lines that appear multiple times in either file (a sketch at the end of this answer illustrates this).

comm can also be used with stdin and here strings:

echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a
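A sketch of that uniq -u pitfall with two invented files (not from the original answer):

$ printf 'dup\ndup\n' > f1
$ printf 'other\n' > f2
$ sort f1 f2 | uniq -u
other

The repeated line "dup" is dropped even though it isn't in f2, and "other" is printed even though it only exists in f2.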
Answered by Benoit

Seems to be a job suitable for the SQLite shell:

create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify ".separator ××any_improbable_string××"
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q
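A hedged sketch of driving this from bash (assuming the statements above are saved as diff.sql and the sqlite3 command-line shell is installed; the file names are placeholders):

$ sqlite3 :memory: < diff.sql    # dot-commands and SQL run in one pass
$ cat result.txt                 # lines of file2.txt not found in file1.txt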
Answered by Ruan
Did you try this with sed?

sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh
sed -i 's#$#%%g'"'"' f1#g' f2.sh
sed -i '1i#!/bin/bash' f2.sh
sh f2.sh
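To see what this builds: with the question's f2, the generated f2.sh would look roughly like the following (a reconstruction, not from the original answer). Note that it blanks out matching text in place rather than deleting whole lines, and it treats each line of f2 as a sed pattern:

#!/bin/bash
sed -i 's%line2%%g' f1
sed -i 's%line8%%g' f1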
Answered by youngrrrr
Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.
Obviously won't work for huge files but it did the trick for me. A few notes:
- I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
- The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data