bash, Linux: Set difference between two text files

Question

Asked by Adam Matan

I have two files, A (nodes_to_delete) and B (nodes_to_keep). Each file has many lines with numeric ids.

I want the list of numeric ids that are in nodes_to_delete but NOT in nodes_to_keep, that is, the set difference of the two files.

Doing it within a PostgreSQL database is unreasonably slow. Is there a neat way to do it in bash using Linux CLI tools?

UPDATE: This would seem to be a Pythonic job, but the files are really, really large. I have solved some similar problems using uniq, sort and some set theory techniques. This was about two or three orders of magnitude faster than the database equivalents.

Answer 1

Answered by msw

The comm command does that.
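As a minimal sketch of that suggestion, assuming both files hold one id per line: comm expects its inputs sorted in the same collation order it compares with, so the ids are sorted as strings under LC_ALL=C rather than numerically. The helper name ids_only_in_first and the sample ids are made up for illustration.

```shell
# Hypothetical helper: print the lines of file $1 that do not occur in file $2.
ids_only_in_first() {
    sorted_first=$(mktemp)
    # Sort the first file once, as plain byte strings, dropping duplicates.
    LC_ALL=C sort -u "$1" > "$sorted_first"
    # "-" makes comm read the second (sorted) input from the pipe.
    # -2 hides lines unique to the second input, -3 hides lines common to both.
    LC_ALL=C sort -u "$2" | LC_ALL=C comm -23 "$sorted_first" -
    rm -f "$sorted_first"
}

# Made-up sample data.
workdir=$(mktemp -d)
printf '%s\n' 10 3 7 42 > "$workdir/nodes_to_delete"
printf '%s\n' 7 100 3   > "$workdir/nodes_to_keep"

result=$(ids_only_in_first "$workdir/nodes_to_delete" "$workdir/nodes_to_keep")
echo "$result"
rm -rf "$workdir"
```

The output comes back in string order, not numeric order; pipe it through sort -n afterwards if numeric order matters.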

Answer 2

Answered by slinkp

Somebody showed me how to do exactly this in sh a couple of months ago, and then I couldn't find it for a while... and while looking I stumbled onto your question. Here it is:

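# Each function takes two file names as $1 and $2.
set_union () {
   sort "$1" "$2" | uniq
}

# $2 is listed twice so that each of its lines occurs at least twice
# and is dropped by uniq -u; only lines unique to $1 remain.
set_difference () {
   sort "$1" "$2" "$2" | uniq -u
}

set_symmetric_difference() {
   sort "$1" "$2" | uniq -u
}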
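A minimal sketch of how such a helper might be exercised, assuming it receives the two file names as $1 and $2. Passing the second file to sort twice means each of its lines occurs at least twice, so uniq -u discards them. The sample files are made up, and the second call shows a caveat: a line duplicated inside the first file is discarded too.

```shell
# Assumed form of the helper: both file names arrive as $1 and $2.
set_difference() {
    # Lines from $2 occur at least twice, so uniq -u drops them;
    # what remains occurs exactly once overall, i.e. only in $1.
    sort "$1" "$2" "$2" | uniq -u
}

# Made-up sample data.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 > "$workdir/a"
printf '%s\n' 3 4 5   > "$workdir/b"

difference=$(set_difference "$workdir/a" "$workdir/b")
echo "$difference"

# Caveat: "1" appears twice in the first file, so uniq -u drops it as well.
printf '%s\n' 1 1 2 > "$workdir/dup"
with_dups=$(set_difference "$workdir/dup" "$workdir/b")
echo "$with_dups"
rm -rf "$workdir"
```

If the first file may contain repeated ids, deduplicating it first (for example with sort -u) avoids losing them.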

Answer 3

Answered by activedecay

Use comm; it compares two sorted files line by line.

The short answer to your question


This command returns the lines that appear only in deleteNodes, omitting any line that also appears in keepNodes.

comm -1 -3 <(sort keepNodes) <(sort deleteNodes)

Example setup


Let's create two files named keepNodes and deleteNodes, and use them as unsorted input for the comm command.

$ cat > keepNodes <(echo bob; echo amber;)
$ cat > deleteNodes <(echo bob; echo ann;)

By default, running comm without any options prints 3 columns with this layout:

lines_unique_to_FILE1
    lines_unique_to_FILE2
        lines_which_appear_in_both

Using the example files above, run comm without any options. Note the three columns.

$ comm <(sort keepNodes) <(sort deleteNodes)
amber
    ann
        bob

Suppressing column output


Suppress column 1, 2 or 3 with -1, -2 or -3. Note that when a column is hidden, the remaining columns shift left to fill the space.

$ comm -1 <(sort keepNodes) <(sort deleteNodes)
ann
    bob
$ comm -2 <(sort keepNodes) <(sort deleteNodes)
amber
    bob
$ comm -3 <(sort keepNodes) <(sort deleteNodes)
amber
    ann
$ comm -1 -3 <(sort keepNodes) <(sort deleteNodes)
ann
$ comm -2 -3 <(sort keepNodes) <(sort deleteNodes)
amber
$ comm -1 -2 <(sort keepNodes) <(sort deleteNodes)
bob

Sorting is important!


If you run comm on files that have not been sorted first, it fails gracefully with a message about which file is not sorted.

comm: file 1 is not in sorted order
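For very large files it may be worth sorting each file once to disk instead of re-sorting on every comm call. The sketch below assumes that approach, uses sort -c to test whether a file is already sorted, and pins LC_ALL=C so that sort and comm agree on the ordering. The .sorted file names are hypothetical.

```shell
# Made-up sample data matching the example files above.
workdir=$(mktemp -d)
printf '%s\n' bob ann   > "$workdir/deleteNodes"
printf '%s\n' bob amber > "$workdir/keepNodes"

# sort -c only checks the order: it exits non-zero if the input is unsorted.
if LC_ALL=C sort -c "$workdir/deleteNodes" 2>/dev/null; then
    delete_was_sorted=yes
else
    delete_was_sorted=no
fi
echo "deleteNodes already sorted: $delete_was_sorted"

# Sort each file once; -o writes the result to the named output file.
LC_ALL=C sort -o "$workdir/deleteNodes.sorted" "$workdir/deleteNodes"
LC_ALL=C sort -o "$workdir/keepNodes.sorted"   "$workdir/keepNodes"

# Lines only in deleteNodes: hide column 2 (only in keepNodes) and column 3 (both).
only_in_delete=$(LC_ALL=C comm -23 "$workdir/deleteNodes.sorted" "$workdir/keepNodes.sorted")
echo "$only_in_delete"
rm -rf "$workdir"
```

Using the same locale for both commands matters because comm compares lines with the collation order of its own environment.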

Answer 4

Answered by John B

comm was specifically designed for this kind of use case, but it requires sorted input.

awk is arguably a better tool for this: finding the set difference is fairly straightforward, it doesn't require sort, and it offers additional flexibility.

awk 'NR == FNR { a[$0]; next } !($0 in a)' nodes_to_keep nodes_to_delete

Perhaps, for example, you'd like to only find the difference in lines that represent non-negative integers:

awk -v r='^[0-9]+$' 'NR == FNR && $0 ~ r {
    a[$0]
    next
} $0 ~ r && !($0 in a)' nodes_to_keep nodes_to_delete
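A minimal sketch of how the awk approach behaves on unsorted input with repeated ids. The wrapper name awk_difference, the array name seen and the sample ids are made up for illustration. While awk reads the first file, NR == FNR holds and every line is stored as an array key; for the second file only lines missing from the array are printed.

```shell
# Hypothetical wrapper: $1 = file of ids to exclude, $2 = file to filter.
awk_difference() {
    awk 'NR == FNR { seen[$0]; next } !($0 in seen)' "$1" "$2"
}

# Made-up sample data, deliberately unsorted and with a repeated id.
workdir=$(mktemp -d)
printf '%s\n' 7 3 100      > "$workdir/nodes_to_keep"
printf '%s\n' 42 3 10 42 7 > "$workdir/nodes_to_delete"

result=$(awk_difference "$workdir/nodes_to_keep" "$workdir/nodes_to_delete")
echo "$result"
rm -rf "$workdir"
```

Unlike the sort-based approaches, the output keeps the order of the second file and keeps repeated ids. One caveat: if the first file is empty, NR == FNR also holds for the second file, so nothing is printed.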

Answer 5

Answered by Dark Castle

Maybe you need a better way to do it in Postgres. I would bet that you won't find a faster way to do it using flat files. You should be able to do a simple inner join, and assuming both id columns are indexed, that should be very fast.

Answer 6

Answered by YenForYang

So, this is slightly different from the other answers. I can't say that a C++ compiler is exactly a "Linux CLI tool", but running g++ -O3 -march=native -o set_diff main.cpp (with the code below in main.cpp) can do the trick:

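#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_set>

using namespace std;

int main(int argc, char** argv) {
    if (argc < 3) {
        cerr << "usage: " << argv[0] << " FILE1 FILE2\n";
        return 1;
    }
    ifstream keep_file(argv[1]), del_file(argv[2]);
    // Load every whitespace-separated token of the first file into the set.
    unordered_multiset<string> init_lines{istream_iterator<string>(keep_file), istream_iterator<string>()};
    string line;
    // Read the second file one line at a time, erasing all matching entries.
    while (getline(del_file, line)) {
        init_lines.erase(line);
    }
    // Print whatever is left, in unspecified order.
    copy(init_lines.begin(), init_lines.end(), ostream_iterator<string>(cout, "\n"));
}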

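To use it, simply run set_diff B A (not A B, since B is nodes_to_keep) and the resulting difference will be printed to stdout. As written, the code loads its first argument into the set and erases every line found in its second argument, so the output is the entries of the first file that are absent from the second.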

Note that I've forgone a few C++ best practices to keep the code simpler.


Many additional speed optimizations could be made (at the price of more memory). mmap would also be particularly useful for large data sets, but that would make the code much more involved.

Since you mentioned that the data sets are large, I thought that reading nodes_to_delete a line at a time might be a good idea to reduce memory consumption. The approach taken in the code above isn't particularly efficient if there are lots of duplicates in your nodes_to_delete. Also, order is not preserved.

Something easier to copy and paste into bash (i.e. skipping the creation of main.cpp):

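g++ -O3 -march=native -xc++ -o set_diff - <<'EOF'
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_set>

using namespace std;

int main(int argc, char** argv) {
        if (argc < 3) {
                cerr << "usage: " << argv[0] << " FILE1 FILE2\n";
                return 1;
        }
        ifstream keep_file(argv[1]), del_file(argv[2]);
        // Load every whitespace-separated token of the first file into the set.
        unordered_multiset<string> init_lines{istream_iterator<string>(keep_file), istream_iterator<string>()};
        string line;
        // Read the second file one line at a time, erasing all matching entries.
        while (getline(del_file, line)) {
                init_lines.erase(line);
        }
        // Print whatever is left, in unspecified order.
        copy(init_lines.begin(), init_lines.end(), ostream_iterator<string>(cout, "\n"));
}
EOF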
