bash: Find duplicate records in a large text file

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/16347775/

Find duplicate records in large text file

python, linux, bash, shell

Asked by Justin Kredible

I'm on a linux machine (Redhat) and I have an 11GB text file. Each line in the text file contains data for a single record and the first n characters of the line contains a unique identifier for the record. The file contains a little over 27 million records.

I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file so any solution that requires loading the entire file into memory would not be practical.

Accepted answer by Roland Smith

Read the file line-by-line, so you don't have to load it all into memory.

For each line (record) create a sha256 hash (32 bytes), unless your identifier is shorter.

Store the hashes/identifiers in a numpy.array. That is probably the most compact way to store them. 27 million records times 32 bytes/hash is 864 MB. That should fit into the memory of a decent machine these days.
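
For reference, a minimal sketch of that numpy variant might look like this (the file name 'bigdata.txt' is an assumption, and duplicates are only looked for once all hashes have been collected):

import hashlib
import numpy as np

# Sketch: collect one 32-byte sha256 digest per record, then let numpy
# find digests that occur more than once.
digests = bytearray()
with open('bigdata.txt', 'rb') as datafile:
    for line in datafile:
        digests += hashlib.sha256(line).digest()

arr = np.frombuffer(bytes(digests), dtype=np.uint8).reshape(-1, 32)  # one row per record
_, counts = np.unique(arr, axis=0, return_counts=True)
print("duplicated records:", int((counts > 1).sum()))

Note that np.unique sorts a copy of the array internally, so expect the peak memory use to be somewhat above the 864 MB estimate.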

To speed up access you could use e.g. the first 2 bytes of the hash as the key of a collections.defaultdict and put the rest of the hashes in a list in the value. This would in effect create a hash table with 65536 buckets. For 27e6 records, each bucket would contain on average a list of around 400 entries. It would mean faster searching than a numpy array, but it would use more memory.

import collections
import hashlib

d = collections.defaultdict(list)
with open('bigdata.txt', 'rb') as datafile:   # binary mode: sha256 needs bytes
    for line in datafile:
        rec_id = hashlib.sha256(line).digest()
        # Or rec_id = line[:n] to use the identifier itself
        k = rec_id[0:2]   # first two bytes select one of 65536 buckets
        v = rec_id[2:]
        if v in d[k]:
            print("double found:", rec_id)
        else:
            d[k].append(v)

Answered by 9000

Right tool for the job: put your records into a database. Unless you already have a Postgres or MySQL installation handy, I'd take sqlite.

$ sqlite3 uniqueness.sqlite
create table chk (
  ident char(n), -- n as in first n characters
  lineno integer -- for convenience
);
^D

Then I'd insert the unique identifier and line number into that table, possibly using a Python script like this:

import sqlite3  # part of the Python standard library

n = ...  # how many chars are in the key part
lineno = 0

conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
with open("giant-file") as infile:
    for line in infile:
        lineno += 1
        ident = line[:n]
        cur.execute("insert into chk(ident, lineno) values(?, ?)", [ident, lineno])
conn.commit()  # without this the inserts are lost when the connection closes
cur.close()
conn.close()

After this, you can index the table and use SQL:

$ sqlite3 uniqueness.sqlite
create index x_ident on chk(ident); -- may take a bit of time

-- quickly find duplicates, if any
select ident, count(ident) as how_many
from chk
group by ident
having count(ident) > 1;

-- find lines of specific violations, if needed
select lineno 
from chk
where ident = ...; -- insert a duplicate ident

Yes, I tried most of this code, it should work :)

Answered by eandersson

I would never recommend that you try to filter such a massive text file in Python. No matter how you tackle it, you will need to go through some complicated steps to make sure that you do not run out of memory.

The first thing that comes to mind is creating a hash of the lines and then using the hash to find duplicates. Since you save the line number as well, you can then directly compare the text to make sure that there are no hash collisions.
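
A rough sketch of that idea might look like this (the file name and the use of whole-line sha256 hashes are assumptions; suspect lines are re-read and compared as text, so a hash collision cannot produce a false report):

import hashlib
from collections import defaultdict

positions = defaultdict(list)        # digest -> line numbers where it was seen
with open('bigdata.txt', 'rb') as f:
    for lineno, line in enumerate(f, start=1):
        positions[hashlib.sha256(line).digest()].append(lineno)

# Line numbers whose hash occurred more than once are only candidate duplicates.
suspect = {n for nums in positions.values() if len(nums) > 1 for n in nums}

seen = {}                            # actual line text -> first line number
with open('bigdata.txt', 'rb') as f:
    for lineno, line in enumerate(f, start=1):
        if lineno in suspect:
            if line in seen:
                print("duplicate record at lines", seen[line], "and", lineno)
            else:
                seen[line] = lineno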

But the easiest solution would be to convert the text file into a database that allows you to quickly sort, search, and filter out duplicate items. You can then re-create the text file from it if that is really a requirement.
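
As a hedged sketch of that last step (the database file, table, and column names below are illustrative, not taken from any answer here), keeping one line per identifier and writing the result back out could look like:

import sqlite3

# Hypothetical schema: records(ident TEXT, line TEXT), already populated
# from the original text file.
conn = sqlite3.connect("records.sqlite")
with open("deduplicated.txt", "w") as out:
    # keep one (arbitrary) line per identifier
    for (line,) in conn.execute("select min(line) from records group by ident"):
        out.write(line)
conn.close()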

Answered by 0xhughes

This is related to an earlier question: Read large text files in Python, line by line, without loading them into memory.

The answer to that question was this,

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)

Perhaps that will help you somehow. Good luck.

Answered by John

Assuming I couldn't use a database, I'd try something like

# read the file one line at a time http://stackoverflow.com/a/6475407/322909,
# be sure to read the comments

keys = set()

with open("bigfile.txt") as f:
    for line in f:
        key = get_key(line)  # e.g. key = line[:n], the unique identifier
        if key in keys:
            print("dup")
        else:
            keys.add(key)

Answered by rzymek

Try this:

n=...   # size of the unique identifier
cat 11gb_file | cut -c-$n | sort | uniq -cd

This will output any duplicate identifiers and how many times they appeared.
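
For illustration, with a hypothetical identifier length of 7 and made-up data, the output would look something like:

      2 ABC0001
      3 XYZ0042

uniq -cd prints one line per repeated identifier, prefixed with the number of times it occurred.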

Answered by tink

I haven't tried this on a file quite that large, but ... assuming that the fixed width of the n identifier characters is 7, and that no line is longer than 999+7 characters, this might do the job:

# FIELDWIDTHS is a gawk extension; $1 is the 7-character identifier field
awk 'BEGIN{FIELDWIDTHS="7 999"} !a[$1]++' file > newfile