Python script to find duplicates in a csv file

Disclaimer: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4095523/

Script to find duplicates in a csv file

Tags: bash, python, perl

Asked by

I have a 40 MB csv file with 50,000 records. It's a giant product listing. Each row has close to 20 fields. [Item#, UPC, Desc, etc.]

How can I,

a) Find and print duplicate rows. [This file was built by appending several files, so it contains multiple header rows that I need to remove, which is why I first want to know exactly which rows are duplicated.]

b) Find and print duplicate rows based on a single column. [To see if a UPC is assigned to multiple products.]

I need to run the command or script on the server, where I have Perl and Python installed. A bash script or command would work for me too.

I don't need to preserve the order of the rows.

I tried,

sort largefile.csv | uniq -d

to get the duplicates, but I am not getting the expected answer.

Ideally I would like a bash script or command, but if anyone has any other suggestion, that would be great too.

Thanks



See: Remove duplicate rows from a large file in Python over on Stack Overflow

Answered by Benoit

You could possibly use the SQLite shell to import your csv file and create indexes so that SQL queries run faster.

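Not part of the original answer, but here is a rough sketch of that idea using Python's built-in sqlite3 module instead of the SQLite shell. The file name and the three-column layout (item, UPC, description) are assumptions made for the example:

import csv
import sqlite3

# Load the csv into an in-memory SQLite table, index the UPC column,
# then ask SQL for UPCs that appear more than once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (item TEXT, upc TEXT, descr TEXT)")

with open("largefile.csv", newline="") as fh:
    reader = csv.reader(fh)
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                     (row[:3] for row in reader))

conn.execute("CREATE INDEX idx_upc ON products (upc)")

# UPCs assigned to more than one product (part b of the question)
query = """SELECT upc, COUNT(*) FROM products
           GROUP BY upc HAVING COUNT(*) > 1"""
for upc, count in conn.execute(query):
    print(upc, count)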

Answered by mob

Find and print duplicate rows in Perl:

perl -ne 'print if $SEEN{$_}++' < input-file

Find and print rows with a duplicated column value in Perl -- say the 5th column, where fields are separated by commas:

perl -F/,/ -ane 'print if $SEEN{$F[4]}++' < input-file
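
The same idea ported to Python, in case the one-liners need to be adapted (this is not part of the original answer; the file name and the column index are assumptions):

import csv

seen_rows = set()
seen_keys = set()

with open("largefile.csv", newline="") as fh:
    for row in csv.reader(fh):
        # a) whole-row duplicates
        key = tuple(row)
        if key in seen_rows:
            print("duplicate row:", ",".join(row))
        seen_rows.add(key)

        # b) duplicates in the 5th column (index 4)
        if len(row) > 4:
            if row[4] in seen_keys:
                print("duplicate in column 5:", ",".join(row))
            seen_keys.add(row[4])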

Answered by MkV

For the second part: read the file with Text::CSV into a hash keyed on your unique key(s), and check whether a value already exists in the hash before adding it. Something like this:

Data (which doesn't need to be sorted); in this example we need the first two columns to be unique:

1142,X426,Name1,Thing1
1142,X426,Name2,Thing2
1142,X426,Name3,Thing3
1142,X426,Name4,Thing4
1144,X427,Name5,Thing5
1144,X427,Name6,Thing6
1144,X427,Name7,Thing7
1144,X427,Name8,Thing8

code:

use strict;
use warnings;
use Text::CSV;

my %data;
my %dupes;
my @rows;
my $csv = Text::CSV->new ()
                        or die "Cannot use CSV: ".Text::CSV->error_diag ();

open my $fh, "<", "data.csv" or die "data.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
    # insert row into row list  
    push @rows, $row;
    # join the unique keys with the
    # perl 'multidimensional array emulation' 
    # subscript  character
    my $key = join( $;, @{$row}[0,1] ); 
    # if it was just one field, just use
    # my $key = $row->[$keyfieldindex];
    # if you were checking for full line duplicates (header lines):
    # my $key = join($;, @$row);
    # if %data has an entry for the record, add it to dupes
    if (exists $data{$key}) { # duplicate 
        # if it isn't already duplicated
        # add this row and the original 
        if (not exists $dupes{$key}) {
            push @{$dupes{$key}}, $data{$key};
        }
        # add the duplicate row
        push @{$dupes{$key}}, $row;
    } else {
        $data{ $key } = $row;
    }
}

$csv->eof or $csv->error_diag();
close $fh;
# print out duplicates:
warn "Duplicate Values:\n";
warn "-----------------\n";
foreach my $key (keys %dupes) {
    my @keys = split($;, $key);
    warn "Key: @keys\n";
    foreach my $dupe (@{$dupes{$key}}) {
        warn "\tData: @$dupe\n";
    }
}

Which prints out something like this:

Duplicate Values:
-----------------
Key: 1142 X426
    Data: 1142 X426 Name1 Thing1
    Data: 1142 X426 Name2 Thing2
    Data: 1142 X426 Name3 Thing3
    Data: 1142 X426 Name4 Thing4
Key: 1144 X427
    Data: 1144 X427 Name5 Thing5
    Data: 1144 X427 Name6 Thing6
    Data: 1144 X427 Name7 Thing7
    Data: 1144 X427 Name8 Thing8
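
For comparison, a rough Python port of the same approach (not part of the original answer), grouping rows by the first two columns as in the sample data above:

import csv
from collections import defaultdict

# Group rows by the composite key (first two columns); any group with
# more than one row is a set of duplicates, including the first occurrence.
groups = defaultdict(list)

with open("data.csv", newline="") as fh:
    for row in csv.reader(fh):
        groups[(row[0], row[1])].append(row)

print("Duplicate Values:")
print("-----------------")
for key, rows in groups.items():
    if len(rows) > 1:
        print("Key:", *key)
        for row in rows:
            print("\tData:", *row)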

Answered by Morten

Try the following:

# Sort before using the uniq command
sort largefile.csv | uniq -d

uniq is a very basic command and only reports duplicates that are adjacent to each other, which is why the input has to be sorted first.

Answered by RousseauAlexandre

Here is my (very simple) script to do it with Ruby and the Rake gem.

First create a Rakefile and write this code:

namespace :csv do
  desc "find duplicates from CSV file on given column"
  task :double, [:file, :column] do |t, args|
    args.with_defaults(column: 0)
    values = []
    index  = args.column.to_i
    # parse given file row by row
    File.foreach(args.file) do |line|
      # get value of the given column
      values << line.split(';')[index]
    end
    # compare length with & without uniq method 
    puts values.uniq.length == values.length ? "File does not contain duplicates" : "File contains duplicates"
  end
end

Then, to use it on the first column:

$ rake csv:double["2017.04.07-Export.csv"] 
File does not contain duplicates

And to use it on the second column (for example):

$ rake csv:double["2017.04.07-Export.csv",1] 
File contains duplicates