Python script to find duplicates in a csv file

Disclaimer: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4095523/

Script to find duplicates in a csv file

Tags: bash, python, perl

Asked by

I have a 40 MB csv file with 50,000 records. It's a giant product listing. Each row has close to 20 fields. [Item#, UPC, Desc, etc.]

How can I,

a) Find and print duplicate rows. [This file was built by appending several files, so it contains multiple header rows that I need to remove, which is why I first want to know exactly which rows are duplicated.]

b) Find and print duplicate rows based on a single column. [To see if a UPC is assigned to multiple products.]

I need to run the command or script on the server, where I have Perl and Python installed. A bash script or command would work for me too.

I don't need to preserve the order of the rows.

I tried,

sort largefile.csv | uniq -d

to get the duplicates, but I am not getting the expected answer.

Ideally I would like a bash script or command, but if anyone has any other suggestion, that would be great too.

Thanks



See: Remove duplicate rows from a large file in Python over on Stack Overflow

Answered by Benoit

You could possibly use the SQLite shell to import your csv file and create indexes so that SQL queries run faster.

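Not part of the original answer, but here is a rough sketch of that idea using Python's built-in sqlite3 module instead of the SQLite shell. The file name and the three-column layout (item, UPC, description) are assumptions made for the example:

import csv
import sqlite3

# Load the csv into an in-memory SQLite table, index the UPC column,
# then ask SQL for UPCs that appear more than once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (item TEXT, upc TEXT, descr TEXT)")

with open("largefile.csv", newline="") as fh:
    reader = csv.reader(fh)
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                     (row[:3] for row in reader))

conn.execute("CREATE INDEX idx_upc ON products (upc)")

# UPCs assigned to more than one product (part b of the question)
query = """SELECT upc, COUNT(*) FROM products
           GROUP BY upc HAVING COUNT(*) > 1"""
for upc, count in conn.execute(query):
    print(upc, count)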

Answered by mob

Find and print duplicate rows in Perl:

perl -ne 'print if $SEEN{$_}++' < input-file

Find and print rows with a duplicated column value in Perl -- say the 5th column, where fields are separated by commas:

perl -F/,/ -ane 'print if $SEEN{$F[4]}++' < input-file
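
The same idea ported to Python, in case the one-liners need to be adapted (this is not part of the original answer; the file name and the column index are assumptions):

import csv

seen_rows = set()
seen_keys = set()

with open("largefile.csv", newline="") as fh:
    for row in csv.reader(fh):
        # a) whole-row duplicates
        key = tuple(row)
        if key in seen_rows:
            print("duplicate row:", ",".join(row))
        seen_rows.add(key)

        # b) duplicates in the 5th column (index 4)
        if len(row) > 4:
            if row[4] in seen_keys:
                print("duplicate in column 5:", ",".join(row))
            seen_keys.add(row[4])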

Answered by MkV

For the second part: read the file with Text::CSV into a hash keyed on your unique key(s), and check whether a value already exists in the hash before adding it. Something like this:

Data (which doesn't need to be sorted); in this example we need the first two columns to be unique:

1142,X426,Name1,Thing1
1142,X426,Name2,Thing2
1142,X426,Name3,Thing3
1142,X426,Name4,Thing4
1144,X427,Name5,Thing5
1144,X427,Name6,Thing6
1144,X427,Name7,Thing7
1144,X427,Name8,Thing8

code:

use strict;
use warnings;
use Text::CSV;

my %data;
my %dupes;
my @rows;
my $csv = Text::CSV->new ()
                        or die "Cannot use CSV: ".Text::CSV->error_diag ();

open my $fh, "<", "data.csv" or die "data.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
    # insert row into row list  
    push @rows, $row;
    # join the unique keys with the
    # perl 'multidimensional array emulation' 
    # subscript  character
    my $key = join( $;, @{$row}[0,1] ); 
    # if it was just one field, just use
    # my $key = $row->[$keyfieldindex];
    # if you were checking for full line duplicates (header lines):
    # my $key = join($;, @$row);
    # if %data has an entry for the record, add it to dupes
    if (exists $data{$key}) { # duplicate 
        # if it isn't already duplicated
        # add this row and the original 
        if (not exists $dupes{$key}) {
            push @{$dupes{$key}}, $data{$key};
        }
        # add the duplicate row
        push @{$dupes{$key}}, $row;
    } else {
        $data{ $key } = $row;
    }
}

$csv->eof or $csv->error_diag();
close $fh;
# print out duplicates:
warn "Duplicate Values:\n";
warn "-----------------\n";
foreach my $key (keys %dupes) {
    my @keys = split($;, $key);
    warn "Key: @keys\n";
    foreach my $dupe (@{$dupes{$key}}) {
        warn "\tData: @$dupe\n";
    }
}

Which prints out something like this:

Duplicate Values:
-----------------
Key: 1142 X426
    Data: 1142 X426 Name1 Thing1
    Data: 1142 X426 Name2 Thing2
    Data: 1142 X426 Name3 Thing3
    Data: 1142 X426 Name4 Thing4
Key: 1144 X427
    Data: 1144 X427 Name5 Thing5
    Data: 1144 X427 Name6 Thing6
    Data: 1144 X427 Name7 Thing7
    Data: 1144 X427 Name8 Thing8
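
For comparison, a rough Python port of the same approach (not part of the original answer), grouping rows by the first two columns as in the sample data above:

import csv
from collections import defaultdict

# Group rows by the composite key (first two columns); any group with
# more than one row is a set of duplicates, including the first occurrence.
groups = defaultdict(list)

with open("data.csv", newline="") as fh:
    for row in csv.reader(fh):
        groups[(row[0], row[1])].append(row)

print("Duplicate Values:")
print("-----------------")
for key, rows in groups.items():
    if len(rows) > 1:
        print("Key:", *key)
        for row in rows:
            print("\tData:", *row)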

Answered by Morten

Try the following:

# Sort before using the uniq command
sort largefile.csv | uniq -d

uniq is a very basic command and only reports duplicates that are adjacent to each other, which is why the input has to be sorted first.

Answered by RousseauAlexandre

Here is my (very simple) script to do it with Ruby and the Rake gem.

First create a Rakefile and write this code:

namespace :csv do
  desc "find duplicates from CSV file on given column"
  task :double, [:file, :column] do |t, args|
    args.with_defaults(column: 0)
    values = []
    index  = args.column.to_i
    # parse given file row by row
    File.foreach(args.file) do |line|
      # get value of the given column
      values << line.split(';')[index]
    end
    # compare length with & without uniq method 
    puts values.uniq.length == values.length ? "File does not contain duplicates" : "File contains duplicates"
  end
end

Then, to use it on the first column:

$ rake csv:double["2017.04.07-Export.csv"] 
File does not contain duplicates

And to use it on the second column (for example):

$ rake csv:double["2017.04.07-Export.csv",1] 
File contains duplicates