Removing duplicate rows from a CSV file using a Python script
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/15741564/
Removing duplicate rows from a csv file using a python script
Asked by IcyFlame
Goal
I have downloaded a CSV file from hotmail, but it has a lot of duplicates in it. These duplicates are complete copies and I don't know why my phone created them.
I want to get rid of the duplicates.
Approach
Write a python script to remove duplicates.
Technical specification
- Windows XP SP 3
- Python 2.7
- CSV file with 400 contacts
Accepted answer by jamylak
UPDATE: 2016
If you are happy to use the helpful more_itertools external library:
from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))
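unique_everseen also accepts a key function, so you can deduplicate on selected columns rather than on the whole line. A minimal sketch, assuming a plain comma-separated file whose third field (index 2, a hypothetical choice here) should be unique; note that a naive split(',') ignores quoting, so fields containing commas would need the csv module instead:

from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    # keep only the first row seen for each value of the third column
    out_file.writelines(unique_everseen(f, key=lambda line: line.split(',')[2]))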
A more efficient version of @IcyFlame's solution
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
To edit the same file in place, you could use this:
import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue  # skip duplicate
    seen.add(line)
    print line,  # standard output is now redirected to the file
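The `print line,` statement is Python 2 syntax (matching the Python 2.7 in the question). A roughly equivalent sketch for Python 3, assuming the same 1.csv, uses the print function and inplace=True:

import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.input('1.csv', inplace=True):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print(line, end='')  # standard output is now redirected to the file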
Answer by IcyFlame
You can use the following script:
Pre-condition:
- 1.csv is the file that contains the duplicates.
- 2.csv is the output file that will be devoid of the duplicates once this script is executed.
Code
inFile = open('1.csv', 'r')
outFile = open('2.csv', 'w')
listLines = []
for line in inFile:
    if line in listLines:
        continue
    else:
        outFile.write(line)
        listLines.append(line)
outFile.close()
inFile.close()
Algorithm Explanation
Here, what I am doing is:
- Opening a file in read mode. This is the file that has the duplicates.
- Then, in a loop that runs until the file is over, we check whether the line has already been encountered.
- If it has been encountered, we don't write it to the output file.
- If not, we write it to the output file and add it to the list of records that have been encountered already.
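A quick way to sanity-check the script is to run it on a tiny file first. The sample rows below are made up for illustration, not taken from the original contacts file:

# build a small test input containing an exact duplicate row
with open('1.csv', 'w') as f:
    f.write('Name,Email\n')
    f.write('Alice,alice@example.com\n')
    f.write('Alice,alice@example.com\n')  # duplicate of the previous line
    f.write('Bob,bob@example.com\n')

# after running the script above, 2.csv should contain three lines:
# Name,Email
# Alice,alice@example.com
# Bob,bob@example.com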
Answer by Ahmed Abdelkafi
A more efficient version of @jamylak's solution (with one less instruction):
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen:
            seen.add(line)
            out_file.write(line)
To edit the same file in place, you could use this:
import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line,  # standard output is now redirected to the file
Answer by Andrei Sura
You can achieve deduplication efficiently using pandas:
import pandas as pd
file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"
df = pd.read_csv(file_name, sep=",")  # use sep="\t" instead if the file is tab-separated
# Notes:
# - the `subset=None` means that every column is used
# to determine if two rows are different; to change that specify
# the columns as an array
# - the `inplace=True` means that the data structure is changed and
# the duplicate rows are gone
df.drop_duplicates(subset=None, inplace=True)
# Write the results to a different file
df.to_csv(file_name_output)
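If only certain columns should decide whether two rows are duplicates, or if you do not want pandas to add its row index as an extra column in the output, a variation like the following should work (the 'Email' column name is just an illustrative placeholder, not a column from the original file):

import pandas as pd

df = pd.read_csv("my_file_with_dupes.csv")
# treat rows as duplicates when the 'Email' column matches, keeping the first occurrence
df = df.drop_duplicates(subset=["Email"], keep="first")
# index=False stops pandas from writing the row index as an extra column
df.to_csv("my_file_without_dupes.csv", index=False)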
Answer by Dulangi_Kanchana
You can do this using the pandas library in a Jupyter notebook or a relevant IDE. I am importing pandas into a Jupyter notebook and reading the CSV file.
Then sort the values according to the parameters on which duplicates are present; since I have defined two attributes, it will first sort by time and then by latitude.
Then remove the duplicates present in the time column, or in whichever column is relevant for you.
Then I store the deduplicated and sorted file as gps_sorted.
import pandas as pd

stock = pd.read_csv("C:/Users/Donuts/GPS Trajectory/go_track_trackspoints.csv")
stock2 = stock.sort_values(["time", "latitude"], ascending=True)
stock2 = stock2.drop_duplicates(subset=['time'])  # assign the result, otherwise the deduplication is lost
stock2.to_csv("C:/Users/Donuts/gps_sorted.csv")
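Because the data is sorted beforehand, the `keep` argument of drop_duplicates lets you choose which of the duplicated rows survives. A small sketch continuing from the variables above, assuming the same column names:

# keep the last record for each timestamp instead of the default first one
stock2 = stock2.drop_duplicates(subset=['time'], keep='last')
stock2.to_csv("C:/Users/Donuts/gps_sorted.csv", index=False)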
Hope this helps
Answer by Ongati Felix
I know this is long settled, but I had a closely related problem whereby I needed to remove duplicates based on one column. The input CSV file was too large to be opened on my PC by MS Excel/LibreOffice Calc/Google Sheets: 147 MB with about 2.5 million records. Since I did not want to install a whole external library for such a simple thing, I wrote the Python script below to do the job in less than 5 minutes. I didn't focus on optimization, but I believe it can be optimized to run faster and more efficiently for even bigger files. The algorithm is similar to @IcyFlame's above, except that I am removing duplicates based on a column ('CCC') instead of the whole row/line.
import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'a') as outfile:
    # this list will hold the unique ccc numbers
    ccc_numbers = []
    # read the input file into dictionaries; there were some null bytes in the infile
    results = csv.DictReader(infile)
    writer = csv.writer(outfile)
    # write column headers to the output file
    writer.writerow(
        ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
    )
    for result in results:
        ccc_number = result.get('CCC')
        # if the value already exists in the list, skip writing the whole row to the output file
        if ccc_number in ccc_numbers:
            continue
        writer.writerow([
            result.get('ID'),
            ccc_number,
            result.get('MFLCode'),
            result.get('datecollected'),
            result.get('DateTested'),
            result.get('Result'),
            result.get('Justification')
        ])
        # add the value to the list so it will be skipped subsequently
        ccc_numbers.append(ccc_number)
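One straightforward optimization for a file of this size: membership tests on a Python list are O(n), so with roughly 2.5 million records the `ccc_number in ccc_numbers` check gets slower as the list grows. Using a set instead keeps lookups at amortized O(1). A condensed sketch of the same approach with that change (column names copied from the code above):

import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'w') as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification'])
    seen_ccc = set()  # set instead of list: amortized O(1) membership checks
    for row in reader:
        ccc_number = row.get('CCC')
        if ccc_number in seen_ccc:
            continue  # this CCC value has already been written
        seen_ccc.add(ccc_number)
        writer.writerow([row.get('ID'), ccc_number, row.get('MFLCode'),
                         row.get('datecollected'), row.get('DateTested'),
                         row.get('Result'), row.get('Justification')])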

