Removing duplicate rows from a CSV file using a Python script
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/15741564/
Removing duplicate rows from a csv file using a python script
Asked by IcyFlame
Goal
I have downloaded a CSV file from hotmail, but it has a lot of duplicates in it. These duplicates are complete copies and I don't know why my phone created them.
I want to get rid of the duplicates.
Approach
Write a python script to remove duplicates.
Technical specification
- Windows XP SP 3
- Python 2.7
- CSV file with 400 contacts
Accepted answer by jamylak
UPDATE: 2016
If you are happy to use the helpful more_itertools external library:
from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))
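unique_everseen also accepts a key function, so you can deduplicate on selected columns rather than on the whole line. A minimal sketch, assuming a plain comma-separated file whose third field (index 2, a hypothetical choice here) should be unique; note that a naive split(',') ignores quoting, so fields containing commas would need the csv module instead:

from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    # keep only the first row seen for each value of the third column
    out_file.writelines(unique_everseen(f, key=lambda line: line.split(',')[2]))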
A more efficient version of @IcyFlame's solution
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
To edit the same file in place, you could use this:
import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue  # skip duplicate
    seen.add(line)
    print line,  # standard output is now redirected to the file
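The `print line,` statement is Python 2 syntax (matching the Python 2.7 in the question). A roughly equivalent sketch for Python 3, assuming the same 1.csv, uses the print function and inplace=True:

import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.input('1.csv', inplace=True):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print(line, end='')  # standard output is now redirected to the file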
Answer by IcyFlame
You can use the following script:
Pre-condition:
- 1.csv is the file that contains the duplicates.
- 2.csv is the output file that will be devoid of the duplicates once this script is executed.
Code
inFile = open('1.csv', 'r')
outFile = open('2.csv', 'w')
listLines = []
for line in inFile:
    if line in listLines:
        continue
    else:
        outFile.write(line)
        listLines.append(line)
outFile.close()
inFile.close()
Algorithm Explanation
Here, what I am doing is:
- Opening a file in read mode. This is the file that has the duplicates.
- Then, in a loop that runs until the file is over, we check whether the line has already been encountered.
- If it has been encountered, we don't write it to the output file.
- If not, we write it to the output file and add it to the list of records that have been encountered already.
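A quick way to sanity-check the script is to run it on a tiny file first. The sample rows below are made up for illustration, not taken from the original contacts file:

# build a small test input containing an exact duplicate row
with open('1.csv', 'w') as f:
    f.write('Name,Email\n')
    f.write('Alice,alice@example.com\n')
    f.write('Alice,alice@example.com\n')  # duplicate of the previous line
    f.write('Bob,bob@example.com\n')

# after running the script above, 2.csv should contain three lines:
# Name,Email
# Alice,alice@example.com
# Bob,bob@example.com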
Answer by Ahmed Abdelkafi
A more efficient version of @jamylak's solution (with one less instruction):
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen:
            seen.add(line)
            out_file.write(line)
To edit the same file in place, you could use this:
import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line,  # standard output is now redirected to the file
Answer by Andrei Sura
You can achieve deduplication efficiently using pandas:
import pandas as pd
file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"
df = pd.read_csv(file_name, sep=",")  # use sep="\t" instead if the file is tab-separated
# Notes:
# - the `subset=None` means that every column is used
# to determine if two rows are different; to change that specify
# the columns as an array
# - the `inplace=True` means that the data structure is changed and
# the duplicate rows are gone
df.drop_duplicates(subset=None, inplace=True)
# Write the results to a different file
df.to_csv(file_name_output)
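If only certain columns should decide whether two rows are duplicates, or if you do not want pandas to add its row index as an extra column in the output, a variation like the following should work (the 'Email' column name is just an illustrative placeholder, not a column from the original file):

import pandas as pd

df = pd.read_csv("my_file_with_dupes.csv")
# treat rows as duplicates when the 'Email' column matches, keeping the first occurrence
df = df.drop_duplicates(subset=["Email"], keep="first")
# index=False stops pandas from writing the row index as an extra column
df.to_csv("my_file_without_dupes.csv", index=False)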
Answer by Dulangi_Kanchana
You can do this using the pandas library in a Jupyter notebook or a relevant IDE. I am importing pandas into a Jupyter notebook and reading the CSV file.
Then sort the values according to the parameters on which duplicates are present; since I have defined two attributes, it will first sort by time and then by latitude.
Then remove the duplicates present in the time column, or in whichever column is relevant for you.
Then I store the deduplicated and sorted file as gps_sorted.
import pandas as pd

stock = pd.read_csv("C:/Users/Donuts/GPS Trajectory/go_track_trackspoints.csv")
stock2 = stock.sort_values(["time", "latitude"], ascending=True)
stock2 = stock2.drop_duplicates(subset=['time'])  # assign the result, otherwise the deduplication is lost
stock2.to_csv("C:/Users/Donuts/gps_sorted.csv")
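Because the data is sorted beforehand, the `keep` argument of drop_duplicates lets you choose which of the duplicated rows survives. A small sketch continuing from the variables above, assuming the same column names:

# keep the last record for each timestamp instead of the default first one
stock2 = stock2.drop_duplicates(subset=['time'], keep='last')
stock2.to_csv("C:/Users/Donuts/gps_sorted.csv", index=False)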
Hope this helps
Answer by Ongati Felix
I know this is long settled, but I had a closely related problem whereby I needed to remove duplicates based on one column. The input CSV file was too large to be opened on my PC by MS Excel/LibreOffice Calc/Google Sheets: 147 MB with about 2.5 million records. Since I did not want to install a whole external library for such a simple thing, I wrote the Python script below to do the job in less than 5 minutes. I didn't focus on optimization, but I believe it can be optimized to run faster and more efficiently for even bigger files. The algorithm is similar to @IcyFlame's above, except that I am removing duplicates based on a column ('CCC') instead of the whole row/line.
import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'a') as outfile:
    # this list will hold the unique ccc numbers
    ccc_numbers = []
    # read the input file into dictionaries; there were some null bytes in the infile
    results = csv.DictReader(infile)
    writer = csv.writer(outfile)
    # write column headers to the output file
    writer.writerow(
        ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
    )
    for result in results:
        ccc_number = result.get('CCC')
        # if the value already exists in the list, skip writing the whole row to the output file
        if ccc_number in ccc_numbers:
            continue
        writer.writerow([
            result.get('ID'),
            ccc_number,
            result.get('MFLCode'),
            result.get('datecollected'),
            result.get('DateTested'),
            result.get('Result'),
            result.get('Justification')
        ])
        # add the value to the list so it will be skipped subsequently
        ccc_numbers.append(ccc_number)
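One straightforward optimization for a file of this size: membership tests on a Python list are O(n), so with roughly 2.5 million records the `ccc_number in ccc_numbers` check gets slower as the list grows. Using a set instead keeps lookups at amortized O(1). A condensed sketch of the same approach with that change (column names copied from the code above):

import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'w') as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification'])
    seen_ccc = set()  # set instead of list: amortized O(1) membership checks
    for row in reader:
        ccc_number = row.get('CCC')
        if ccc_number in seen_ccc:
            continue  # this CCC value has already been written
        seen_ccc.add(ccc_number)
        writer.writerow([row.get('ID'), ccc_number, row.get('MFLCode'),
                         row.get('datecollected'), row.get('DateTested'),
                         row.get('Result'), row.get('Justification')])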

