How to efficiently remove all duplicates from a dataframe or CSV file in Python (pandas)?
Disclaimer: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/22866406/
Asked by Space
I have the following table contained in mytest.csv:
timestamp val1 val2 user_id val3 val4 val5 val6
01/01/2011 1 100 3 5 100 3 5
01/02/2013 20 8 6 12 15 3
01/07/2012 19 57 10 9 6 6
01/11/2014 3100 49 6 12 15 3
21/12/2012 240 30 240 30
01/12/2013 63
01/12/2013 3200 51 63 50
The table above was obtained with the following code, in which I tried to remove all duplicates (based on 'timestamp' and 'user_id'), but unfortunately some remained:
import pandas as pd
newnames = ['timestamp', 'val1', 'val2', 'val3', 'val4', 'val5', 'val6', 'user_id']
# header=None (not header=False) tells read_csv the file has no header row
df = pd.read_csv('mytest.csv', names=newnames, header=None, parse_dates=True, dayfirst=True)
# parse_dates=True only applies to the index, so parse the column explicitly
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)
df = df.loc[:, ['timestamp', 'user_id', 'val1', 'val2', 'val3', 'val4', 'val5', 'val6']]
df_clean = df.drop_duplicates().fillna(0)
Also, I would like to know how I can efficiently remove all duplicates from the data (pre-processing), and whether I should do this before reading it into a dataframe. For example, the last two rows are considered duplicates, and only the last one, which does not contain an empty val1 (val1 = 3200), should remain in the dataframe.
Thanks in advance for your help.
Answered by joris
If you want to drop duplicates based on specific columns, you can use the subset argument (called cols in older pandas versions) of drop_duplicates:
df_clean = df.drop_duplicates(subset=['timestamp', 'user_id'])
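By default this keeps the first occurrence of each (timestamp, user_id) pair (keep='last' keeps the last instead; very old pandas versions used take_last=True). For the asker's other requirement, retaining the most complete duplicate such as the row with val1 = 3200, a minimal sketch (not part of the original answer) is to sort rows by their number of missing values before dropping duplicates; n_missing is a hypothetical helper column added only for this purpose:
import pandas as pd
newnames = ['timestamp', 'val1', 'val2', 'val3', 'val4', 'val5', 'val6', 'user_id']
df = pd.read_csv('mytest.csv', names=newnames, header=None, parse_dates=['timestamp'], dayfirst=True)
# Count NaNs per row, sort so the most complete rows come first, then keep
# the first occurrence of each (timestamp, user_id) pair.
df['n_missing'] = df.isnull().sum(axis=1)
df_clean = (df.sort_values('n_missing')
              .drop_duplicates(subset=['timestamp', 'user_id'], keep='first')
              .drop('n_missing', axis=1))
Rows that tie on n_missing keep whichever sorts first, so add a secondary sort key if the original file order should break ties.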