Original question: http://stackoverflow.com/questions/25695878/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
DataFrame.drop_duplicates and DataFrame.drop not removing rows
Asked by user3123955
I have read a CSV into a pandas DataFrame with five columns. Certain rows have duplicate values only in the second column. I want to remove these rows from the DataFrame, but neither drop nor drop_duplicates is working.
Here is my implementation:
import pandas as pd
# Read CSV
df = pd.read_csv(data_path, header=0, names=['a', 'b', 'c', 'd', 'e'])
print(df.b)
dropRows = []
# Sanitize the data to get rid of duplicates
for indx, val in enumerate(df.b):  # for all the values
    if indx == 0:  # skip first indx
        continue
    if val == df.b[indx-1]:  # this is duplicate rtc value
        dropRows.append(indx)
print(dropRows)
df.drop(dropRows)  # this doesn't work
df.drop_duplicates('b')  # this doesn't work either
print(df.b)
When I print the series df.b before and after, both have the same length and I can still see the duplicates. Is there something wrong in my implementation?
Answered by Korem
As mentioned in the comments, drop and drop_duplicates each return a new DataFrame and leave the original untouched unless you pass inplace=True. Either of these approaches would work:
# Option 1: assign the returned DataFrame back to df
df = df.drop(dropRows)
df = df.drop_duplicates('b')
# Option 2: modify df in place
df.drop(dropRows, inplace=True)
df.drop_duplicates('b', inplace=True)
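To make the difference concrete, here is a minimal sketch on a small made-up frame (the data is hypothetical and stands in for the asker's CSV). It also illustrates something worth checking against your intent: the loop in the question only flags a row whose b equals the previous row's b, whereas drop_duplicates('b') removes every later repeat of a value, consecutive or not.

```python
import pandas as pd

# Hypothetical data standing in for the asker's CSV
df = pd.DataFrame({"a": [10, 11, 12, 13], "b": [1, 1, 2, 1]})

# Without assignment the returned frame is discarded and df is unchanged
df.drop_duplicates(subset="b")
unchanged_len = len(df)

# Assigning the result keeps the first row for each distinct value of b
deduped = df.drop_duplicates(subset="b")

# Dropping only consecutive repeats, as the question's loop does:
# keep a row when its b differs from the previous row's b
consecutive = df[df["b"] != df["b"].shift()]

print(unchanged_len)
print(deduped)
print(consecutive)
```

If only consecutive repeats should go, the shift-based filter is probably closer to what the original loop intended than drop_duplicates.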
Answered by johnecon
In my case the issue was that I was concatenating DataFrames whose columns had different types, so rows that looked identical were not treated as duplicates:
import pandas as pd
s1 = pd.DataFrame([['a', 1]], columns=['letter', 'code'])
s2 = pd.DataFrame([['a', '1']], columns=['letter', 'code'])
df = pd.concat([s1, s2])
df = df.reset_index(drop=True)
df.drop_duplicates(inplace=True)
# 2 rows
print(df)
# int
print(type(df.at[0, 'code']))
# string
print(type(df.at[1, 'code']))
# Fix:
df['code'] = df['code'].astype(str)
df.drop_duplicates(inplace=True)
# 1 row
print(df)
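If you suspect the same problem in a larger frame, one way to find the affected columns is to count the Python types stored in each one. This is only a sketch that reuses the two frames from the example above; the names type_counts and mixed are made up for illustration.

```python
import pandas as pd

s1 = pd.DataFrame([["a", 1]], columns=["letter", "code"])
s2 = pd.DataFrame([["a", "1"]], columns=["letter", "code"])
df = pd.concat([s1, s2], ignore_index=True)

# Number of distinct Python types stored in each column
type_counts = {col: len(set(df[col].map(type))) for col in df.columns}

# Columns holding more than one type are candidates for astype() before
# calling drop_duplicates
mixed = [col for col, n in type_counts.items() if n > 1]
print(mixed)
```

Any column listed in mixed can then be normalized, for example with astype(str) as in the fix above.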

