Python - remove duplicates from a dataframe in pyspark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31064243/


remove duplicates from a dataframe in pyspark

python · apache-spark · pyspark

Asked by Jared

I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the drop duplicates method to work. Keeps returning the error "AttributeError: 'list' object has no attribute 'dropDuplicates'". Not quite sure why as I seem to be following the syntax in the latest documentation. Seems like I am missing an import for that functionality or something.


#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

#dropping duplicates from the dataframe
df1.dropDuplicates().show()

Accepted answer by zero323

It is not an import problem. You simply call .dropDuplicates() on the wrong object. While the class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:


df1 = (sqlContext
    .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
    .dropDuplicates())

df1.collect()
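
For reference, a minimal end-to-end sketch of the same fix (using the file path and column names from the question; the forward-slash path and the final collect() are just illustrative assumptions). The point is to stay in DataFrame land and only collect results at the very end, if at all:

# read the CSV into an RDD of 4-tuples (no .collect() here)
rdd1 = sc.textFile("C:/myfilename.csv") \
         .map(lambda line: (line.split(",")[0], line.split(",")[1],
                            line.split(",")[2], line.split(",")[3]))

# build the DataFrame and de-dupe it while it is still a DataFrame
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
deduped = df1.dropDuplicates()   # drops fully identical rows

deduped.show()                   # works, because deduped is a DataFrame
rows = deduped.collect()         # only now pull results back to the driver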

Answer by Grant Shannon

If you have a data frame and want to remove all duplicates with respect to a specific column (called 'colName'):


count before dedupe:


df.count()

do the de-dupe (convert the column you are de-duping to string type):


from pyspark.sql.functions import col

# cast the column to string before de-duping
df = df.withColumn('colName', col('colName').cast('string'))

# drop rows that are duplicated on 'colName' and count what remains
df.drop_duplicates(subset=['colName']).count()
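
As a quick sanity check (a sketch assuming the same df and 'colName' as above), you can compare the row counts before and after the de-dupe:

before = df.count()
after = df.drop_duplicates(subset=['colName']).count()
print("removed %d duplicate rows" % (before - after))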

You can use a sorted groupBy to check that the duplicates have been removed:


df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
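
An alternative check that stays entirely in Spark (a sketch assuming the same df and 'colName'): after de-duping on 'colName', no group should have a count greater than 1, so the filtered result below should come back empty.

from pyspark.sql.functions import col

deduped = df.drop_duplicates(subset=['colName'])
(deduped.groupBy('colName')
        .count()
        .filter(col('count') > 1)
        .show())   # expect no rows if the de-dupe worked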