PySpark distinct().count() on a csv file (Python)
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/27987298/
PySpark distinct().count() on a csv file
Asked by dimzak
I'm new to Spark and I'm trying to do a distinct().count() based on some fields of a csv file.
CSV structure (without header):
id,country,type
01,AU,s1
02,AU,s2
03,GR,s2
03,GR,s2
To load the .csv I typed:
lines = sc.textFile("test.txt")
then a distinct count on lines returned 3 as expected:
lines.distinct().count()
But I have no idea how to do a distinct count based on, let's say, id and country.
Accepted answer by elyase
In this case you would select the columns you want to consider, and then count:
sc.textFile("test.txt")\
.map(lambda line: (line.split(',')[0], line.split(',')[1]))\
.distinct()\
.count()
This is for clarity; you can optimize the lambda to avoid calling line.split two times.
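For example (a minimal sketch, not code from the answer; it assumes the same test.txt and an existing SparkContext sc), the lambda can split each line once and slice the result:

# Sketch (assumption, not from the original answer): split each line only once,
# keep the first two fields, and convert them to a tuple so distinct() can hash them.
sc.textFile("test.txt")\
    .map(lambda line: tuple(line.split(',')[:2]))\
    .distinct()\
    .count()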
Answer by rami
The split can be optimized as follows:
sc.textFile("test.txt").map(lambda line: line.split(",")[:-1]).distinct().count()

