PySpark distinct().count() on a csv file (Python)

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original: http://stackoverflow.com/questions/27987298/


PySpark distinct().count() on a csv file

python, apache-spark, pyspark

Asked by dimzak

I'm new to spark and I'm trying to make a distinct().count() based on some fields of a csv file.


CSV structure (without header):


id,country,type
01,AU,s1
02,AU,s2
03,GR,s2
03,GR,s2

To load the .csv I typed:


lines = sc.textFile("test.txt")

then a distinct count on lines returned 3 as expected:


lines.distinct().count()

But I have no idea how to make a distinct count based on, let's say, id and country.


Accepted answer by elyase

In this case you would select the columns you want to consider, and then count:


sc.textFile("test.txt")\
  .map(lambda line: (line.split(',')[0], line.split(',')[1]))\
  .distinct()\
  .count()

This is written for clarity; you can optimize the lambda to avoid calling line.split twice.

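For example, one way to split each line only once (a minimal sketch, not from the original answer, assuming the same sc and test.txt as above) is to do the split in its own map step:

(sc.textFile("test.txt")
   .map(lambda line: line.split(','))            # split each line once
   .map(lambda fields: (fields[0], fields[1]))   # keep only id and country
   .distinct()
   .count())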

Answered by rami

The split line can be optimized as follows:


sc.textFile("test.txt").map(lambda line: tuple(line.split(",")[:-1])).distinct().count()
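For completeness, a self-contained sketch of the same computation (an assumption-laden example, not from the original answers: it presumes a local PySpark installation and the sample test.txt from the question). The tuple() call matters because distinct() hashes its elements, and plain lists are not hashable; [:-1] drops the last field (type), so the count is over id and country only. For the sample data this should print 3:

from pyspark import SparkContext

sc = SparkContext("local", "distinct-count")  # assumed local setup
count = (sc.textFile("test.txt")
           .map(lambda line: tuple(line.split(",")[:-1]))  # keep id and country as a tuple
           .distinct()
           .count())
print(count)  # 3 for the sample data
sc.stop()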