Scala: quickly get the number of records in a DataFrame

Question

Asked by thunderhemu

I have a DataFrame with as many as 10 million records. How can I get the count quickly? df.count is taking a very long time.

Answer 1

It is going to take a long time anyway, at least the first time.

One way is to cache the DataFrame, so that you can do more with it than just count.

For example:

df.cache()
df.count()

Subsequent operations don't take much time.
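
To make the caching idea concrete, here is a minimal sketch, assuming a local SparkSession and a synthetic DataFrame built with spark.range as a stand-in for the real data. The names CachedCount and countTwice are hypothetical and only for illustration. Note that cache() is lazy: it only marks the DataFrame, and the first count() still does the full scan while filling the cache.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CachedCount {
  // Hypothetical helper: counts the DataFrame twice and returns both results.
  // The first count() performs the full scan and populates the cache;
  // the second count() can then be answered from the cached data.
  def countTwice(df: DataFrame): (Long, Long) = {
    df.cache()              // lazy: nothing is computed yet
    val first = df.count()  // full scan, materializes the cache
    val second = df.count() // can reuse the cached data
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only
    val spark = SparkSession.builder()
      .appName("cached-count")
      .master("local[*]")
      .getOrCreate()

    // Synthetic stand-in for the 10-million-record DataFrame in the question
    val df = spark.range(0L, 10000000L).toDF("id")

    val (first, second) = countTwice(df)
    println(s"first=$first second=$second")

    df.unpersist() // release the cached data when done
    spark.stop()
  }
}
```

Caching only pays off if the DataFrame is reused afterwards; for a single count it adds the cost of storing the data without saving any work.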

Answer 2

file.groupBy("<column-name>").count().show()
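
Note that groupBy(...).count() returns one row per distinct value of the column, not a single total. The following is a rough sketch of how the per-group counts relate to the overall count, assuming a small made-up DataFrame with a column named key; GroupedCount and totalFromGroups are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object GroupedCount {
  // Hypothetical helper: sums the per-group counts to recover the total row count.
  def totalFromGroups(df: DataFrame, column: String): Long = {
    val perGroup = df.groupBy(column).count() // columns: <column>, count
    perGroup.agg(sum("count")).first().getLong(0)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only
    val spark = SparkSession.builder()
      .appName("grouped-count")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for `file`
    val file = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    file.groupBy("key").count().show()    // one row per key with its count
    println(totalFromGroups(file, "key")) // same value as file.count()

    spark.stop()
  }
}
```

This still scans all the data and adds a shuffle, so it is not expected to be faster than a plain df.count(); it is mainly useful when the per-group breakdown is what you actually need.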