Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/39357238/

Time: 2020-10-22 08:36:35  Source: igfitidea

Getting the count of records in a data frame quickly

Tags: scala, apache-spark, hadoop-streaming

Asked by thunderhemu

I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.

Answered by Ravi

It will take a long time either way, at least the first time: count() is an action, so Spark has to evaluate the whole dataframe to produce the result.

One option is to cache the dataframe, so that you can do more with it than just count.

For example:

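df.cache()  // lazy: only marks df to be cached, nothing is computed yet
df.count()  // first action: computes df and fills the cache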

Subsequent operations on df then read from the cache and take much less time.
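
To make the effect visible, here is a minimal sketch that times the first and second count on a cached dataframe. It assumes a local SparkSession and uses spark.range as a stand-in for the real 10-million-row data; the object name CachedCountSketch and the timed helper are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CachedCountSketch {
  // Hypothetical helper: runs `body` and returns its result with the elapsed milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; on a cluster the master is usually set by spark-submit.
    val spark = SparkSession.builder()
      .appName("cached-count-sketch")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for the real dataframe: 10 million generated rows.
    val df: DataFrame = spark.range(10000000L).toDF("id")

    df.cache() // lazy: only marks df for caching
    val (first, firstMs) = timed(df.count())   // computes df and fills the cache
    val (second, secondMs) = timed(df.count()) // reads from the cache

    println(s"first count:  $first rows in $firstMs ms")
    println(s"second count: $second rows in $secondMs ms")

    df.unpersist() // release the cached data when it is no longer needed
    spark.stop()
  }
}
```

The actual timings depend on the data source, the cluster, and how much memory is available for caching, so treat the numbers as indicative only.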

Answered by Saad Ahmed

file.groupBy("<column-name>").count().show()
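
Note that groupBy(...).count() returns one row per distinct value of the column, each with its own count, rather than a single total. The sketch below shows the difference on a tiny made-up dataset, where the "category" column stands in for <column-name>. The object name GroupCountSketch and the helper totalFromGroups are hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object GroupCountSketch {
  // Hypothetical helper: sums the per-group counts back into a single total.
  def totalFromGroups(df: DataFrame, column: String): Long =
    df.groupBy(column).count()   // adds a "count" column, one row per distinct value
      .agg(sum("count"))         // add up the per-group counts
      .first()
      .getLong(0)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only for illustration.
    val spark = SparkSession.builder()
      .appName("group-count-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; "category" plays the role of <column-name>.
    val file = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("category", "value")

    file.groupBy("category").count().show() // per-group counts, not a total

    val total = totalFromGroups(file, "category")
    println(s"sum of group counts = $total, file.count() = ${file.count()}")

    spark.stop()
  }
}
```

If only the overall number of records is needed, calling count() directly on the dataframe avoids the extra grouping step.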