Java: counting the number of rows in an RDD

Question

Asked by Amine CHERIFI

I'm using Spark with Java, and I have an RDD of 5 million rows. Is there a way to calculate the number of rows in my RDD? I've tried RDD.count(), but it takes a lot of time. I've seen that I can use the fold function, but I couldn't find Java documentation for it. Could you please show me how to use it, or show me another way to get the number of rows in my RDD?

Here is my code:

JavaPairRDD<String, String> lines = getAllCustomers(sc).cache();
JavaPairRDD<String, String> CFIDNotNull = lines.filter(notNull()).cache();
JavaPairRDD<String, Tuple2<String, String>> join = lines.join(CFIDNotNull).cache();

double count_ctid = (double) join.count(); // I want to get the counts of these three RDDs
double all = (double) lines.count();
double count_cfid = all - CFIDNotNull.count();
System.out.println("********** :" + count_cfid * 100 / all + "% and now : " + count_ctid * 100 / all + "%");

Thank you.

Answer 1

Accepted answer by Daniel Darabos

You had the right idea: use rdd.count() to count the number of rows. There is no faster way.

I think the question you should have asked is: why is rdd.count() so slow?

The answer is that rdd.count() is an "action": it is an eager operation, because it has to return an actual number. The RDD operations you performed before count() were "transformations": they lazily transformed one RDD into another. In effect, the transformations were not actually performed, just queued up. When you call count(), you force all the previous lazy operations to be performed. The input files need to be loaded now, map()s and filter()s executed, shuffles performed, etc., until finally we have the data and can say how many rows it has.

Note that if you call count() twice, all this will happen twice. After the count is returned, all the data is discarded! If you want to avoid this, call cache() on the RDD. Then the second call to count() will be fast, and derived RDDs will also be faster to calculate. However, in this case the RDD will have to be stored in memory (or on disk).
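
To make the lazy-versus-eager point concrete without needing a cluster, here is a rough plain-Java analogy built on java.util.stream instead of Spark. It is only a sketch of the idea, not Spark code: the class LazyCountDemo and the helpers evenRows and materialize are hypothetical names made up for this illustration. The Supplier stands in for an uncached RDD lineage (every count re-runs the filter), and collecting to a List stands in for cache().

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Plain-Java analogy only: no Spark classes are used here.
class LazyCountDemo {
    // Records how many times the filter predicate has actually run.
    public static final AtomicInteger filterCalls = new AtomicInteger();

    // Stand-in for an uncached RDD: each get() rebuilds the pipeline from the source.
    public static Supplier<Stream<Integer>> evenRows(int rows) {
        return () -> IntStream.range(0, rows).boxed().filter(x -> {
            filterCalls.incrementAndGet();
            return x % 2 == 0;
        });
    }

    // Stand-in for cache(): run the pipeline once and keep the result in memory.
    public static List<Integer> materialize(Supplier<Stream<Integer>> pipeline) {
        return pipeline.get().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Defining the pipeline does no work yet (the "transformation" step).
        Supplier<Stream<Integer>> evens = evenRows(1000);
        System.out.println("defined, filter calls: " + filterCalls.get());

        long first = evens.get().count();  // the "action": forces the pipeline to run
        long second = evens.get().count(); // runs the whole pipeline a second time
        System.out.println(first + " and " + second + " rows, filter calls: " + filterCalls.get());

        List<Integer> cached = materialize(evens); // one more full run, result kept
        int third = cached.size();                 // answered from memory
        int fourth = cached.size();                // still no recomputation
        System.out.println(third + " and " + fourth + " rows, filter calls: " + filterCalls.get());
    }
}
```

The filter counter stays at zero until the first count, grows by the full input size on every uncached count, and stops growing once the result is materialized. The trade-off is the same one described above: the materialized rows have to be held somewhere.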

Answer 2

Answer by Timothy Perrigo

Daniel's explanation of count is right on the money. If you are willing to accept an approximation, though, you could try the countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble] RDD method. (Note, though, that this is tagged as "Experimental".)
