How to print an RDD in Python in Spark

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/33027949/


How to print rdd in python in spark

python, apache-spark, pyspark, apache-spark-sql

Asked by yguw

I have two files on HDFS and I just want to join these two files on a column, say employee ID.


I am trying to simply print the files to make sure we are reading them correctly from HDFS.


# 'sc' is the SparkContext provided by the pyspark shell
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print(lines.count())

I have tried the foreach and println functions as well, and I am not able to display the file data. I am working in Python and am totally new to both Python and Spark.

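For reference, the join the question asks about can be written directly on pair RDDs. The sketch below is only an illustration and makes assumptions not found in the original post: the second file name (dept.txt), the comma delimiter, and the employee ID being the first column.

# Minimal sketch, assuming comma-separated lines with the employee ID in the
# first column; the second file path (dept.txt) is hypothetical.
emp = sc.textFile("hdfs://ip:8020/emp.txt") \
        .map(lambda line: line.split(",")) \
        .map(lambda cols: (cols[0], cols[1:]))    # key each record by employee ID

dept = sc.textFile("hdfs://ip:8020/dept.txt") \
         .map(lambda line: line.split(",")) \
         .map(lambda cols: (cols[0], cols[1:]))   # key each record by employee ID

joined = emp.join(dept)          # (employee_id, (emp_fields, dept_fields))
for record in joined.take(10):   # bring a small sample back to the driver
    print(record)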

Answered by Alberto Bonsanto

This is really easy, just do a collect(). You must be sure that all the data fits in the memory of your master.


my_rdd = sc.parallelize(range(10000000))
print(my_rdd.collect())   # collect() brings the entire RDD back to the driver

If that is not the case, you must just take a sample by using the take method.


# I use an exaggerated number to remind you it is very large and won't fit
# in the memory of your master, so collect wouldn't work
my_rdd = sc.parallelize(range(100000000000000000))
print(my_rdd.take(100))

Another example using .ipynb:

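A rough, hypothetical sketch of what such a notebook cell might look like, assuming a local SparkContext is created in the cell (notebook kernels often already provide one, in which case the setup line should be skipped):

from pyspark import SparkContext

# Assumed local setup for illustration only; a notebook session may already expose 'sc'
sc = SparkContext("local[*]", "print-rdd-example")

rdd = sc.parallelize(range(1000))
print(rdd.take(10))   # show a small sample in the cell output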