View RDD contents in Python Spark?
Notice: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/25295277/
Asked by lmart999
Running a simple app in pyspark.
f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
I want to view RDD contents using foreach action:
wc.foreach(print)
This throws a syntax error:
SyntaxError: invalid syntax
What am I missing?
Accepted answer by Josh Rosen
This error occurs because print isn't a function in Python 2.6; it is a statement, so it cannot be passed as an argument.
You can either define a helper function that performs the print, or use the __future__ module to treat print as a function:
>>> from operator import add
>>> f = sc.textFile("README.md")
>>> wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
>>> def g(x):
...     print x
...
>>> wc.foreach(g)
or
>>> from __future__ import print_function
>>> wc.foreach(print)
However, I think it would be better to use collect() to bring the RDD contents back to the driver, because foreach executes on the worker nodes and the output may not appear in your driver / shell (it probably will in local mode, but not when running on a cluster).
>>> for x in wc.collect():
...     print x
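As a minimal sketch of the driver-side approach described above, the helper below collects an RDD and prints each element in the driver process. The name print_on_driver is hypothetical (it is not part of PySpark), and the sketch assumes the RDD is small enough to fit in driver memory. It calls print(...) with a single argument, which behaves the same way under Python 2 and Python 3.

```python
def print_on_driver(rdd):
    """Collect every element of `rdd` to the driver and print it there.

    Hypothetical helper, not part of PySpark. Only safe for small RDDs,
    because collect() loads the whole data set into driver memory.
    """
    elements = rdd.collect()  # a plain Python list living on the driver
    for element in elements:
        print(element)  # printed by the driver process, not by the workers
    return len(elements)  # number of elements that were printed
```

In the shell session from the question, this would be called as print_on_driver(wc).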
Answer by Jeevs
Try this:
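data = f.flatMap(lambda x: x.split(' '))        # split each line into words
pairs = data.map(lambda x: (x, 1))              # named pairs to avoid shadowing the built-in map
counts = pairs.reduceByKey(lambda x, y: x + y)  # sum the counts for each word
result = counts.collect()                       # bring the results to the driver as a list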
Please note that when you run collect(), the RDD, which is a distributed data set, is aggregated at the driver node and is essentially converted to a list. So it is not a good idea to collect() a 2 TB data set. If all you need is a couple of samples from your RDD, use take(10).
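The take-based alternative could look like the sketch below. The helper name preview and its default of 10 are assumptions made for illustration; only take() itself comes from the answer.

```python
def preview(rdd, n=10):
    """Print and return at most `n` elements of `rdd`.

    Hypothetical helper, not part of PySpark. take(n) returns only the
    first n elements to the driver, so it avoids collecting the full RDD.
    """
    rows = rdd.take(n)  # a plain Python list with at most n elements
    for row in rows:
        print(row)
    return rows
```

For the word count in the question, preview(wc, 5) would show five (word, count) pairs without pulling the rest of the data to the driver.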
Answer by iec2011007
If you want to see the contents of an RDD, then yes, collect is one option, but it fetches all the data to the driver, which can be a problem.
<rdd.name>.take(<num of elements you want to fetch>)
This is better if you just want to see a sample.
I don't recommend running foreach and trying to print: if you are running on a cluster, the print output is local to each executor, and each executor prints only the data accessible to it. The print statement does not change any state, so this is not logically wrong. To get all the output in one place, you would have to do something like:
for x in wc.collect():  # bring everything to the driver first
    print(x)            # then print locally
But this may result in job failure, as collecting all the data on the driver may crash it. I would suggest using the take command or, if you want to analyze the data, using sample followed by collect on the driver, or writing to a file and analyzing that.
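The sample-then-collect suggestion might be sketched as follows. The helper name sample_to_driver and the default fraction and seed are assumptions chosen for illustration. Note that fraction is the expected share of elements, so the size of the result varies from run to run unless the seed is fixed.

```python
def sample_to_driver(rdd, fraction=0.01, seed=42):
    """Collect a random subset of `rdd` on the driver.

    Hypothetical helper, not part of PySpark. `fraction` is the expected
    share of elements to keep, so the result size is approximate.
    """
    # sample(withReplacement, fraction, seed) returns a new, smaller RDD
    sampled = rdd.sample(False, fraction, seed)
    return sampled.collect()  # only the sampled elements reach the driver
```

For the write-to-file route, wc.saveAsTextFile(path) writes the RDD out as text files under an output directory of your choosing, which can then be inspected offline.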
Answer by Frederico Oliveira
In Spark 2.0 (I haven't tested with earlier versions), simply:
print(myRDD.take(n))
where n is the number of elements to return and myRDD is wc in your case.
Answer by YDD9
According to the latest documentation, you can use rdd.collect().foreach(println) on the driver to display everything, but this may cause memory issues on the driver; it is best to use rdd.take(desired_number).
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
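The snippets quoted from the programming guide are written in Scala (println is a Scala function, and a Python list has no foreach method). A rough PySpark equivalent is a plain loop over the result of take(). The sketch below also fetches one extra element so it can report when output was cut off. The helper name print_up_to and the truncation message are assumptions for illustration.

```python
def print_up_to(rdd, limit=100):
    """Print at most `limit` elements of `rdd` and report truncation.

    Hypothetical helper, not part of PySpark. Returns True when the RDD
    held more than `limit` elements, False otherwise.
    """
    rows = rdd.take(limit + 1)  # one extra element reveals whether more exist
    truncated = len(rows) > limit
    for row in rows[:limit]:
        print(row)
    if truncated:
        print("... output truncated after %d elements" % limit)
    return truncated
```

Calling print_up_to(wc, 100) mirrors rdd.take(100).foreach(println) from the guide while keeping the driver's memory use bounded.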
Answer by alehresmann
You can simply collect the entire RDD (which returns a list of rows) and print that list:
print(wc.collect())

