How to print an RDD in Python in Spark

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/33027949/


How to print rdd in python in spark

python, apache-spark, pyspark, apache-spark-sql

Asked by yguw

I have two files on HDFS and I just want to join these two files on a column, say employee ID.


I am trying to simply print the files to make sure we are reading them correctly from HDFS.


# 'sc' is the SparkContext provided by the pyspark shell
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print(lines.count())

I have tried the foreach and println functions as well, and I am not able to display the file data. I am working in Python and am totally new to both Python and Spark.

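For reference, the join the question asks about can be written directly on pair RDDs. The sketch below is only an illustration and makes assumptions not found in the original post: the second file name (dept.txt), the comma delimiter, and the employee ID being the first column.

# Minimal sketch, assuming comma-separated lines with the employee ID in the
# first column; the second file path (dept.txt) is hypothetical.
emp = sc.textFile("hdfs://ip:8020/emp.txt") \
        .map(lambda line: line.split(",")) \
        .map(lambda cols: (cols[0], cols[1:]))    # key each record by employee ID

dept = sc.textFile("hdfs://ip:8020/dept.txt") \
         .map(lambda line: line.split(",")) \
         .map(lambda cols: (cols[0], cols[1:]))   # key each record by employee ID

joined = emp.join(dept)          # (employee_id, (emp_fields, dept_fields))
for record in joined.take(10):   # bring a small sample back to the driver
    print(record)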

Answered by Alberto Bonsanto

This is really easy, just do a collect(). You must be sure that all the data fits in the memory of your master.


my_rdd = sc.parallelize(range(10000000))
print(my_rdd.collect())   # collect() brings the entire RDD back to the driver

If that is not the case, you must just take a sample by using the take method.


# I use an exaggerated number to remind you it is very large and won't fit
# in the memory of your master, so collect wouldn't work
my_rdd = sc.parallelize(range(100000000000000000))
print(my_rdd.take(100))

Another example using .ipynb:

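A rough, hypothetical sketch of what such a notebook cell might look like, assuming a local SparkContext is created in the cell (notebook kernels often already provide one, in which case the setup line should be skipped):

from pyspark import SparkContext

# Assumed local setup for illustration only; a notebook session may already expose 'sc'
sc = SparkContext("local[*]", "print-rdd-example")

rdd = sc.parallelize(range(1000))
print(rdd.take(10))   # show a small sample in the cell output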