Spark with Python: saving RDD output to a text file

Question

Asked by RACHITA PATRO

I am trying the word count problem in Spark using Python, but I run into a problem when I try to save the output RDD to a text file with .saveAsTextFile. Here is my code. Please help, I am stuck. Thanks for your time.

import re

from pyspark import SparkConf , SparkContext

def normalizewords(text):
    return re.compile(r'\W+',re.UNICODE).split(text.lower())

conf=SparkConf().setMaster("local[2]").setAppName("sorted result")
sc=SparkContext(conf=conf)

input=sc.textFile("file:///home/cloudera/PythonTask/sample.txt")

words=input.flatMap(normalizewords)

wordsCount=words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)

sortedwordsCount=wordsCount.map(lambda (x,y):(y,x)).sortByKey()

results=sortedwordsCount.collect()

for result in results:
    count=str(result[0])
    word=result[1].encode('ascii','ignore')

    if(word):
        print word +"\t\t"+ count

results.saveAsTextFile("/var/www/myoutput")

Answer 1

Accepted answer by WoodChopper

Since you called results=sortedwordsCount.collect(), results is no longer an RDD. It is an ordinary Python list, here a list of (count, word) tuples.

As you know, list is a Python object/data structure, and append is its method for adding an element.

>>> x = []
>>> x.append(5)
>>> x
[5]

Similarly, RDD is Spark's object/data structure, and saveAsTextFile is its method for writing the data to a file. The important difference is that an RDD is a distributed data structure.

So we cannot use append on an RDD or saveAsTextFile on a list. collect is a method on an RDD that brings the RDD's contents into driver memory as a list.

As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and write results to it.
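
The second option (writing the collected list from plain Python) could look roughly like the sketch below. The helper name write_counts and the output path are made up for illustration; the only thing taken from the question is that results holds (count, word) tuples. This runs on the driver only, so it suits results small enough to fit in driver memory.

```python
import io


def write_counts(results, path):
    """Write a list of (count, word) tuples to a tab-separated text file.

    `results` is assumed to be what collect() returned in the question;
    `path` is a placeholder for a local file the driver can write to.
    """
    # io.open accepts an encoding on both Python 2 and Python 3.
    with io.open(path, "w", encoding="utf-8") as out:
        for count, word in results:
            if word:  # skip the empty token produced by the regex split
                out.write(u"{}\t{}\n".format(word, count))
```

In the question's script this would replace the last line, for example write_counts(results, "myoutput.txt"), where the file name is again only an example.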

Answer 2

Answer by Derrick wang

Change results=sortedwordsCount.collect() to results=sortedwordsCount. With .collect(), results becomes a list; without it, results is still an RDD, so saveAsTextFile can be called on it.
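
A minimal sketch of that approach, keeping everything as an RDD until the save. The function names (normalize_words, format_line, save_sorted_counts) are invented for this example, and sc is assumed to be a SparkContext created the same way as in the question. Tuple-unpacking lambdas are avoided so the code is not tied to Python 2.

```python
import re


def normalize_words(text):
    # Split lower-cased text on runs of non-word characters, dropping empty strings.
    return [w for w in re.split(r"\W+", text.lower(), flags=re.UNICODE) if w]


def format_line(pair):
    # pair is (count, word), the shape produced by the swap before sortByKey().
    count, word = pair
    return u"{}\t{}".format(word, count)


def save_sorted_counts(sc, input_path, output_path):
    # sc: an existing SparkContext; both paths are supplied by the caller.
    words = sc.textFile(input_path).flatMap(normalize_words)
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    sorted_counts = counts.map(lambda kv: (kv[1], kv[0])).sortByKey()
    # No collect() here: saveAsTextFile is called on the RDD itself.
    sorted_counts.map(format_line).saveAsTextFile(output_path)
```

With the question's paths this would be called as save_sorted_counts(sc, "file:///home/cloudera/PythonTask/sample.txt", "/var/www/myoutput"). Note that saveAsTextFile treats the output path as a directory and writes one part file per partition into it, and the job fails if that directory already exists.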
