Scala: how can I save an RDD into HDFS and later read it back?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40069264/

Date: 2020-10-22 08:44:44  Source: igfitidea

How can I save an RDD into HDFS and later read it back?

scala apache-spark hdfs rdd bigdata

Asked by pythonic

I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD into the HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?

Accepted answer by T. Gawęda

It is possible.

On an RDD you have the saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can parse them later.

Reading can be done with the textFile function from SparkContext, followed by a .map to strip the () and parse the values.

So: Version 1:

rdd.saveAsTextFile("hdfs:///test1/")
// later, in another program
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map { line =>
  // strip the surrounding ( ) and parse the Long / String back out
  val Array(id, value) = line.stripPrefix("(").stripSuffix(")").split(",", 2)
  (id.toLong, value)
}

Version 2:

rdd.saveAsObjectFile("hdfs:///test1/")
// later, in another program - tuples come back out of the box :)
val newRdds = sparkContext.objectFile[(Long, String)]("hdfs:///test1/part-*")
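
Note that saveAsObjectFile writes the tuples with Java serialization, so the output is not human-readable and should be read back by a compatible Spark job, whereas saveAsTextFile produces plain text that you can inspect but have to parse back into (Long, String) yourself.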

Answered by Kris

I would recommend using a DataFrame if your RDD is in a tabular format. A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements of one variable and each row contains one case. Because of its tabular format a DataFrame carries additional metadata, which allows Spark to run certain optimizations on the finalized query, whereas an RDD (Resilient Distributed Dataset) is more of a black box, a core abstraction over data that Spark cannot optimize in the same way. However, you can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.

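For example, a minimal sketch of converting between the (Long, String) RDD from the question and a DataFrame might look like this (it assumes a SQLContext named sqlContext is in scope; the column names "id" and "name" are only illustrative):

import sqlContext.implicits._  // brings toDF into scope for RDDs of tuples

// RDD[(Long, String)] -> DataFrame
val df = rdd.toDF("id", "name")

// DataFrame -> RDD[(Long, String)]
val backToRdd = df.rdd.map(row => (row.getLong(0), row.getString(1)))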

The following is an example of creating a DataFrame, storing it in CSV and Parquet format on HDFS, and reading it back:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("Spark-HDFS-Read-Write")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // needed for toDF on the Seq below

val hdfs = "hdfs:///"
val df = Seq((1, "Name1")).toDF("id", "name")

// Writing file in CSV format
df.write.format("com.databricks.spark.csv").mode("overwrite").save(hdfs + "user/hdfs/employee/details.csv")

// Writing file in PARQUET format
df.write.format("parquet").mode("overwrite").save(hdfs + "user/hdfs/employee/details")

// Reading CSV files from HDFS
val dfIncsv = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load(hdfs + "user/hdfs/employee/details.csv")

// Reading PARQUET files from HDFS
val dfInParquet = sqlContext.read.parquet(hdfs + "user/hdfs/employee/details")
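
If you need the data back as an RDD of (Long, String) tuples, as in the original question, you can convert the DataFrame you just read. A minimal sketch, assuming the id/name columns written in the example above:

import org.apache.spark.sql.functions.col

// Cast id to Long (it was written as an Int above) and drop down to an RDD of tuples
val tupleRdd = dfInParquet
  .select(col("id").cast("long"), col("name"))
  .rdd
  .map(row => (row.getLong(0), row.getString(1)))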