Saving as Text in Spark 1.3.0 using DataFrames in Scala

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29302901/


sql · scala · apache-spark

Asked by jeffrey podolsky

I am using Spark version 1.3.0 and DataFrames with Spark SQL in Scala. In version 1.2.0 there was a method called "saveAsText". In version 1.3.0, using DataFrames, there is only a "save" method, and the default output format is Parquet.
How can I specify that the output should be text when using the save method?

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

teenagers.save("/user/me/out")
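// Note: in Spark 1.3 this writes Parquet by default; save() has no built-in text format.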

Answered by ngtrkhoa

You can use this:

teenagers.rdd.saveAsTextFile("/user/me/out")
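
Note that saveAsTextFile on an RDD[Row] writes each Row's toString, which looks like "[Michael]". If you want plain delimited text, you can format the rows first. A minimal sketch (reusing the teenagers DataFrame from the question; the comma delimiter is just an assumption):

// Convert each Row to a Seq of its values, join the values with commas,
// and write the resulting strings out as plain text files.
teenagers.rdd.map(row => row.toSeq.mkString(",")).saveAsTextFile("/user/me/out")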

Answered by lev

First off, you should consider whether you really need to save the DataFrame as text. Because a DataFrame holds the data by columns (not by rows, as an RDD does), the .rdd operation is costly: the data has to be reprocessed into rows. Parquet is a columnar format and is much more efficient to use.

That being said, sometimes you really do need to save as a text file.

As far as I know, DataFrame out of the box won't let you save as a text file. If you look at the source code, you'll see that four formats are supported:

jdbc
json
parquet
orc

So your options are either to use df.rdd.saveAsTextFile as suggested above, or to use spark-csv, which will let you do something like:

Spark 1.4+:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("cars.csv")

df.select("year", "model")
  .write
  .format("com.databricks.spark.csv")
  .save("newcars.csv")

Spark 1.3:

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")

with the added benefit of handling the annoying parts of quoting and escaping the strings for you.
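If one of the other built-in formats would do instead of text, the save overload that takes a source name also works in Spark 1.3. A sketch, assuming JSON output is acceptable (using the teenagers DataFrame from the question):

// "json" is one of the built-in source names in Spark 1.3;
// each output line is a JSON object such as {"name":"Michael"}.
teenagers.save("/user/me/out", "json")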

Answered by Sietse

If you look at the migration guide https://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13, you can see that

[...] DataFrames no longer inherit from RDD directly [...]

You can still use saveAsTextFile if you use the .rdd method to get an RDD[Row].
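In practice that looks like this (a minimal sketch, reusing the teenagers DataFrame from the question):

// In Spark 1.2, SchemaRDD was itself an RDD, so saveAsTextFile worked directly.
// In Spark 1.3, DataFrame is not an RDD, so convert it first:
val rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = teenagers.rdd
rows.saveAsTextFile("/user/me/out")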

Answered by Robert Chevallier

In Python: to get a CSV (without a header) for a DataFrame df:

df.rdd.map(lambda r: ";".join([str(c) for c in r])).saveAsTextFile(outfilepath)

There is also an extension developed by Databricks: spark-csv.

Cf. https://github.com/databricks/spark-csv