How to convert a Spark Dataset of Rows into strings in Java?

Note: this page is an English/Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/42389203/



Tags: java, string, apache-spark, apache-spark-sql, apache-spark-dataset

Asked by Jaffer Wilson

I have written the code to access the Hive table using SparkSQL. Here is the code:


SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .master("local[*]")
        .config("hive.metastore.uris", "thrift://localhost:9083")
        .enableHiveSupport()
        .getOrCreate();
Dataset<Row> df =  spark.sql("select survey_response_value from health").toDF();
df.show();

I would like to know how I can convert the complete output to a String or a String array. I am trying to pass the result to another module that only accepts String or String[] values.
I have tried methods such as .toString and typecasting to String, but they did not work for me.
Kindly let me know how I can convert the Dataset values to String?


Accepted answer by abaghel

Here is the sample code in Java.


import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("SparkSample")
            .master("local[*]")
            .getOrCreate();
        // create df
        List<String> myList = Arrays.asList("one", "two", "three", "four", "five");
        Dataset<Row> df = spark.createDataset(myList, Encoders.STRING()).toDF();
        df.show();
        // using df.as: view the single-column DataFrame as a Dataset<String>
        List<String> listOne = df.as(Encoders.STRING()).collectAsList();
        System.out.println(listOne);
        // using df.map: render each Row as a string
        // (the MapFunction cast resolves an overload ambiguity in the Java API)
        List<String> listTwo = df
            .map((MapFunction<Row, String>) row -> row.mkString(), Encoders.STRING())
            .collectAsList();
        System.out.println(listTwo);
    }
}

"row" is java 8 lambda parameter. Please check developer.com/java/start-using-java-lambda-expressions.html


Answered by hage

You can use the map function to convert every row into a string, e.g.:


df.map(row => row.mkString())

Instead of just mkString you can of course do more sophisticated work


The collect method can then retrieve the whole thing into an array


val strings = df.map(row => row.mkString()).collect

(This is the Scala syntax, I think in Java it's quite similar)
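
For reference, a rough Java equivalent of the Scala lines above might look like the sketch below; the MapFunction cast is an addition here to disambiguate the Java overloads of map:

// Java sketch of the Scala one-liner above.
// The cast selects the map(MapFunction, Encoder) overload of the Java API.
List<String> strings = df
        .map((MapFunction<Row, String>) row -> row.mkString(), Encoders.STRING())
        .collectAsList();
String[] asArray = strings.toArray(new String[0]); // the String array the question asks for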


Answered by Areeha

If you are planning to read the dataset line by line, then you can use the iterator over the dataset:


Dataset<Row> csv = session.read().format("csv").option("sep", ",").option("inferSchema", true)
        .option("escape", "\"").option("header", true).option("multiline", true).load("users/abc/....");

for (Iterator<Row> iter = csv.toLocalIterator(); iter.hasNext();) {
    String item = iter.next().toString();
    System.out.println(item);
}
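
If the goal is the String array from the original question rather than console output, the same iterator can fill a list first. A minimal sketch, assuming the csv Dataset above and the usual java.util imports:

// Gather each row's string form, then convert to String[] (a sketch).
List<String> lines = new ArrayList<>();
for (Iterator<Row> iter = csv.toLocalIterator(); iter.hasNext();) {
    lines.add(iter.next().mkString());
}
String[] result = lines.toArray(new String[0]);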