Dataframe from List<String> in Java
Original question: http://stackoverflow.com/questions/43633696/
License: content from StackOverflow, provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors.
Asked by Devender
- Spark Version : 1.6.2
- Java Version: 7
I have List<String> data, something like:
[[dev, engg, 10000], [karthik, engg, 20000]..]
I know the schema for this data:
name (String)
degree (String)
salary (Integer)
I tried:
JavaRDD<String> data = new JavaSparkContext(sc).parallelize(datas);
DataFrame df = sqlContext.read().json(data);
df.printSchema();
df.show(false);
Output:
root
|-- _corrupt_record: string (nullable = true)
+-----------------------------+
|_corrupt_record |
+-----------------------------+
|[dev, engg, 10000] |
|[karthik, engg, 20000] |
+-----------------------------+
This happens because the List<String> elements are not valid JSON.
Do I need to build proper JSON first, or is there another way to do this?
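For reference, here is a rough sketch of what the "proper JSON" route could look like. As far as I know, read().json(JavaRDD<String>) expects each string to be one JSON object, so every record would first have to be rewritten as an object with named fields. The class JsonLineBuilder and its methods are hypothetical names made up for this illustration. The sketch assumes each record has exactly three comma-separated fields, optionally wrapped in square brackets, and that the fields contain no commas or control characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark): turns "[dev, engg, 10000]"
// into a JSON object string with the field names from the known schema.
final class JsonLineBuilder {

    // Wraps a value in double quotes, escaping backslashes and quotes.
    // Assumption: the value contains no control characters.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Converts one record to a JSON object with name, degree and salary.
    static String toJson(String record) {
        String body = record.trim();
        // Strip the surrounding brackets if they are present.
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        String[] parts = body.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + record);
        }
        // Parsing validates that salary is numeric so it can be emitted unquoted.
        int salary = Integer.parseInt(parts[2].trim());
        return "{\"name\":" + quote(parts[0].trim())
                + ",\"degree\":" + quote(parts[1].trim())
                + ",\"salary\":" + salary + "}";
    }

    // Converts every record in the list.
    static List<String> toJsonLines(List<String> records) {
        List<String> out = new ArrayList<String>();
        for (String record : records) {
            out.add(toJson(record));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> datas = Arrays.asList("[dev, engg, 10000]", "[karthik, engg, 20000]");
        for (String json : toJsonLines(datas)) {
            System.out.println(json);
        }
    }
}
```

For the first sample record this yields {"name":"dev","degree":"engg","salary":10000}. The resulting list could then be parallelized and passed to read().json in place of the raw strings.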
Answered by abaghel
You can create a DataFrame from the List<String>, then use selectExpr and split to get the desired DataFrame.
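import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

public class SparkSample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqc = new SQLContext(jsc);
        // sample data
        List<String> data = new ArrayList<String>();
        data.add("dev, engg, 10000");
        data.add("karthik, engg, 20000");
        // DataFrame with a single string column named "value"
        DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
        df.printSchema();
        df.show();
        // Convert: split each value on ',' and name the three parts
        DataFrame df1 = df.selectExpr("split(value, ',')[0] as name", "split(value, ',')[1] as degree", "split(value, ',')[2] as salary");
        df1.printSchema();
        df1.show();
    }
}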
You will get the following output:
root
|-- value: string (nullable = true)
+--------------------+
| value|
+--------------------+
| dev, engg, 10000|
|karthik, engg, 20000|
+--------------------+
root
|-- name: string (nullable = true)
|-- degree: string (nullable = true)
|-- salary: string (nullable = true)
+-------+------+------+
| name|degree|salary|
+-------+------+------+
| dev| engg| 10000|
|karthik| engg| 20000|
+-------+------+------+
The sample data you provided contains spaces. If you want to remove them and make the salary column an integer, you can use the trim and cast functions as shown below.
// requires static imports of col and trim from org.apache.spark.sql.functions
df1 = df1.select(trim(col("name")).as("name"), trim(col("degree")).as("degree"), trim(col("salary")).cast("integer").as("salary"));
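To make it clearer what split, trim and the integer cast do to each record, here is a minimal plain-Java sketch of the same three steps without Spark. EmployeeLineParser and Employee are hypothetical names used only for this illustration. One difference to be aware of: Integer.parseInt throws on a non-numeric salary, whereas a Spark cast would, as far as I know, produce null instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark): applies split, trim and an
// integer conversion to one comma-separated line.
final class EmployeeLineParser {

    // Simple value holder matching the schema: name, degree, salary.
    static final class Employee {
        final String name;
        final String degree;
        final int salary;

        Employee(String name, String degree, int salary) {
            this.name = name;
            this.degree = degree;
            this.salary = salary;
        }
    }

    // Splits on ',', trims each part, and converts the third part to an int.
    static Employee parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        return new Employee(parts[0].trim(), parts[1].trim(),
                Integer.parseInt(parts[2].trim()));
    }

    // Parses every line in the list.
    static List<Employee> parseAll(List<String> lines) {
        List<Employee> out = new ArrayList<Employee>();
        for (String line : lines) {
            out.add(parse(line));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("dev, engg, 10000", "karthik, engg, 20000");
        for (Employee e : parseAll(data)) {
            System.out.println(e.name + "|" + e.degree + "|" + e.salary);
        }
    }
}
```

Without the trim step the degree would be " engg" with a leading space, and Integer.parseInt(" 10000") would fail, which is why the trimming comes before the conversion here.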
Answered by Vikas Singh
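import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.feature.NGram;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Assumes jsc is a JavaSparkContext available in the enclosing class.
DataFrame createNGramDataFrame(JavaRDD<String> lines) {
    JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
        private static final long serialVersionUID = -4332903997027358601L;
        @Override
        public Row call(String line) throws Exception {
            // Wrap the words in a List so they form one array column
            // instead of being spread across several columns by varargs.
            return RowFactory.create(Arrays.asList(line.split("\\s+")));
        }
    });
    StructType schema = new StructType(new StructField[] {
        new StructField("words",
            DataTypes.createArrayType(DataTypes.StringType), false,
            Metadata.empty()) });
    DataFrame wordDF = new SQLContext(jsc).createDataFrame(rows, schema);
    // build a bigram language model
    NGram transformer = new NGram().setInputCol("words")
        .setOutputCol("ngrams").setN(2);
    DataFrame ngramDF = transformer.transform(wordDF);
    ngramDF.show(10, false);
    return ngramDF;
}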
Answered by pasha701
The task can be done without JSON. In Scala:
// Assumes sparkContext and sqlContext are already in scope.
val data = List("dev, engg, 10000", "karthik, engg, 20000")
val initialRdd = sparkContext.parallelize(data)
// Split each line on ',' and keep the three parts as a tuple
val splittedRDD = initialRdd.map(current => {
  val array = current.split(",")
  (array(0), array(1), array(2))
})
import sqlContext.implicits._
val dataframe = splittedRDD.toDF("name", "degree", "salary")
dataframe.show()
Output:
+-------+------+------+
| name|degree|salary|
+-------+------+------+
| dev| engg| 10000|
|karthik| engg| 20000|
+-------+------+------+
Note: (array(0), array(1), array(2)) is a Scala tuple.