Original question: http://stackoverflow.com/questions/51036010/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Convert Array into dataframe with columns and index in Scala
Asked by PRIYA M
Initially I have a matrix
0.0 0.4 0.4 0.0
0.1 0.0 0.0 0.7
0.0 0.2 0.0 0.3
0.3 0.0 0.0 0.0
The matrix `matrix` is converted into an array `normal_array` by
`val normal_array = matrix.toArray`
and I have an array of strings
inputCols : Array[String] = Array(p1, p2, p3, p4)
I need to convert this matrix into the following data frame. (Note: the number of rows and columns in the matrix will be the same as the length of `inputCols`.)
index p1 p2 p3 p4
p1 0.0 0.4 0.4 0.0
p2 0.1 0.0 0.0 0.7
p3 0.0 0.2 0.0 0.3
p4 0.3 0.0 0.0 0.0
In Python, this can be easily achieved with the pandas library.
arrayToDataframe = pandas.DataFrame(normal_array,columns = inputCols, index = inputCols)
But how can I do this in Scala?
Accepted answer by Manoj Kumar Dhakad
You can do something like the following:
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, udf}
// Convert your data to a Scala Seq/List/Array
val list = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
// Define the array of desired column names
val inputCols: Array[String] = Array("p1", "p2", "p3", "p4")
// Create a DataFrame from the data; tuple fields get default column names _1, _2, etc.
val df = sparkSession.createDataFrame(list)
// Get the list of column names from the DataFrame
val dfColumns = df.columns
// Build the expressions that rename the columns, e.g. "_1 as p1"
val query = inputCols.zipWithIndex.map { case (name, i) => s"${dfColumns(i)} as $name" }
// Apply the renaming expressions
val newDf = df.selectExpr(query: _*)
// UDF that takes a row index (0, 1, 2, 3) and returns the corresponding name from inputCols
val getIndexUDF = udf((rowNo: Long) => inputCols(rowNo.toInt))
// Add a temporary row_no column, derive the index column from it, then drop row_no.
// Note: monotonically_increasing_id yields 0, 1, 2, ... only while the data sits in a single partition.
val dfWithRow = newDf.withColumn("row_no", monotonically_increasing_id()).withColumn("index", getIndexUDF(col("row_no"))).drop("row_no")
dfWithRow.show
Sample Output:
+---+---+---+---+-----+
| p1| p2| p3| p4|index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|   p1|
|0.1|0.0|0.0|0.7|   p2|
|0.0|0.2|0.0|0.3|   p3|
|0.3|0.0|0.0|0.0|   p4|
+---+---+---+---+-----+
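One caveat with the approach above: it uses the generated row ID directly as an index into `inputCols`. Spark documents these IDs as unique and increasing but not consecutive, with the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below only reproduces that arithmetic in plain Scala to show why the lookup is safe for a single partition and breaks otherwise. `RowIdSketch` and `rowId` are hypothetical names for illustration, not part of Spark.

```scala
object RowIdSketch {
  // Hypothetical helper mirroring the documented ID layout:
  // partition ID in the upper 31 bits, record number in the lower 33 bits.
  def rowId(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) + recordNumber
}

// Rows in partition 0 get IDs 0, 1, 2, ... and can index into inputCols.
val firstPartitionIds = (0L until 4L).map(RowIdSketch.rowId(0, _))

// The first row of partition 1 already gets 2^33, far outside a 4-element array.
val secondPartitionFirstId = RowIdSketch.rowId(1, 0L)
```

For a small local `Seq` like the one in this answer the data normally stays in one partition, so the output matches. For larger or repartitioned data, attaching the label before creating the DataFrame avoids the issue.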
Answer by 1pluszara
Here is another way:
val data = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
val cols = Array("p1", "p2", "p3", "p4","index")
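Zip the rows with the labels and convert the result into a DataFrame. Since `zip` stops at the shorter collection, the four rows are paired with p1 to p4 and the trailing "index" entry is only used as a column name.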
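// toDF needs the implicits of your SparkSession in scope, e.g. import sparkSession.implicits._
data.zip(cols).map {
// Append the label as a fifth field after the four row values
case (row, label) => (row._1, row._2, row._3, row._4, label)
}.toDF(cols: _*)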
Output:
+---+---+---+---+-----+
|p1 |p2 |p3 |p4 |index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|p1   |
|0.1|0.0|0.0|0.7|p2   |
|0.0|0.2|0.0|0.3|p3   |
|0.3|0.0|0.0|0.0|p4   |
+---+---+---+---+-----+
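Both answers hardcode four-element tuples, while the question starts from a flat `normal_array` whose size depends on `inputCols`. A minimal plain-Scala sketch of the reshaping step for any n follows. `LabeledRows`, `labelRows` and the `columnMajor` flag are hypothetical names introduced here. The sketch assumes the array holds n*n values, and the layout is something to verify: if `matrix` is a Spark MLlib matrix, its `toArray` is documented as column-major, so `columnMajor = true` would be the setting to try.

```scala
object LabeledRows {
  // Hypothetical helper: splits a flat array of n*n values into n rows
  // and pairs each row with its label, where n = labels.length.
  // Assumption: values are row-major unless columnMajor is set.
  def labelRows(
      values: Array[Double],
      labels: Array[String],
      columnMajor: Boolean = false
  ): Vector[(String, Vector[Double])] = {
    val n = labels.length
    require(values.length == n * n, s"expected ${n * n} values, got ${values.length}")
    // Cut the flat array into consecutive chunks of length n
    val chunks = values.grouped(n).map(_.toVector).toVector
    // For column-major input each chunk is a column, so transpose to get rows
    val rows = if (columnMajor) chunks.transpose else chunks
    labels.toVector.zip(rows)
  }
}

val flat = Array(
  0.0, 0.4, 0.4, 0.0,
  0.1, 0.0, 0.0, 0.7,
  0.0, 0.2, 0.0, 0.3,
  0.3, 0.0, 0.0, 0.0
)
val labels = Array("p1", "p2", "p3", "p4")

val rowMajor = LabeledRows.labelRows(flat, labels)
val colMajor = LabeledRows.labelRows(flat, labels, columnMajor = true)
```

Each resulting pair holds the index label and the row values, which can then be turned into a DataFrame with whichever of the approaches above fits. That Spark step is left out here so the sketch runs without a SparkSession.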

