Scala: replacing whitespace in all column names in a Spark DataFrame

Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/36018072/

Replacing whitespace in all column names in a Spark DataFrame

Tags: scala, apache-spark, apache-spark-sql, spark-dataframe

Asked by vdep

I have a Spark DataFrame with whitespace in some of the column names, which has to be replaced with underscores.

I know a single column can be renamed using withColumnRenamed() in Spark SQL, but to rename 'n' columns, this function has to be chained 'n' times (to my knowledge), as in the example below.
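
For example, for a hypothetical df with two affected columns named "first name" and "last name", the manual chaining would look like:

val renamed = df
  .withColumnRenamed("first name", "first_name")
  .withColumnRenamed("last name", "last_name")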

To automate this, I have tried:

val old_names = df.columns          // array of the current column names

val new_names = old_names.map { x =>
  if (x.contains(" "))
    x.replaceAll("\\s", "_")
  else x
}                                   // new column names with whitespace replaced by underscores

Now, how do I replace df's header with new_names?
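
(A sketch of one direct option, assuming new_names is the array built above: toDF takes one new name per existing column, in order, and returns a DataFrame with the columns renamed.)

// Apply all the new names in a single call
val renamedDf = df.toDF(new_names: _*)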

Answered by Igor Berman

var newDf = df
for (col <- df.columns) {
  newDf = newDf.withColumnRenamed(col, col.replaceAll("\\s", "_"))
}

You can encapsulate it in a method so it doesn't cause too much pollution, as sketched below.
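
A minimal sketch of such a method (the name renameColumns is just illustrative):

import org.apache.spark.sql.DataFrame

// Hypothetical helper wrapping the rename loop above
def renameColumns(df: DataFrame): DataFrame = {
  var newDf = df
  for (col <- df.columns) {
    newDf = newDf.withColumnRenamed(col, col.replaceAll("\\s", "_"))
  }
  newDf
}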

Answered by kanielc

As a best practice, you should prefer expressions and immutability. You should use val and not var as much as possible.

Thus, it's preferable to use the foldLeft operator in this case:

val newDf = df.columns
              .foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
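
For example, with a hypothetical DataFrame (this assumes a SparkSession named spark is in scope):

import spark.implicits._

val df = Seq((1, "Maracaibo")).toDF("id name", "city venezuela")
val renamed = df.columns
                .foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
renamed.columns   // Array(id_name, city_venezuela)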

Answered by Hugo Reyes

In Python, this can be done with the following code:

# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col

# Building a simple dataframe:
schema = StructType([
             StructField("id name", StringType(), True),
             StructField("cities venezuela", StringType(), True)
         ])

column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']

# DataFrame (assumes a SQLContext is available as sqlContext, e.g. in the PySpark shell):
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()

# Alias each column, replacing spaces with underscores, in a single select:
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()
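
For reference, a sketch of the same select/alias technique in Scala (assuming the df from the question):

import org.apache.spark.sql.functions.col

// One alias expression per column, applied in a single projection
val exprs = df.columns.map(c => col(c).alias(c.replaceAll("\\s", "_")))
df.select(exprs: _*).show()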

Answered by Victor Kironde

You can do the exact same thing in Python:

# Rename each column in a loop, replacing spaces with underscores
raw_data1 = raw_data
for col in raw_data.columns:
    raw_data1 = raw_data1.withColumnRenamed(col, col.replace(" ", "_"))

Answered by Ajay Ahuja

In Scala, here is another way of achieving the same:

import org.apache.spark.sql.types._

// Rebuild the schema with renamed fields, then recreate the DataFrame from the same RDD
val df_with_newColumns = spark.createDataFrame(
  df.rdd,
  StructType(df.schema.map(s =>
    StructField(s.name.replaceAll(" ", "_"), s.dataType, s.nullable))))
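
Note that this renames every column in a single pass by rebuilding the schema and recreating the DataFrame from the underlying RDD, rather than chaining withColumnRenamed once per column.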

Hope this helps!