Scala: replacing whitespace in all column names in a Spark DataFrame

Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/36018072/

Replacing whitespace in all column names in a Spark DataFrame

Tags: scala, apache-spark, apache-spark-sql, spark-dataframe

Asked by vdep

I have a Spark DataFrame with whitespace in some of the column names, which has to be replaced with underscores.

I know a single column can be renamed using withColumnRenamed() in Spark SQL, but to rename 'n' columns, this function has to be chained 'n' times (to my knowledge), as in the example below.
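
For example, for a hypothetical df with two affected columns named "first name" and "last name", the manual chaining would look like:

val renamed = df
  .withColumnRenamed("first name", "first_name")
  .withColumnRenamed("last name", "last_name")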

To automate this, I have tried:

val old_names = df.columns          // array of the current column names

val new_names = old_names.map { x =>
  if (x.contains(" "))
    x.replaceAll("\\s", "_")
  else x
}                                   // new column names with whitespace replaced by underscores

Now, how do I replace df's header with new_names?
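
(A sketch of one direct option, assuming new_names is the array built above: toDF takes one new name per existing column, in order, and returns a DataFrame with the columns renamed.)

// Apply all the new names in a single call
val renamedDf = df.toDF(new_names: _*)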

Answered by Igor Berman

var newDf = df
for (col <- df.columns) {
  newDf = newDf.withColumnRenamed(col, col.replaceAll("\\s", "_"))
}

You can encapsulate it in a method so it doesn't cause too much pollution, as sketched below.
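
A minimal sketch of such a method (the name renameColumns is just illustrative):

import org.apache.spark.sql.DataFrame

// Hypothetical helper wrapping the rename loop above
def renameColumns(df: DataFrame): DataFrame = {
  var newDf = df
  for (col <- df.columns) {
    newDf = newDf.withColumnRenamed(col, col.replaceAll("\\s", "_"))
  }
  newDf
}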

Answered by kanielc

As a best practice, you should prefer expressions and immutability. You should use val and not var as much as possible.

Thus, it's preferable to use the foldLeft operator in this case:

val newDf = df.columns
              .foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
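
For example, with a hypothetical DataFrame (this assumes a SparkSession named spark is in scope):

import spark.implicits._

val df = Seq((1, "Maracaibo")).toDF("id name", "city venezuela")
val renamed = df.columns
                .foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
renamed.columns   // Array(id_name, city_venezuela)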

Answered by Hugo Reyes

In Python, this can be done with the following code:

# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col

# Building a simple dataframe:
schema = StructType([
             StructField("id name", StringType(), True),
             StructField("cities venezuela", StringType(), True)
         ])

column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']

# DataFrame (assumes a SQLContext is available as sqlContext, e.g. in the PySpark shell):
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()

# Alias each column, replacing spaces with underscores, in a single select:
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()
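
For reference, a sketch of the same select/alias technique in Scala (assuming the df from the question):

import org.apache.spark.sql.functions.col

// One alias expression per column, applied in a single projection
val exprs = df.columns.map(c => col(c).alias(c.replaceAll("\\s", "_")))
df.select(exprs: _*).show()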

Answered by Victor Kironde

You can do the exact same thing in Python:

# Rename each column in a loop, replacing spaces with underscores
raw_data1 = raw_data
for col in raw_data.columns:
    raw_data1 = raw_data1.withColumnRenamed(col, col.replace(" ", "_"))

Answered by Ajay Ahuja

In Scala, here is another way of achieving the same:

import org.apache.spark.sql.types._

// Rebuild the schema with renamed fields, then recreate the DataFrame from the same RDD
val df_with_newColumns = spark.createDataFrame(
  df.rdd,
  StructType(df.schema.map(s =>
    StructField(s.name.replaceAll(" ", "_"), s.dataType, s.nullable))))
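
Note that this renames every column in a single pass by rebuilding the schema and recreating the DataFrame from the underlying RDD, rather than chaining withColumnRenamed once per column.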

Hope this helps!