Scala - How to "negative select" columns in Spark's DataFrame
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original author (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/31434886/
How to "negative select" columns in spark's dataframe
Asked by Blaubaer
I can't figure it out, but guess it's simple. I have a Spark dataframe df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns of this df:
column_names = Array("A","B","C")
I'd like to do a df.select() in such a way that I can specify which columns not to select.
Example: let's say I do not want to select column "B". I tried
df.select(column_names.filter(_!="B"))
but this does not work, as
org.apache.spark.sql.DataFrame cannot be applied to (Array[String])
So, here it says it should work with a Seq instead. However, trying
df.select(column_names.filter(_!="B").toSeq)
results in
org.apache.spark.sql.DataFrame cannot be applied to (Seq[String]).
What am I doing wrong?
Answered by zero323
Since Spark 1.4 you can use the drop method:
Scala:
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")
Python:
df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
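As a side note not in the original answer: from Spark 2.0 onwards drop also accepts several column names at once, so (reusing the Scala df above) a whole list of unwanted columns can be removed in one call:
// Spark 2.0+ sketch: drop(colNames: String*) removes every listed column in one call.
val unwanted = Seq("y")   // hypothetical list of columns to remove
df.drop(unwanted: _*)     // DataFrame[x: int]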
Answered by Edi Bice
I had the same problem and solved it this way (oaffdf is a dataframe):
val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)
https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
Answered by Francois G
OK, it's ugly, but this quick spark shell session shows something that works:
scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]
scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
scala> val myotherDF = myOtherRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]
scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]
scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)
scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)
scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]
Providing varargs to a function that requires them is treated in other SO questions. The signature of select is there to ensure your list of selected columns is not empty – which makes the conversion from the list of selected columns to varargs a bit more complex.
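One way to sidestep the non-empty requirement (a sketch building on the shell session above, not part of the original answer) is to map the names to Column objects and use the select(Column*) overload, which takes a Seq directly:
import org.apache.spark.sql.functions.col
// Same negative selection as above, but via select(Column*): no head/tail split needed.
val keep = cols.filter(_ != "b")             // Array("a")
val onlyA = twocol.select(keep.map(col): _*)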
Answered by asmaier
In pyspark you can do
df.select(list(set(df.columns) - set(["B"])))
Using more than one line, you can also do
cols = df.columns
cols.remove("B")
df.select(cols)
Answered by oluies
val columns = Seq("A","B","C")
// select(String, String*) does not accept a Seq directly, so expand the remainder into varargs:
val remaining = columns.diff(Seq("B"))
df.select(remaining.head, remaining.tail: _*)
Answered by Tagar
This will be possible to do through [SPARK-12139] REGEX Column Specification for Hive Queries
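For context (my assumption, not stated in the original answer): SPARK-12139 shipped around Spark 2.3 behind the spark.sql.parser.quotedRegexColumnNames flag, so a back-ticked regex can express the same negative selection in plain SQL:
// Sketch: turn on quoted-regex column names (off by default), then reuse the
// possessive-regex trick from the colRegex answer below to keep every column except B.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
df.createOrReplaceTempView("t")
spark.sql("SELECT `(B)?+.+` FROM t").show()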
Answered by Ani Menon
For Spark v1.4 and higher, use drop(*cols) -
Returns a new DataFrame without the specified column(s).
Example -
df.drop('age').collect()
For Spark v2.3 and higher you could also do it using colRegex(colName) -
Selects column based on the column name specified as a regex and returns it as Column.
Example -
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
For older versions of Spark, take the list of columns in the dataframe, then remove the columns you want to drop from it (maybe using set operations) and then use select to pick the resultant list.
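A minimal sketch of that older-version approach (my illustration, assuming a DataFrame df and that at least one column remains):
// Compute the complement of the unwanted columns, then select it.
val toDrop = Set("B")
val kept = df.columns.filterNot(toDrop.contains)   // e.g. Array("A", "C")
df.select(kept.head, kept.tail: _*)                // select(String, String*) overload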
Answered by danny
// selectWithout allows you to specify which columns to omit:
df.selectWithout("B")
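selectWithout is not part of the stock DataFrame API, so it presumably comes from an implicit extension the answer does not show; a minimal sketch of how such an enrichment could look (my assumption, not the answer's actual code):
import org.apache.spark.sql.DataFrame
// Hypothetical enrichment supplying the selectWithout syntax used above;
// it simply delegates to drop, which ignores names that do not exist.
implicit class SelectWithoutOps(df: DataFrame) {
  def selectWithout(colNames: String*): DataFrame = df.drop(colNames: _*)
}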

