Scala - How to "negative select" columns in Spark's DataFrame
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original author (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/31434886/
How to "negative select" columns in spark's dataframe
Asked by Blaubaer
I can't figure it out, but guess it's simple. I have a Spark dataframe df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns of this df:
column_names = Array("A","B","C")
I'd like to do a df.select() in such a way that I can specify which columns not to select.
Example: let's say I do not want to select column "B". I tried
df.select(column_names.filter(_!="B"))
but this does not work, as
org.apache.spark.sql.DataFrame cannot be applied to (Array[String])
So, here it says it should work with a Seq instead. However, trying
df.select(column_names.filter(_!="B").toSeq)
results in
org.apache.spark.sql.DataFrame cannot be applied to (Seq[String]).
What am I doing wrong?
Answered by zero323
Since Spark 1.4 you can use the drop method:
Scala:
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")
Python:
df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
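As a side note not in the original answer: from Spark 2.0 onwards drop also accepts several column names at once, so (reusing the Scala df above) a whole list of unwanted columns can be removed in one call:
// Spark 2.0+ sketch: drop(colNames: String*) removes every listed column in one call.
val unwanted = Seq("y")   // hypothetical list of columns to remove
df.drop(unwanted: _*)     // DataFrame[x: int]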
Answered by Edi Bice
I had the same problem and solved it this way (oaffdf is a dataframe):
val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)
https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
Answered by Francois G
OK, it's ugly, but this quick spark shell session shows something that works:
scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]
scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
scala> val myotherDF = myOtherRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]
scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]
scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)
scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)
scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]
Providing varargs to a function that requires them is treated in other SO questions. The signature of select is there to ensure your list of selected columns is not empty – which makes the conversion from the list of selected columns to varargs a bit more complex.
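One way to sidestep the non-empty requirement (a sketch building on the shell session above, not part of the original answer) is to map the names to Column objects and use the select(Column*) overload, which takes a Seq directly:
import org.apache.spark.sql.functions.col
// Same negative selection as above, but via select(Column*): no head/tail split needed.
val keep = cols.filter(_ != "b")             // Array("a")
val onlyA = twocol.select(keep.map(col): _*)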
Answered by asmaier
In pyspark you can do
df.select(list(set(df.columns) - set(["B"])))
Using more than one line, you can also do
cols = df.columns
cols.remove("B")
df.select(cols)
Answered by oluies
val columns = Seq("A","B","C")
// select(String, String*) does not accept a Seq directly, so expand the remainder into varargs:
val remaining = columns.diff(Seq("B"))
df.select(remaining.head, remaining.tail: _*)
Answered by Tagar
This will be possible to do through [SPARK-12139] REGEX Column Specification for Hive Queries
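For context (my assumption, not stated in the original answer): SPARK-12139 shipped around Spark 2.3 behind the spark.sql.parser.quotedRegexColumnNames flag, so a back-ticked regex can express the same negative selection in plain SQL:
// Sketch: turn on quoted-regex column names (off by default), then reuse the
// possessive-regex trick from the colRegex answer below to keep every column except B.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
df.createOrReplaceTempView("t")
spark.sql("SELECT `(B)?+.+` FROM t").show()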
Answered by Ani Menon
For Spark v1.4 and higher, use drop(*cols) -
Returns a new DataFrame without the specified column(s).
Example -
df.drop('age').collect()
For Spark v2.3 and higher you could also do it using colRegex(colName) -
Selects column based on the column name specified as a regex and returns it as Column.
Example -
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
For older versions of Spark, take the list of columns in the dataframe, then remove the columns you want to drop from it (maybe using set operations) and then use select to pick the resultant list.
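A minimal sketch of that older-version approach (my illustration, assuming a DataFrame df and that at least one column remains):
// Compute the complement of the unwanted columns, then select it.
val toDrop = Set("B")
val kept = df.columns.filterNot(toDrop.contains)   // e.g. Array("A", "C")
df.select(kept.head, kept.tail: _*)                // select(String, String*) overload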
Answered by danny
// selectWithout allows you to specify which columns to omit:
df.selectWithout("B")
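selectWithout is not part of the stock DataFrame API, so it presumably comes from an implicit extension the answer does not show; a minimal sketch of how such an enrichment could look (my assumption, not the answer's actual code):
import org.apache.spark.sql.DataFrame
// Hypothetical enrichment supplying the selectWithout syntax used above;
// it simply delegates to drop, which ignores names that do not exist.
implicit class SelectWithoutOps(df: DataFrame) {
  def selectWithout(colNames: String*): DataFrame = df.drop(colNames: _*)
}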

