Original source: http://stackoverflow.com/questions/44671597/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to replace null values with a specific value in Dataframe using spark in Java?
Asked by PirateHyman
I am trying to improve the accuracy of a logistic regression algorithm implemented in Spark using Java. To do this, I want to replace the null or invalid values in a column with the most frequent value of that column. For example:
Name|Place
a |a1
a |a2
a |a2
|d1
b |a2
c |a2
c |
|
d |c1
In this case I'll replace all the NULL values in column "Name" with 'a' and in column "Place" with 'a2'. So far I am only able to extract the most frequent value of a particular column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?
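To make the intended two-step logic concrete, here is a minimal sketch in plain Java collections rather than the Spark API. The class name ModeFill and its methods are hypothetical, and it assumes that a blank string counts as a missing value, as in the table above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the logic only; this is not part of Spark.
class ModeFill {
    // Assumption: null and blank strings both count as missing.
    static boolean isMissing(String v) {
        return v == null || v.trim().isEmpty();
    }

    // Step 1: find the most frequent non-missing value (null if there is none).
    // On a tie, the value that reaches the highest count first wins.
    static String mostFrequent(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String v : column) {
            if (isMissing(v)) {
                continue;
            }
            int c = counts.merge(v, 1, Integer::sum);
            if (c > bestCount) {
                best = v;
                bestCount = c;
            }
        }
        return best;
    }

    // Step 2: return a copy of the column with missing values replaced.
    static List<String> fillMissing(List<String> column, String replacement) {
        List<String> out = new ArrayList<>(column.size());
        for (String v : column) {
            out.add(isMissing(v) ? replacement : v);
        }
        return out;
    }

    public static void main(String[] args) {
        // The "Name" column from the example table.
        List<String> names = Arrays.asList("a", "a", "a", null, "b", "c", "c", null, "d");
        String mode = mostFrequent(names);
        System.out.println(mode);
        System.out.println(fillMissing(names, mode));
    }
}
```

In Spark the first step would be an aggregation over the column and the second step a fill on the dataframe; the sketch only shows what result is expected for each column.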
Accepted answer by Rami
You can use the .na.fill function (it is a function in org.apache.spark.sql.DataFrameNaFunctions).
Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame
You choose the columns, and you choose the value that should replace null or NaN.
In your case it will be something like:
val df2 = df.na.fill("a", Seq("Name"))
.na.fill("a2", Seq("Place"))
Answer by ktheitroadalo
You can use DataFrame.na.fill() to replace nulls with some value. To update several columns at once you can do:
val map = Map("Name" -> "a", "Place" -> "a2")
df.na.fill(map).show()
But if you want to replace bad records too, you need to identify the bad records first. You can do this by matching the column against a pattern with the like function (or rlike, which accepts a regular expression).
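As a rough illustration of "validate first, then replace", the sketch below uses plain Java and java.util.regex instead of Spark column functions. The class BadRecordFill and the validity pattern (a letter followed by letters or digits) are assumptions made for the example; the question does not say what counts as an invalid value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the logic only; this is not part of Spark.
class BadRecordFill {
    // Assumed validity rule: one letter followed by any number of letters or digits.
    static final Pattern VALID = Pattern.compile("[A-Za-z][A-Za-z0-9]*");

    // A value is valid when it is non-null and its trimmed form matches the pattern.
    static boolean isValid(String v) {
        return v != null && VALID.matcher(v.trim()).matches();
    }

    // Returns a copy of the column where every invalid or null value is replaced.
    static List<String> replaceInvalid(List<String> column, String replacement) {
        List<String> out = new ArrayList<>(column.size());
        for (String v : column) {
            out.add(isValid(v) ? v : replacement);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> places = Arrays.asList("a1", "??", null, " ", "a2");
        System.out.println(replaceInvalid(places, "a2"));
    }
}
```

With a dataframe, a similar effect could be reached by first turning values that fail the pattern check into null and then applying na.fill as shown above.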
Answer by Dan Carter
You'll want to use the fill(String value, String[] columns) method, available through your dataframe's na() functions, which replaces null values in the given columns with the value you specify.
So if you already know the value that you want to replace Null with...:
String[] colNames = {"Name"};
dataframe = dataframe.na().fill("a", colNames);
You can do the same for the rest of your columns.
Answer by PirateHyman
In order to replace the NULL values with a given string I've used the fill function present in Spark for Java. It accepts the replacement word and a sequence of column names. Here is how I have implemented that:
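// cols[i] (the column name) and word (its replacement value) come from surrounding code not shown here.
List<String> colList = new ArrayList<String>();
colList.add(cols[i]);
// Convert the Java list to a Scala Seq, which this overload of fill expects.
Seq<String> colSeq = scala.collection.JavaConverters.asScalaIteratorConverter(colList.iterator()).asScala().toSeq();
data = data.na().fill(word, colSeq);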