How to replace null values with a specific value in a DataFrame using Spark in Java?

Question

Asked by PirateHyman

I am trying to improve the accuracy of a logistic regression algorithm implemented in Spark using Java. To do this, I want to replace the null or invalid values in a column with the most frequent value of that column. For example:

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1

In this case I'll replace all the null values in column "Name" with 'a' and in column "Place" with 'a2'. So far I have only been able to extract the most frequent value in a particular column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?

Answer 1

Accepted answer by Rami

You can use the .na.fill function (it is defined in org.apache.spark.sql.DataFrameNaFunctions).

Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame

You choose the columns, and you choose the value that should replace null or NaN in them.

In your case it will be something like:

val df2 = df.na.fill("a", Seq("Name"))
            .na.fill("a2", Seq("Place"))

Answer 2

Answered by ktheitroadalo

You can use DataFrame.na.fill() to replace nulls with a given value. To update several columns at once, you can pass a map from column name to replacement value:

val map = Map("Name" -> "a", "Place" -> "a2")

df.na.fill(map).show()

But if you also want to replace bad records, you need to identify them first. You can do this by matching the column against a regular expression with the rlike function (like takes SQL wildcard patterns rather than regular expressions).
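As a rough illustration of the validation step, here is a minimal sketch in plain Java (not Spark) that turns values failing a regular expression into null, so that a later fill can replace them. The class name, the method name, and the validity pattern are all assumptions made for this example; the question does not say what counts as an invalid value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class BadRecordFilter {
    // Hypothetical validity rule: one lowercase letter followed by optional digits, e.g. "a" or "a2".
    static final Pattern VALID = Pattern.compile("[a-z][0-9]*");

    // Return a copy of the column in which null, blank, or non-matching values become null.
    static List<String> invalidToNull(List<String> column) {
        List<String> cleaned = new ArrayList<>(column.size());
        for (String value : column) {
            boolean ok = value != null && VALID.matcher(value.trim()).matches();
            cleaned.add(ok ? value.trim() : null);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        List<String> place = Arrays.asList("a1", " a2 ", "??", "", null, "c1");
        System.out.println(invalidToNull(place)); // prints [a1, a2, null, null, null, c1]
    }
}
```

In Spark itself the same idea would be expressed on the column rather than on a Java list; this sketch only shows the rule being applied.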

Answer 3

Answered by Dan Carter

You'll want to use the fill(String value, String[] columns) method, reached through your dataframe's na() accessor, which replaces null values in the given list of columns with the value you specify.

So if you already know the value that you want to replace null with:

// In Java, na() is a method call; fill replaces nulls only in the listed columns.
String[] colNames = {"Name"};
dataframe = dataframe.na().fill("a", colNames);

You can do the same for the rest of your columns.

Answer 4

Answered by PirateHyman

To replace the null values with a given string, I used the fill function available in Spark for Java. It accepts the replacement value and a sequence of column names. Here is how I implemented it:

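// cols[i] is the column to clean and word is its replacement value (both defined elsewhere in the author's code).
List<String> colList = new ArrayList<String>();
colList.add(cols[i]);
// Convert the Java list to a Scala Seq, which this fill overload expects.
Seq<String> colSeq = scala.collection.JavaConverters.asScalaIteratorConverter(colList.iterator()).asScala().toSeq();
data = data.na().fill(word, colSeq);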
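For readers who want to see the two steps from the question side by side (find the most frequent value, then substitute it for the missing ones), here is a minimal sketch in plain Java on in-memory lists. It does not use Spark; the class and method names are hypothetical, and treating blank strings as missing is an assumption. The sample data mirrors the Name and Place columns from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ModeFill {
    // Assumption: null and blank strings both count as missing.
    static boolean isMissing(String value) {
        return value == null || value.trim().isEmpty();
    }

    // Return the most frequent non-missing value, or null if every entry is missing.
    // On a tie, the value that reaches the top count first wins.
    static String mostFrequent(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String value : column) {
            if (isMissing(value)) {
                continue;
            }
            int count = counts.merge(value, 1, Integer::sum);
            if (count > bestCount) {
                best = value;
                bestCount = count;
            }
        }
        return best;
    }

    // Return a copy of the column with every missing entry replaced.
    static List<String> fillMissing(List<String> column, String replacement) {
        List<String> filled = new ArrayList<>(column.size());
        for (String value : column) {
            filled.add(isMissing(value) ? replacement : value);
        }
        return filled;
    }

    public static void main(String[] args) {
        List<String> name = Arrays.asList("a", "a", "a", null, "b", "c", "c", null, "d");
        List<String> place = Arrays.asList("a1", "a2", "a2", "d1", "a2", "a2", null, null, "c1");
        System.out.println(fillMissing(name, mostFrequent(name)));   // prints [a, a, a, a, b, c, c, a, d]
        System.out.println(fillMissing(place, mostFrequent(place))); // prints [a1, a2, a2, d1, a2, a2, a2, a2, c1]
    }
}
```

With a Spark dataframe, the value returned by the first step would be passed as the replacement argument to fill for the matching column, as in the answers above.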
