Python: remove blank space from data frame column values in Spark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35540974/

Remove blank space from data frame column values in Spark

python, apache-spark, dataframe, apache-spark-sql

Asked by Iz M

I have a data frame (business_df) with the following schema:

|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- city: string (nullable = true)
|-- full_address: string (nullable = true)
|-- hours: struct (nullable = true)
|-- name: string (nullable = true)

I want to make a new data frame (new_df) so that the values in the 'name' column do not contain any blank spaces.

My code is:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
new_df.registerTempTable("vegas")
new_df.printSchema()
vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()

I keep receiving this error:

NameError: global name 'replace' is not defined

What's wrong with this code?

Accepted answer by zero323

While the problem you've described is not reproducible with the provided code, using Python UDFs to handle simple tasks like this is rather inefficient. If you want to simply remove spaces from the text, use regexp_replace:

from pyspark.sql.functions import regexp_replace, col

df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

df.select(regexp_replace(col("v"), " ", ""))
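
For reference (an addition, not part of the original answer): adding an alias and .show() to the line above makes the result visible, and keeps Spark's auto-generated column name readable. For the example data, the output would be:

df.select(regexp_replace(col("v"), " ", "").alias("v")).show()

#+------+
#|     v|
#+------+
#|foobar|
#|foobar|
#|      |
#+------+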

If you want to normalize empty lines, use trim:

from pyspark.sql.functions import trim

df.select(trim(col("v")))

If you want to keep leading/trailing spaces, you can adjust regexp_replace:

df.select(regexp_replace(col("v"), r"^\s+$", ""))
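
To be explicit about what that pattern does (a clarifying note, not the original author's): ^\s+$ only matches strings that consist entirely of whitespace, so values with embedded or trailing spaces pass through unchanged, while a blank string becomes empty:

df.select(regexp_replace(col("v"), r"^\s+$", "").alias("v")).show()

#+-------+
#|      v|
#+-------+
#|foo bar|
#|foobar |
#|       |
#+-------+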

Answer by Alberto Bonsanto

As @zero323 said, you have probably shadowed the replace function somewhere. I tested your code and it works perfectly.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sqlContext.createDataFrame([("aaa 111",), ("bbb 222",), ("ccc 333",)], ["names"])
spaceDeleteUDF = udf(lambda s: s.replace(" ", ""), StringType())
df.withColumn("names", spaceDeleteUDF("names")).show()

#+------+
#| names|
#+------+
#|aaa111|
#|bbb222|
#|ccc333|
#+------+
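
One caveat worth adding (not in the original answer): the lambda above raises an AttributeError if the column contains NULL values, because None has no replace method. A null-safe variant of the same UDF:

# Pass NULLs through instead of crashing the executor.
spaceDeleteUDF = udf(lambda s: s.replace(" ", "") if s is not None else None, StringType())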

Answer by Powers

Here's a function that removes all whitespace in a string:

import pyspark.sql.functions as F

def remove_all_whitespace(col):
    return F.regexp_replace(col, r"\s+", "")

You can use the function like this:

# required imports (not shown in the original answer)
import quinn
from pyspark.sql.functions import col

actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)

The remove_all_whitespace function is defined in the quinn library. quinn also defines single_space and anti_trim methods to manage whitespace. PySpark defines ltrim, rtrim, and trim methods to manage whitespace.

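For completeness, a minimal sketch of those built-in functions (assuming the df with string column "v" from the accepted answer):

from pyspark.sql.functions import col, ltrim, rtrim, trim

df.select(
    ltrim(col("v")).alias("left_trimmed"),   # strips leading spaces only
    rtrim(col("v")).alias("right_trimmed"),  # strips trailing spaces only
    trim(col("v")).alias("trimmed")          # strips both ends
).show()
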
Answer by Andre Carneiro

I think the solution using regexp_replace is too slow, even for a small amount of data! So I tried to find another way, and I think I found it!

Not beautiful, a little naive, but it's fast! What do you think?

from pyspark.sql import Row
from pyspark.sql.functions import ltrim, rtrim
from pyspark.sql.types import StructType, StructField, StringType

def normalizeSpace(df, colName):

    # Left and right trim
    df = df.withColumn(colName, ltrim(df[colName]))
    df = df.withColumn(colName, rtrim(df[colName]))

    # This is faster than the regexp_replace function!
    def normalize(row, colName):
        data = row.asDict()
        text = data[colName]
        words = []
        word = ''

        for char in text:
            if char != ' ':
                word += char
            elif word == '' and char == ' ':
                continue
            else:
                words.append(word)
                word = ''

        # After rtrim the text does not end in a space, so the last
        # word is still pending when the loop finishes.
        if word != '':
            words.append(word)

        if len(words) > 0:
            data[colName] = ' '.join(words)

        return Row(**data)

    df = df.rdd.map(lambda row: normalize(row, colName)).toDF()
    return df

schema = StructType([StructField('name', StringType())])
rows = [Row(name='  dvd player samsung   hdmi hdmi 160W reais    de potencia bivolt   ')]
df = spark.createDataFrame(rows, schema)
df = normalizeSpace(df, 'name')
df.show(df.count(), False)

That prints

+----------------------------------------------------------+
|name                                                      |
+----------------------------------------------------------+
|dvd player samsung hdmi hdmi 160W reais de potencia bivolt|
+----------------------------------------------------------+
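
For comparison (an addition, not part of this answer): the same single-space normalization can be written with the built-in functions from the accepted answer, which avoids serializing every row to Python for the RDD map:

from pyspark.sql.functions import regexp_replace, trim

# Collapse each run of whitespace to a single space, then trim both ends.
df = df.withColumn('name', trim(regexp_replace(df['name'], r'\s+', ' ')))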

Answer by Horbaje

As shown by @Powers, there is a very nice and easy-to-read function to remove white space, provided by a package called quinn. You can find it here: https://github.com/MrPowers/quinn. Here are the instructions on how to install it if you are working in a Databricks workspace: https://docs.databricks.com/libraries.html

Here, again, is an illustration of how it works:

#import library
import quinn
from pyspark.sql.functions import col  # needed for the col("v") call below

#create an example dataframe
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

#function call to remove whitespace. Note, withColumn will replace column v if it already exists
df = df.withColumn(
    "v",
    quinn.remove_all_whitespace(col("v"))
)

The output (shown as a screenshot in the original post):
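
Reconstructed here as text (an addition, derived from the example data above, assuming remove_all_whitespace strips all whitespace as its name suggests):

#+---+------+
#|  k|     v|
#+---+------+
#|  1|foobar|
#|  2|foobar|
#|  3|      |
#+---+------+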