Python: remove blank space from data frame column values in Spark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35540974/

Remove blank space from data frame column values in Spark

python, apache-spark, dataframe, apache-spark-sql

Asked by Iz M

I have a data frame (business_df) with the following schema:

|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- city: string (nullable = true)
|-- full_address: string (nullable = true)
|-- hours: struct (nullable = true)
|-- name: string (nullable = true)

I want to make a new data frame (new_df) so that the values in the 'name' column do not contain any blank spaces.

My code is:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
new_df.registerTempTable("vegas")
new_df.printSchema()
vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()

I keep receiving this error:

NameError: global name 'replace' is not defined

What's wrong with this code?

Accepted answer by zero323

While the problem you've described is not reproducible with the provided code, using Python UDFs to handle simple tasks like this is rather inefficient. If you want to simply remove spaces from the text, use regexp_replace:

from pyspark.sql.functions import regexp_replace, col

df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

df.select(regexp_replace(col("v"), " ", ""))
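
For reference (an addition, not part of the original answer): adding an alias and .show() to the line above makes the result visible, and keeps Spark's auto-generated column name readable. For the example data, the output would be:

df.select(regexp_replace(col("v"), " ", "").alias("v")).show()

#+------+
#|     v|
#+------+
#|foobar|
#|foobar|
#|      |
#+------+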

If you want to normalize empty lines, use trim:

from pyspark.sql.functions import trim

df.select(trim(col("v")))

If you want to keep leading/trailing spaces, you can adjust regexp_replace:

df.select(regexp_replace(col("v"), r"^\s+$", ""))
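
To be explicit about what that pattern does (a clarifying note, not the original author's): ^\s+$ only matches strings that consist entirely of whitespace, so values with embedded or trailing spaces pass through unchanged, while a blank string becomes empty:

df.select(regexp_replace(col("v"), r"^\s+$", "").alias("v")).show()

#+-------+
#|      v|
#+-------+
#|foo bar|
#|foobar |
#|       |
#+-------+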

Answer by Alberto Bonsanto

As @zero323 said, you have probably shadowed the replace function somewhere. I tested your code and it works perfectly.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sqlContext.createDataFrame([("aaa 111",), ("bbb 222",), ("ccc 333",)], ["names"])
spaceDeleteUDF = udf(lambda s: s.replace(" ", ""), StringType())
df.withColumn("names", spaceDeleteUDF("names")).show()

#+------+
#| names|
#+------+
#|aaa111|
#|bbb222|
#|ccc333|
#+------+
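
One caveat worth adding (not in the original answer): the lambda above raises an AttributeError if the column contains NULL values, because None has no replace method. A null-safe variant of the same UDF:

# Pass NULLs through instead of crashing the executor.
spaceDeleteUDF = udf(lambda s: s.replace(" ", "") if s is not None else None, StringType())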

Answer by Powers

Here's a function that removes all whitespace in a string:

import pyspark.sql.functions as F

def remove_all_whitespace(col):
    return F.regexp_replace(col, r"\s+", "")

You can use the function like this:

# required imports (not shown in the original answer)
import quinn
from pyspark.sql.functions import col

actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)

The remove_all_whitespace function is defined in the quinn library. quinn also defines single_space and anti_trim methods to manage whitespace. PySpark defines ltrim, rtrim, and trim methods to manage whitespace.

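For completeness, a minimal sketch of those built-in functions (assuming the df with string column "v" from the accepted answer):

from pyspark.sql.functions import col, ltrim, rtrim, trim

df.select(
    ltrim(col("v")).alias("left_trimmed"),   # strips leading spaces only
    rtrim(col("v")).alias("right_trimmed"),  # strips trailing spaces only
    trim(col("v")).alias("trimmed")          # strips both ends
).show()
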
Answer by Andre Carneiro

I think the solution using regexp_replace is too slow, even for a small amount of data! So I tried to find another way, and I think I found it!

Not beautiful, a little naive, but it's fast! What do you think?

from pyspark.sql import Row
from pyspark.sql.functions import ltrim, rtrim
from pyspark.sql.types import StructType, StructField, StringType

def normalizeSpace(df, colName):

    # Left and right trim
    df = df.withColumn(colName, ltrim(df[colName]))
    df = df.withColumn(colName, rtrim(df[colName]))

    # This is faster than the regexp_replace function!
    def normalize(row, colName):
        data = row.asDict()
        text = data[colName]
        words = []
        word = ''

        for char in text:
            if char != ' ':
                word += char
            elif word == '' and char == ' ':
                continue
            else:
                words.append(word)
                word = ''

        # After rtrim the text does not end in a space, so the last
        # word is still pending when the loop finishes.
        if word != '':
            words.append(word)

        if len(words) > 0:
            data[colName] = ' '.join(words)

        return Row(**data)

    df = df.rdd.map(lambda row: normalize(row, colName)).toDF()
    return df

schema = StructType([StructField('name', StringType())])
rows = [Row(name='  dvd player samsung   hdmi hdmi 160W reais    de potencia bivolt   ')]
df = spark.createDataFrame(rows, schema)
df = normalizeSpace(df, 'name')
df.show(df.count(), False)

That prints

+----------------------------------------------------------+
|name                                                      |
+----------------------------------------------------------+
|dvd player samsung hdmi hdmi 160W reais de potencia bivolt|
+----------------------------------------------------------+
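
For comparison (an addition, not part of this answer): the same single-space normalization can be written with the built-in functions from the accepted answer, which avoids serializing every row to Python for the RDD map:

from pyspark.sql.functions import regexp_replace, trim

# Collapse each run of whitespace to a single space, then trim both ends.
df = df.withColumn('name', trim(regexp_replace(df['name'], r'\s+', ' ')))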

Answer by Horbaje

As shown by @Powers, there is a very nice and easy-to-read function to remove white space, provided by a package called quinn. You can find it here: https://github.com/MrPowers/quinn. Here are the instructions on how to install it if you are working in a Databricks workspace: https://docs.databricks.com/libraries.html

Here, again, is an illustration of how it works:

#import library
import quinn
from pyspark.sql.functions import col  # needed for the col("v") call below

#create an example dataframe
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

#function call to remove whitespace. Note, withColumn will replace column v if it already exists
df = df.withColumn(
    "v",
    quinn.remove_all_whitespace(col("v"))
)

The output (shown as a screenshot in the original post):
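
Reconstructed here as text (an addition, derived from the example data above, assuming remove_all_whitespace strips all whitespace as its name suggests):

#+---+------+
#|  k|     v|
#+---+------+
#|  1|foobar|
#|  2|foobar|
#|  3|      |
#+---+------+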