Remove blank space from data frame column values in Spark

Note: this page is a translated copy of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.

Original source: http://stackoverflow.com/questions/35540974/
Asked by Iz M
I have a data frame (business_df) with the following schema:
|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
|-- city: string (nullable = true)
|-- full_address: string (nullable = true)
|-- hours: struct (nullable = true)
|-- name: string (nullable = true)
I want to make a new data frame (new_df) so that the values in the 'name' column do not contain any blank spaces.
My code is:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
new_df.registerTempTable("vegas")
new_df.printSchema()
vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()
I keep receiving this error:
NameError: global name 'replace' is not defined
What's wrong with this code?
Accepted answer by zero323
While the problem you've described is not reproducible with the provided code, using Python UDFs to handle simple tasks like this is rather inefficient. If you simply want to remove spaces from the text, use regexp_replace:
from pyspark.sql.functions import regexp_replace, col
df = sc.parallelize([
(1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])
df.select(regexp_replace(col("v"), " ", ""))
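For reference, a quick check on the sample frame above (a sketch; an alias is added here because the unaliased expression gets an auto-generated column name):

# Sketch: expected output follows from the sample data defined above
df.select(regexp_replace(col("v"), " ", "").alias("v")).show()
# +------+
# |     v|
# +------+
# |foobar|
# |foobar|
# |      |
# +------+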
If you want to normalize empty lines, use trim:
from pyspark.sql.functions import trim
df.select(trim(col("v")))
If you want to keep leading/trailing spaces, you can adjust regexp_replace:
df.select(regexp_replace(col("v"), r"^\s+$", ""))
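To see the difference on the same sample data, a minimal sketch: trim strips leading and trailing spaces from every value, while the anchored pattern only blanks out values that consist entirely of whitespace:

from pyspark.sql.functions import regexp_replace, trim, col

df.select(
    trim(col("v")).alias("trimmed"),                          # "foobar " -> "foobar"
    regexp_replace(col("v"), r"^\s+$", "").alias("blanked")   # " " -> "", "foobar " kept as-is
).show()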
Answered by Alberto Bonsanto
As @zero323 said, you have probably shadowed the replace function somewhere. I tested your code and it works perfectly.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = sqlContext.createDataFrame([("aaa 111",), ("bbb 222",), ("ccc 333",)], ["names"])
spaceDeleteUDF = udf(lambda s: s.replace(" ", ""), StringType())
df.withColumn("names", spaceDeleteUDF("names")).show()
#+------+
#| names|
#+------+
#|aaa111|
#|bbb222|
#|ccc333|
#+------+
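For illustration, a hypothetical way the NameError in the question could arise: if a bare name replace were referenced inside the lambda (rather than the string method s.replace), Python would look it up as a global when the UDF executes:

# Hypothetical: 'replace' as a free variable, not the str method
# -> "NameError: global name 'replace' is not defined" when the UDF runs
badUDF = udf(lambda s: replace(s, " ", ""), StringType())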
Answered by Powers
Here's a function that removes all whitespace in a string:
import pyspark.sql.functions as F

def remove_all_whitespace(col):
    # Replace every run of whitespace characters with the empty string
    return F.regexp_replace(col, r"\s+", "")
You can use the function like this:
actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)
The remove_all_whitespace function is defined in the quinn library. quinn also defines single_space and anti_trim methods to manage whitespace. PySpark defines ltrim, rtrim, and trim methods to manage whitespace.
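For comparison, a minimal sketch using only the PySpark built-ins mentioned above (assuming an active spark session):

from pyspark.sql.functions import ltrim, rtrim, trim, col

df = spark.createDataFrame([("  foo bar  ",)], ["words"])
df.select(
    ltrim(col("words")).alias("l"),   # "foo bar  "  (leading spaces removed)
    rtrim(col("words")).alias("r"),   # "  foo bar"  (trailing spaces removed)
    trim(col("words")).alias("t")     # "foo bar"    (both removed)
).show()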
Answered by Andre Carneiro
I think the solution using regexp_replace is too slow, even for small amounts of data! So I've tried to find another way, and I think I found it!
Not beautiful, a little naive, but it's fast! What do you think?
from pyspark.sql import Row
from pyspark.sql.functions import ltrim, rtrim

def normalizeSpace(df, colName):
    # Left and right trim
    df = df.withColumn(colName, ltrim(df[colName]))
    df = df.withColumn(colName, rtrim(df[colName]))

    # This is faster than the regexp_replace function!
    def normalize(row, colName):
        data = row.asDict()
        text = data[colName]
        words = []
        word = ''
        for char in text:
            if char != ' ':
                word += char
            elif word == '' and char == ' ':
                continue
            else:
                words.append(word)
                word = ''
        if word != '':
            # Keep the final word (the loop only appends when it hits a space)
            words.append(word)
        if len(words) > 0:
            data[colName] = ' '.join(words)
        return Row(**data)

    df = df.rdd.map(lambda row: normalize(row, colName)).toDF()
    return df
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField('name', StringType())])
rows = [Row(name=' dvd player samsung hdmi hdmi 160W reais de potencia bivolt ')]
df = spark.createDataFrame(rows, schema)
df = normalizeSpace(df, 'name')
df.show(df.count(), False)
That prints:
+----------------------------------------------------------+
|name                                                      |
+----------------------------------------------------------+
|dvd player samsung hdmi hdmi 160W reais de potencia bivolt|
+----------------------------------------------------------+
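For comparison, the same normalization can be written with built-ins (a sketch; whether the RDD version above is actually faster would need measuring on your own data):

from pyspark.sql.functions import regexp_replace, trim

# Sketch: collapse each run of whitespace to one space, then strip both ends
df = spark.createDataFrame(rows, schema)
df = df.withColumn('name', trim(regexp_replace(df['name'], r'\s+', ' ')))
df.show(df.count(), False)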
Answered by Horbaje
As shown by @Powers, there is a very nice and easy-to-read function to remove white spaces, provided by a package called quinn. You can find it here: https://github.com/MrPowers/quinn. Here are the instructions on how to install it if working on a Databricks workspace: https://docs.databricks.com/libraries.html
Here, again, is an illustration of how it works:
# import libraries
import quinn
from pyspark.sql.functions import col

# create an example dataframe
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])

# function call to remove whitespace. Note, withColumn will replace column v if it already exists
df = df.withColumn(
    "v",
    quinn.remove_all_whitespace(col("v"))
)
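Assuming quinn's remove_all_whitespace behaves like the regexp_replace version defined in the earlier answer, the expected result would be:

# Expected (hypothetical, given the definition above):
# "foo bar" -> "foobar", "foobar " -> "foobar", " " -> ""
df.show()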