PySpark - String matching to create new column

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/46410887/

Date: 2020-08-19 17:37:41  Source: igfitidea

python regex apache-spark pyspark databricks

Asked by Ashley O

I have a dataframe like:

ID             Notes
2345          Checked by John
2398          Verified by Stacy
3983          Double Checked on 2/23/17 by Marsha 

Let's say for example there are only 3 employees to check: John, Stacy, or Marsha. I'd like to make a new column like so:

ID      Notes                                  Employee
2345    Checked by John                        John
2398    Verified by Stacy                      Stacy
3983    Double Checked on 2/23/17 by Marsha    Marsha

Is regex or grep better here? What kind of function should I try? Thanks!

EDIT: I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create a column for each employee, with a binary value? For example:

ID      Notes                                  John    Stacy    Marsha
2345    Checked by John                        1       0        0
2398    Verified by Stacy                      0       1        0
3983    Double Checked on 2/23/17 by Marsha    0       0        1

Answered by mrsrinivas

In short:

regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name from any position in the text column (col('Notes')) where it comes after "by" followed by one or more spaces.



In Detail:

Create a sample dataframe

data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]

# sc is the SparkContext (predefined in Databricks notebooks and the pyspark shell)
df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+

Do the needed imports

from pyspark.sql.functions import regexp_extract, col

On df, extract the Employee name from the Notes column using regexp_extract(column_name, regex, group_number).

Here the regex '(.)(by)(\s+)(\w+)' means:

  • (.) - Any single character (except newline)
  • (by) - The literal word "by" in the text
  • (\s+) - One or more whitespace characters
  • (\w+) - One or more alphanumeric or underscore characters

and group_number is 4 because the group (\w+) is in the 4th position in the expression.

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
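
The pattern can be sanity-checked without a Spark cluster. The sketch below is only an approximation: it uses Python's re module as a stand-in for the Java regex engine that Spark uses, which should behave the same for this simple pattern and ASCII sample data. The helper name extract_employee is made up for illustration.

```python
import re

# Same four-group pattern as in the answer, written as a raw string.
PATTERN = re.compile(r'(.)(by)(\s+)(\w+)')

# Sample notes copied from the answer's dataframe.
notes = [
    'Checked by John',
    'Verified by Stacy',
    'Verified by Srinivas than some random text',
    'Double Checked on 2/23/17 by Marsha',
]

def extract_employee(note):
    """Return group 4 of the first match, or '' when nothing matches
    (regexp_extract also returns an empty string in that case)."""
    match = PATTERN.search(note)
    return match.group(4) if match else ''

employees = [extract_employee(n) for n in notes]
print(employees)  # ['John', 'Stacy', 'Srinivas', 'Marsha']
```

Note that the leading (.) requires at least one character before "by", so a note that starts with "by John" would yield an empty string.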

Databricks notebook

Note:

regexp_extract(col('Notes'), r'.by\s+(\w+)', 1) looks like a much cleaner version.
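
As a rough check of the shorter pattern, the sketch below compares it with a word-boundary variant in plain Python (re stands in for Spark's Java regex engine here). The BOUNDED pattern, the helper first_group, and the "tricky" note are assumptions made up for illustration; they are not from the original answer.

```python
import re

SIMPLE = re.compile(r'.by\s+(\w+)')     # the shorter pattern from the note
BOUNDED = re.compile(r'\bby\s+(\w+)')   # hypothetical variant using a word boundary

def first_group(pattern, text):
    """Return group 1 of the first match, or '' if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else ''

sample = 'Double Checked on 2/23/17 by Marsha'
tricky = 'Sent to Derby office by Ann'  # made-up note, not from the question

print(first_group(SIMPLE, sample))   # Marsha
print(first_group(SIMPLE, tricky))   # office  ("by" inside "Derby" matches first)
print(first_group(BOUNDED, tricky))  # Ann
```

On the sample data from the question both patterns give the same result; the difference only shows up when "by" appears at the end of another word.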

Answered by ctwheels

Brief

In its simplest form, and going by the example provided, this answer should suffice. The OP should post more samples if there are cases where the name is preceded by a word other than by.



Code

Regex

^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$

Replacement

\1\t\2\t\3


Results

Input

2345          Checked by John
2398          Verified by Stacy
3983          Double Checked on 2/23/17 by Marsha 

Output

2345    Checked by John John
2398    Verified by Stacy   Stacy
3983    Double Checked on 2/23/17 by Marsha     Marsha

Note: The above output separates each column with the tab character \t, so it may not look correct to the naked eye, but using an online regex parser and inserting \t into the regex match section should show you where each column begins and ends.
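
The substitution can be reproduced in plain Python as a rough sketch. It assumes the replacement string is \1\t\2\t\3 (groups 1 to 3 separated by tabs, as the explanation in this answer describes) and uses Python's re.sub rather than the online regex tool the answer refers to.

```python
import re

# Regex from the answer: ID in group 1, the whole note in group 2,
# and the word following "by" in group 3.
LINE_RE = re.compile(r'^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$')

# Assumed replacement: the three groups joined by tab characters.
REPLACEMENT = r'\1\t\2\t\3'

rows = [
    '2345          Checked by John',
    '2398          Verified by Stacy',
    '3983          Double Checked on 2/23/17 by Marsha',
]

converted = [LINE_RE.sub(REPLACEMENT, row) for row in rows]
for line in converted:
    print(repr(line))  # repr makes the tab separators visible as \t
```

Because .* is greedy, a note containing "by" more than once would capture the word after the last occurrence.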



Explanation

Regex

  • ^ Assert position at the beginning of the line
  • (\w+) Capture one or more word characters (a-zA-Z0-9_) into group 1
  • [ \t]* Match any number of spaces or tab characters ([ \t] can be replaced with \h in some regex flavours such as PCRE)
  • (.*\bby[ \t]+(\w+)[ \t]*.*) Capture the following into group 2
    • .* Match any character (except newline unless the s modifier is used) any number of times
    • \bby Match a word boundary \b, followed by the literal text by
    • [ \t]+ Match one or more spaces or tab characters
    • (\w+) Capture one or more word characters (a-zA-Z0-9_) into group 3
    • [ \t]* Match any number of spaces or tab characters
    • .* Match any character any number of times
  • $ Assert position at the end of the line

Replacement

  • \1 Inserts the text most recently matched by the 1st capturing group
  • \t Tab character
  • \2 Inserts the text most recently matched by the 2nd capturing group
  • \t Tab character
  • \3 Inserts the text most recently matched by the 3rd capturing group

Answered by Matschek

Reading the question again, the OP may be speaking of a fixed list of employees ("Let's say for example there are only 3 employees to check: John, Stacy, or Marsha"). If this really is a known list, then the simplest way is to check against this list of names with word boundaries:

regexp_extract(col('Notes'), r'\b(John|Stacy|Marsha)\b', 1)
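
One detail worth checking when writing this in Python: the pattern needs to be a raw string (r'...'), because in an ordinary Python string '\b' is a backspace character rather than a regex word boundary. The sketch below illustrates this with Python's re module as a stand-in for Spark's regex engine; the names EMPLOYEES, NAME_RE, BROKEN_RE and extract_employee are made up for illustration.

```python
import re

# Fixed list of employees taken from the question.
EMPLOYEES = ['John', 'Stacy', 'Marsha']

# Raw-string pieces keep \b as a regex word boundary; re.escape guards
# against names that contain regex metacharacters.
NAME_RE = re.compile(r'\b(' + '|'.join(re.escape(n) for n in EMPLOYEES) + r')\b')

# Without the r prefix, Python turns '\b' into a backspace character,
# so this pattern looks for backspaces around the name instead.
BROKEN_RE = re.compile('\b(John|Stacy|Marsha)\b')

def extract_employee(note):
    """Return the first listed name found in note, or '' if none is found."""
    match = NAME_RE.search(note)
    return match.group(1) if match else ''

print(extract_employee('Double Checked on 2/23/17 by Marsha'))  # Marsha
print(extract_employee('Checked by Johnny'))                    # empty: no whole-word match
print(BROKEN_RE.search('Checked by John'))                      # None
```

The string built for NAME_RE can be passed unchanged as the pattern argument of regexp_extract, which keeps the list of names in one place.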

Answered by Avishek Bhattacharya

Something like this should work (Scala syntax):

import org.apache.spark.sql.functions._
dataFrame.withColumn("Employee", substring_index(col("Notes"), "\t", 2))

If you want to use a regex to extract the proper value, you need something like:

dataFrame.withColumn("Employee", regexp_extract(col("Notes"), "regex", <groupId>))