PySpark - String matching to create new column
Original source: http://stackoverflow.com/questions/46410887/
License: this content is provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Ashley O
I have a dataframe like:
ID    Notes
2345  Checked by John
2398  Verified by Stacy
3983  Double Checked on 2/23/17 by Marsha
Let's say for example there are only 3 employees to check: John, Stacy, or Marsha. I'd like to make a new column like so:
ID    Notes                                Employee
2345  Checked by John                      John
2398  Verified by Stacy                    Stacy
3983  Double Checked on 2/23/17 by Marsha  Marsha
Is regex or grep better here? What kind of function should I try? Thanks!
EDIT: I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create a column for each employee, with a binary value? I.e.:
ID    Notes                                John  Stacy  Marsha
2345  Checked by John                      1     0      0
2398  Verified by Stacy                    0     1      0
3983  Double Checked on 2/23/17 by Marsha  0     0      1
Answered by mrsrinivas
In short:
regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)
This expression extracts the employee name from any position in the text column (col('Notes')) where it follows by and one or more spaces.
In Detail:
Create a sample dataframe
# Assumes sc is an active SparkContext, as in the pyspark shell.
data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
Do the needed imports
from pyspark.sql.functions import regexp_extract, col
On df, extract the Employee name from the Notes column using regexp_extract(column_name, regex, group_number).
Here the regex '(.)(by)(\s+)(\w+)' means:
- (.) - Any character (except newline)
- (by) - The literal word by in the text
- (\s+) - One or more whitespace characters
- (\w+) - One or more alphanumeric or underscore characters
and group_number is 4 because the group (\w+) is in the 4th position in the expression.
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
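To see which capture group holds the name without starting Spark, the same pattern can be tried with Python's re module. This is only a rough stand-in: the helper extract is a hypothetical name, and it imitates the documented regexp_extract behaviour of returning an empty string when nothing matches. Spark uses Java regular expressions, which agree with Python's for this particular pattern.

```python
import re

PATTERN = r"(.)(by)(\s+)(\w+)"


def extract(text, pattern, group):
    """Hypothetical stand-in for regexp_extract: group of the first match, else ''."""
    match = re.search(pattern, text)
    if match is None or match.group(group) is None:
        return ""
    return match.group(group)


notes = [
    "Checked by John",
    "Verified by Stacy",
    "Verified by Srinivas than some random text",
    "Double Checked on 2/23/17 by Marsha",
    "Reviewed, no name given",  # no match, so the result is an empty string
]

for note in notes:
    # Group 4 is (\w+), the word that follows "by" and the whitespace.
    print(repr(extract(note, PATTERN, 4)))
```

Rows without a match end up with an empty string rather than null, which is worth keeping in mind when filtering on the new column.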
Note:
regexp_extract(col('Notes'), r'.by\s+(\w+)', 1)
is a much cleaner version of the same expression: it keeps only one capture group, so the group number is 1.
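One thing both versions leave open is that by is not tied to a word boundary, and a character is required in front of it. The sketch below compares the pattern from the note with a word-boundary variant; STRICT and first_group are assumptions introduced for illustration and are not part of the answer. Python's re is used here as a stand-in for Spark's Java regex engine.

```python
import re

LOOSE = r".by\s+(\w+)"     # pattern from the note
STRICT = r"\bby\s+(\w+)"   # assumption: word-boundary variant, not from the answer


def first_group(pattern, text):
    """Return group 1 of the first match, or '' when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


tricky = "Left nearby desk, checked by John"
leading = "by Stacy"

# "nearby desk" matches the loose pattern before "by John" is reached.
print(first_group(LOOSE, tricky), "|", first_group(STRICT, tricky))
# The loose pattern needs a character before "by", so a leading "by" is missed.
print(repr(first_group(LOOSE, leading)), "|", first_group(STRICT, leading))
```

If the notes never contain words ending in by, the two patterns behave the same on the sample data.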
Answered by ctwheels
Brief
In its simplest form, and going by the example provided, this answer should suffice, although the OP should post more samples if there are cases where the name is preceded by a word other than by.
Code
Regex
^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$
Replacement
\1\t\2\t\3
Results
Input
2345 Checked by John
2398 Verified by Stacy
3983 Double Checked on 2/23/17 by Marsha
Output
2345 Checked by John John
2398 Verified by Stacy Stacy
3983 Double Checked on 2/23/17 by Marsha Marsha
Note: The above output separates each column with the tab character \t, so it may not look correct to the naked eye. Using an online regex parser and inserting \t into the regex match section should show you where each column begins and ends.
Explanation
Regex
- ^ - Assert position at the beginning of the line
- (\w+) - Capture one or more word characters (a-zA-Z0-9_) into group 1
- [ \t]* - Match any number of spaces or tab characters ([ \t] can be replaced with \h in some regex flavours such as PCRE)
- (.*\bby[ \t]+(\w+)[ \t]*.*) - Capture the following into group 2
  - .* - Match any character (except newline unless the s modifier is used) any number of times
  - \bby - Match a word boundary \b, followed by by literally
  - [ \t]+ - Match one or more spaces or tab characters
  - (\w+) - Capture one or more word characters (a-zA-Z0-9_) into group 3
  - [ \t]* - Match any number of spaces or tab characters
  - .* - Match any character any number of times
- $ - Assert position at the end of the line
Replacement
- \1 - Matches the same text as most recently matched by the 1st capturing group
- \t - Tab character
- \2 - Matches the same text as most recently matched by the 2nd capturing group
- \t - Tab character
- \3 - Matches the same text as most recently matched by the 3rd capturing group
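The regex and replacement can be tried out in plain Python as a quick check. This is a sketch under two assumptions: the replacement is group 1, a tab, group 2, a tab, then group 3, as the explanation describes, and the input columns are separated by a single space. The names PATTERN, REPLACEMENT and raw are illustrative only.

```python
import re

PATTERN = r"^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$"
REPLACEMENT = r"\1\t\2\t\3"  # ID, tab, full note, tab, employee name

raw = "\n".join([
    "2345 Checked by John",
    "2398 Verified by Stacy",
    "3983 Double Checked on 2/23/17 by Marsha",
])

# re.MULTILINE makes ^ and $ match at the start and end of every line.
converted = re.sub(PATTERN, REPLACEMENT, raw, flags=re.MULTILINE)

for line in converted.splitlines():
    # Splitting on the tab shows where each column begins and ends.
    print(line.split("\t"))
```

Splitting each output line on the tab makes the three columns visible, which plain printing would hide.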
Answered by Matschek
Reading the question again, the OP may be talking about a fixed list of employees ("Let's say for example there are only 3 employees to check: John, Stacy, or Marsha"). If this really is a known list, then the simplest way is to check against this list of names with word boundaries:
regexp_extract(col('Notes'), r'\b(John|Stacy|Marsha)\b', 1)
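For a longer roster, the alternation can be built from a list instead of typed by hand. The following is a minimal sketch, assuming the names are known up front; EMPLOYEES, PATTERN and with_employee are hypothetical names, and the PySpark import is kept inside the function so the pattern itself can be checked without a Spark installation.

```python
import re

# Hypothetical roster; replace with the real list of employees.
EMPLOYEES = ["John", "Stacy", "Marsha"]

# re.escape guards against names that contain regex metacharacters.
# Raw strings keep \b as a word boundary rather than a backspace character.
PATTERN = r"\b(" + "|".join(re.escape(name) for name in EMPLOYEES) + r")\b"


def with_employee(df, notes_col="Notes", out_col="Employee"):
    """Return df with out_col set to the first listed name found in notes_col."""
    # Imported lazily so the module loads even where PySpark is not installed.
    from pyspark.sql.functions import col, regexp_extract
    return df.withColumn(out_col, regexp_extract(col(notes_col), PATTERN, 1))


# Quick local check of the pattern with Python's re module.
print(PATTERN)
print(re.search(PATTERN, "Double Checked on 2/23/17 by Marsha").group(1))
```

With a dataframe such as the df built in the earlier answer, with_employee(df).show() would add the Employee column; notes that mention none of the listed names get an empty string.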
Answered by Avishek Bhattacharya
Something like this should work:
import org.apache.spark.sql.functions._
dataFrame.withColumn("Employee", substring_index(col("Notes"), "\t", 2))
In case you want to use a regex to extract the proper value, you need something like:
dataFrame.withColumn("Employee", regexp_extract(col("Notes"), "regex", <groupId>))