Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30074109/
Removing punctuation marks from text in Scala - Spark
Asked by Rozita
This is one sample of my data:
case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time)
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ().
I want to remove all punctuation marks except the dot (.) and also remove words of length <= 2. For example, my expected output is:
case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid .
This should be implemented in Scala. I've tried:
replaceAll( """\W\s""", "")
replaceAll(""""[^a-zA-Z\.]""", "")
but it doesn't work well. Can anybody help me?
Answer by Régis Jean-Gilles
Looking at the regex javadoc (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), we see that the character class for punctuation is \p{Punct} and that we can remove characters from a character class using something like [a-z&&[^def]]. From there it is easy to define a regex that removes all punctuation except the dot:
s.replaceAll("""[\p{Punct}&&[^.]]""", "")
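As a minimal sketch of that character-class intersection, the snippet below wraps the pattern in a helper and applies it to a fragment of the sample data. The object and method names (StripPunctuation, stripPunctuation) are made up for illustration and are not part of the answer.

```scala
object StripPunctuation {
  // Hypothetical helper: drops every \p{Punct} character except the dot.
  def stripPunctuation(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.]]""", "")

  def main(args: Array[String]): Unit = {
    val sample = "dock it! chance back base (xm3020) ."
    println(stripPunctuation(sample)) // dock it chance back base xm3020 .
  }
}
```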
Removing words with size <= 2 could be done like so:
s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")
Combining the two, this gives:
s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
Note how I added \s* to remove redundant spaces.
Also, you can see that the above regex entirely removes '$', because \p{Punct} counts it as a punctuation character (in Java this is the POSIX US-ASCII punctuation class; Unicode itself classifies '$' as a currency symbol).
If that is undesirable (as your expected output seems to indicate), please be more precise about what you consider punctuation.
For example, you might want to consider only the following characters as punctuation: ?.!:()
s.replaceAll("""([?.!:]|\b\p{IsLetter}{1,2}\b)\s*""", "")
Alternatively, you could just add '$' to your "not-punctuation" character-list, along with the dot:
s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
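One thing that may be worth checking on your own data: the combined pattern runs in a single pass, so word boundaries are evaluated against the original text. A letter left over after an apostrophe (the "t" in "won't") then counts as a short word, and the trailing \s* also swallows the space after a comma. The sketch below compares that single pass with a two-pass variant (punctuation first, then short words). The two-pass variant and the names onePass and twoPass are illustrative assumptions, not part of the answer.

```scala
object CombinedVsTwoPass {
  // Single pass, using the combined pattern from the answer.
  def onePass(s: String): String =
    s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

  // Two passes: strip punctuation first, then short words plus trailing whitespace.
  def twoPass(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.]]""", "")
      .replaceAll("""\b\p{IsLetter}{1,2}\b\s*""", "")

  def main(args: Array[String]): Unit = {
    val sample = "especially it's purse, won't long . dock it! chance"
    println(onePass(sample)) // especially pursewonlong . dock chance
    println(twoPass(sample)) // especially its purse wont long . dock chance
  }
}
```

Which behaviour is preferable depends on whether contractions such as "it's" should survive as "its" or be dropped entirely.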
Answer by Stefan Sigurdsson
How about this:
replaceAll("(\\(|\\)|'|/)", "")
Then you just add more punctuation to remove using |, making sure to escape characters like ( and ) with double backslashes.
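If escaping by hand gets error-prone, one option is to let java.util.regex.Pattern.quote do it. The sketch below builds the alternation from a list of characters. The list contents and the names RemoveListedChars, toRemove, and clean are illustrative assumptions.

```scala
import java.util.regex.Pattern

object RemoveListedChars {
  // Hypothetical list of characters to strip; extend as needed.
  val toRemove: Seq[String] = Seq("(", ")", "'", "/")

  // Pattern.quote escapes each entry, so no manual backslashes are needed.
  val pattern: String = toRemove.map(s => Pattern.quote(s)).mkString("|")

  def clean(s: String): String = s.replaceAll(pattern, "")

  def main(args: Array[String]): Unit = {
    println(clean("doors (especially battery/memory card door) it's"))
    // doors especially batterymemory card door its
  }
}
```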
Answer by Duzzz
You can try filtering the string like this:
val example = "Hey there! It's me, myself and I."
example.filterNot(x => x == ',' || x == '!' || x == 'm')
res3: String = Hey there It's e yself and I.
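The same idea can be written without chaining comparisons. A Set[Char] is itself a Char => Boolean function, so it can be passed straight to filterNot. Alternatively, you can keep only the characters you want. This is a sketch with made-up names (FilterChars, unwanted, dropListed, keepWordsAndDots), and the character choices are assumptions:

```scala
object FilterChars {
  // A Set[Char] can be used directly as the predicate for filterNot.
  val unwanted: Set[Char] = Set(',', '!', '\'')

  def dropListed(s: String): String = s.filterNot(unwanted)

  // Alternative: keep only letters, digits, whitespace and the dot.
  def keepWordsAndDots(s: String): String =
    s.filter(c => c.isLetterOrDigit || c.isWhitespace || c == '.')

  def main(args: Array[String]): Unit = {
    val example = "Hey there! It's me, myself and I."
    println(dropListed(example))       // Hey there Its me myself and I.
    println(keepWordsAndDots(example)) // Hey there Its me myself and I.
  }
}
```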
Answer by tuxdna
Try this, it should work:
val str = """
|case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time)
|xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ().
""".stripMargin('|')
println(str)
val pat = """[^\w\s\.$]"""
val pat2 = """\s\w{2}\s"""
println(str.replaceAll(pat, "").replaceAll(pat2, ""))
OUTPUT:
case time especially its purse read manual care follow care instructions make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dockchance back base xm3020 . traveling bag connect laptop extra speaker . amount paid .
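In the output above, "dock it! chance" ends up as "dockchance". That happens because \s\w{2}\s consumes the whitespace on both sides of the word, and it only matches words of exactly two characters. A possible variation, sketched under the assumption that each line is processed separately, is to split on whitespace and filter the tokens by length. The names TokenFilter and clean are hypothetical.

```scala
object TokenFilter {
  // Strip everything except word characters, whitespace, the dot and '$',
  // then drop tokens of length <= 2 while keeping a lone ".".
  def clean(line: String): String =
    line.replaceAll("""[^\w\s.$]""", "")
      .trim
      .split("""\s+""")
      .filter(t => t == "." || t.length > 2)
      .mkString(" ")

  def main(args: Array[String]): Unit = {
    println(clean("speaker dock it! chance back base (xm3020) . amount paid ()."))
    // speaker dock chance back base xm3020 . amount paid .
  }
}
```

Because the tokens are re-joined with single spaces, this also normalizes runs of whitespace. Apply it line by line if line breaks need to be preserved.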

