Original source: http://stackoverflow.com/questions/5302669/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
In R, how do I replace a string that contains a certain pattern with another string?
Asked by Alan
I'm working on a project that involves cleaning a list of data on college majors. Many of the entries are misspelled, so I was looking to use the function gsub() to replace the misspelled ones with the correct spelling. For example, say 'biolgy' is misspelled in a list of majors called Major. How can I get R to detect the misspelling and replace it with the correct spelling? I've tried gsub('biol', 'Biology', Major), but that only replaces the first four letters in 'biolgy'. If I do gsub('biolgy', 'Biology', Major), it works for that case alone, but it doesn't detect other misspellings of 'biology'.
Thank you!
Answered by aL3xa
You should either define some nifty regular expression or use agrep from the base package. The stringr package is another option. I know that people use it, but I'm a huge fan of regular expressions, so it's a no-no for me.
Anyway, agrep should do the trick:
agrep("biol", "biology")
[1] 1
agrep("biolgy", "biology")
[1] 1
EDIT:
You should also use ignore.case = TRUE, but be prepared to do some bookkeeping "by hand"...
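As a rough illustration of how agrep with ignore.case = TRUE could be applied to the vector from the question, here is a minimal sketch. The sample contents of Major are made up for the example, and the pattern is the correct spelling matched against the data (the reverse of the calls above). Printing the hits before overwriting them is one way to do the "by hand" bookkeeping.

```r
# Hypothetical sample data; `Major` is the vector name used in the question.
Major <- c("biolgy", "Physics", "Biology", "chemistry")

# Indices of the entries that approximately contain "biology".
# The default max.distance = 0.1 should allow one edit for a 7-letter pattern.
idx <- agrep("biology", Major, ignore.case = TRUE)

# Look at what matched before changing anything.
print(Major[idx])

# Overwrite the matched entries with the canonical spelling.
Major[idx] <- "Biology"
print(Major)
```

A looser max.distance catches more misspellings but also raises the risk of false matches, so checking the hits first is worthwhile.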
Answered by Spacedman
You can set up a vector of all the possible misspellings and then do a loop over a gsub call. Something like:
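# Known misspellings of "Biology"; extend the vector as new ones turn up
biologySp <- c("biolgy", "biologee", "bologee", "bugs")
# Replace each misspelling in turn
for (sp in biologySp) {
  Major <- gsub(sp, "Biology", Major)
}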
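Because gsub replaces matching substrings, the loop above would also rewrite part of a longer entry that merely contains one of the misspellings. A possible variation, sketched here with made-up sample data, collapses the vector into one anchored regular expression so that only whole entries are replaced. It assumes the misspellings contain no regex metacharacters.

```r
# Hypothetical sample data for illustration.
Major <- c("biolgy", "bugs", "Physics", "bologee studies")
biologySp <- c("biolgy", "biologee", "bologee", "bugs")

# Build "^(biolgy|biologee|bologee|bugs)$": ^ and $ force a whole-entry match.
pattern <- paste0("^(", paste(biologySp, collapse = "|"), ")$")

# One vectorised call replaces the explicit loop.
Major <- sub(pattern, "Biology", Major, ignore.case = TRUE)
print(Major)
```

The last entry is left untouched because it only contains a misspelling rather than being one.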
If you want to do something smarter, see if there are any fuzzy matching packages on CRAN, or something that uses 'soundex' matching.
The Wikipedia page on approximate string matching might be useful. Also try searching R-help for some of the key terms.
Answered by Greg Snow
You could first match the majors against a list of available majors; any that do not match are the likely misspellings. Then use the agrep function to match these against the known majors again (agrep does approximate matching, so if an entry is similar to a correct value you will get a match).
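The two steps described here could look roughly like the sketch below. The vectors known and Major hold made-up sample values, and the rule of accepting a correction only when exactly one known major matches is an assumption added for safety, not part of the answer.

```r
# Hypothetical reference list and data.
known <- c("Biology", "Physics", "Geography")
Major <- c("Biology", "biolgy", "Physics", "physcs", "Geography")

cleaned <- Major

# Step 1: entries that do not appear in the reference list are suspects.
unmatched <- which(!(Major %in% known))

# Step 2: approximately match each suspect against every known major.
for (i in unmatched) {
  is_hit <- vapply(known, function(k) {
    agrepl(k, Major[i], ignore.case = TRUE)
  }, logical(1))
  hits <- known[is_hit]
  # Only correct when the match is unambiguous (an assumption of this sketch).
  if (length(hits) == 1) cleaned[i] <- hits
}
print(cleaned)
```

Entries with zero or several candidate matches are left unchanged so they can be reviewed manually.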
Answered by Spacedman
The vwr package has methods for string matching:
http://ftp.heanet.ie/mirrors/cran.r-project.org/web/packages/vwr/index.html
so your best bet might be to use the string with the minimum Levenshtein distance from the possible subject strings:
> levenshtein.distance("physcs",c("biology","physics","geography"))
  biology   physics geography
        7         1         9
If you get identical minima then flip a coin:
> levenshtein.distance("biolsics",c("biology","physics","geography"))
  biology   physics geography
        4         4         8
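If installing vwr is not an option, a similar nearest-match lookup can be sketched with adist from the utils package, which computes a generalized Levenshtein distance. This swaps in a different function from the one shown above, and the helper name nearest is hypothetical. Instead of flipping a coin, the sketch returns every candidate tied for the minimum so the caller can decide.

```r
subjects <- c("biology", "physics", "geography")

# Hypothetical helper: return the choice(s) with the smallest edit distance.
nearest <- function(x, choices) {
  d <- as.vector(utils::adist(x, choices))  # distance from x to each choice
  choices[d == min(d)]                      # more than one element on a tie
}

print(nearest("physcs", subjects))
print(nearest("biolsics", subjects))
```

Returning all tied candidates makes ambiguous cases visible rather than resolving them silently.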
Answered by cspoleta
Example 1a) perl/linux regex: 's/oldstring/newstring/'
Example 1b) R equivalent of 1a: srcstring <- sub(oldstring, newstring, srcstring)
Example 2a) perl/linux regex: 's/oldstring//'
Example 2b) R equivalent of 2a: srcstring <- sub(oldstring, "", srcstring)
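One detail worth spelling out: like s/old/new/ without the g flag, sub replaces only the first match in each string, while gsub replaces every match. A small sketch with a made-up input string:

```r
srcstring <- "foo boo"

# sub() replaces only the first match, like s/o/0/
first_only <- sub("o", "0", srcstring)

# gsub() replaces all matches, like s/o/0/g
all_hits <- gsub("o", "0", srcstring)

print(first_only)
print(all_hits)
```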