Original question: http://stackoverflow.com/questions/11927121/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
R convert string to vector tokenize using " "
Asked by screechOwl
I have a string:
string1 <- "This is my string"
I would like to convert it to a vector that looks like this:
vector1
"This"
"is"
"my"
"string"
How do I do this? I know I could use the tm package to convert it to a TermDocumentMatrix and then convert that to a matrix, but that would alphabetize the words, and I need them to stay in the same order.
Answer by Dason
You can use strsplit to accomplish this task.
string1 <- "This is my string"
strsplit(string1, " ")[[1]]
#[1] "This" "is" "my" "string"
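The [[1]] is needed because strsplit is vectorized and returns a list with one character vector per input string. A minimal sketch of that behavior follows; the second input string is made up for illustration and is not part of the original question.

```r
# strsplit() returns a list with one character vector per input string.
strings <- c("This is my string", "and another one")  # second string is hypothetical
pieces <- strsplit(strings, " ")

length(pieces)   # one list element per input string
pieces[[1]]      # tokens of the first string, in their original order
pieces[[2]]      # tokens of the second string

# fixed = TRUE treats the separator as a literal string, not a regular expression
tokens <- strsplit(strings[1], " ", fixed = TRUE)[[1]]
tokens
```

Because the tokens come back in the order they appear in the input, nothing is alphabetized.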
Answer by Sacha Epskamp
Slightly different from Dason's answer, but this will split on any amount of white space, including newlines:
string1 <- "This is my
string"
strsplit(string1, "\\s+")[[1]]
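To make the difference between the two separators concrete, here is a small sketch using a made-up input with irregular whitespace (the variable name messy and its contents are assumptions for illustration). It also shows one edge case: leading whitespace produces an empty first token unless the string is trimmed first.

```r
messy <- "  This   is my\n\tstring "   # hypothetical input with irregular whitespace

# Splitting on a single space leaves empty strings wherever spaces repeat
strsplit(messy, " ")[[1]]

# Splitting on a run of whitespace handles repeats, tabs and newlines,
# but leading whitespace still yields one empty first token
untrimmed <- strsplit(messy, "\\s+")[[1]]
untrimmed

# Trimming first avoids that empty token
tokens <- strsplit(trimws(messy), "\\s+")[[1]]
tokens
```

Note the doubled backslash: inside an R string literal, the regular expression \s+ has to be written as "\\s+".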
Answer by Shiqing Fan
As a supplement, we can also use unlist() to produce a vector from a given list structure:
string1 <- "This is my string"
unlist(strsplit(string1, "\\s+")) # strsplit returns a list; unlist flattens it into a vector
#[1] "This" "is" "my" "string"
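For a single input string, unlist() and [[1]] give the same result. They differ once there is more than one string, as this sketch suggests; the second element of strings is hypothetical and only there to show the contrast.

```r
strings <- c("This is my string", "and another one")  # second element is hypothetical

# [[1]] keeps only the tokens of the first input string
first_only <- strsplit(strings, "\\s+")[[1]]

# unlist() concatenates the tokens of every input string, in order
all_tokens <- unlist(strsplit(strings, "\\s+"))

length(first_only)
length(all_tokens)
```

Which one fits depends on whether you want one vector per string or a single flat vector of all tokens.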
Answer by Rich Scriven
If you're simply extracting words by splitting on the spaces, here are a couple of nice alternatives.
string1 <- "This is my string"
scan(text = string1, what = "")
# [1] "This" "is" "my" "string"
library(stringi)
stri_split_fixed(string1, " ")[[1]]
# [1] "This" "is" "my" "string"
stri_extract_all_words(string1, simplify = TRUE)
#      [,1]   [,2] [,3] [,4]
# [1,] "This" "is" "my" "string"
stri_split_boundaries(string1, simplify = TRUE)
#      [,1]    [,2]  [,3]  [,4]
# [1,] "This " "is " "my " "string"
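A couple of practical notes on the scan() variant, sketched below using base R only. By default scan() prints a "Read N items" message and treats quote characters specially, so quiet = TRUE and quote = "" can be useful; the second input string is an assumption added to show the apostrophe case.

```r
string1 <- "This is my string"

# quiet = TRUE suppresses the "Read 4 items" message
tokens <- scan(text = string1, what = character(), quiet = TRUE)
tokens

# quote = "" turns off quote handling, so an apostrophe is kept as ordinary text
with_apostrophe <- scan(text = "it's my string", what = character(),
                        quote = "", quiet = TRUE)  # input string is hypothetical
with_apostrophe
```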
Answer by russellpierce
Try:
library(tm)
library("RWeka")
library(RWekajars)
NGramTokenizer(string1, Weka_control(min = 1, max = 1)) # string1 as defined in the question
This is an over-engineered solution for your problem; strsplit with Sacha's approach is generally just fine.