Notice: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/11927121/

Time: 2020-09-09 01:33:50  Source: igfitidea

R convert string to vector tokenize using " "

Tags: string, r, vector

Asked by screechOwl

I have a string:

string1 <- "This is my string"

I would like to convert it to a vector that looks like this:

vector1
"This"
"is"
"my"
"string"

How do I do this? I know I could use the tm package to convert it to a TermDocumentMatrix and then convert that to a matrix, but that would alphabetize the words, and I need them to stay in the same order.

Answered by Dason

You can use strsplit to accomplish this task.

string1 <- "This is my string"
strsplit(string1, " ")[[1]]
#[1] "This"   "is"     "my"     "string"
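
One thing worth noting about splitting on a single literal space: runs of spaces in the input leave empty strings in the result. The following is a minimal sketch of that behavior and one way to drop the empties; the names `spaced` and `tokens` are illustrative and not from the original answer.

```r
# Input with two consecutive spaces between "This" and "is"
spaced <- "This  is my string"

# Every single space is a separator, so the double space yields an empty token
tokens <- strsplit(spaced, " ")[[1]]
tokens
#[1] "This"   ""       "is"     "my"     "string"

# Keep only the non-empty tokens; the original order is preserved
tokens[nzchar(tokens)]
#[1] "This"   "is"     "my"     "string"
```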

Answered by Sacha Epskamp

Slightly different from Dason's answer: this splits on any amount of whitespace, including newlines:

string1 <- "This   is my
string"
strsplit(string1, "\\s+")[[1]]  # the backslash must be doubled inside an R string
#[1] "This"   "is"     "my"     "string"
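
A caveat when splitting on a whitespace pattern: leading whitespace in the input produces an empty first element, while trailing whitespace does not add one. A small sketch, assuming base R's `trimws()` (available since R 3.2.0) is acceptable for stripping both ends first; `padded` and `words` are illustrative names.

```r
# Input padded with whitespace at both ends
padded <- "  This   is my\nstring  "

# The leading whitespace matches the separator, so the first token is empty
strsplit(padded, "\\s+")[[1]]
#[1] ""       "This"   "is"     "my"     "string"

# Trim both ends first, then split on runs of whitespace
words <- strsplit(trimws(padded), "\\s+")[[1]]
words
#[1] "This"   "is"     "my"     "string"
```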

Answered by Shiqing Fan

As a supplement, we can also use unlist() to produce a vector from the list that strsplit returns:

string1 <- "This is my string"
unlist(strsplit(string1, "\\s+"))  # strsplit returns a list; unlist flattens it into a character vector
#[1] "This"   "is"     "my"     "string"
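
The difference between `[[1]]` and `unlist()` only shows up when more than one string is split at once. The sketch below illustrates this with a hypothetical two-element input (`strings` and `parts` are names made up for the example).

```r
# Two input strings instead of one
strings <- c("This is", "my string")

# strsplit returns a list with one character vector per input string
parts <- strsplit(strings, "\\s+")
length(parts)
#[1] 2

# [[1]] keeps only the tokens of the first string
parts[[1]]
#[1] "This" "is"

# unlist() flattens every element into a single character vector
unlist(parts)
#[1] "This"   "is"     "my"     "string"
```

For a single input string the two approaches give the same result, so the choice there is mostly a matter of taste.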

Answered by Rich Scriven

If you're simply extracting words by splitting on the spaces, here are a couple of nice alternatives.

string1 <- "This is my string"

scan(text = string1, what = "")
# [1] "This"   "is"     "my"     "string"

library(stringi)
stri_split_fixed(string1, " ")[[1]]
# [1] "This"   "is"     "my"     "string"
stri_extract_all_words(string1, simplify = TRUE)
#      [,1]   [,2] [,3] [,4]    
# [1,] "This" "is" "my" "string"
stri_split_boundaries(string1, simplify = TRUE)
#      [,1]    [,2]  [,3]  [,4]    
# [1,] "This " "is " "my " "string" 
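
When run interactively, `scan()` also prints a "Read N items" message. A minimal sketch, assuming you want to silence that with `quiet = TRUE` and check the result against a plain `strsplit()`; the names `words_scan` and `words_split` are illustrative.

```r
string1 <- "This is my string"

# quiet = TRUE suppresses the "Read 4 items" message
words_scan <- scan(text = string1, what = "", quiet = TRUE)

# Base-R split on runs of whitespace, for comparison
words_split <- strsplit(string1, "\\s+")[[1]]

identical(words_scan, words_split)
#[1] TRUE
```

Note that `scan()` treats quote characters specially by default, so text containing apostrophes or double quotes may need `quote = ""` to be tokenized as plain words.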

Answered by russellpierce

Try:

library(tm)
library("RWeka")
library(RWekajars)
NGramTokenizer(string1, Weka_control(min = 1, max = 1))  # string1 is the string from the question

This is an over-engineered solution for your problem. Using strsplit with Sacha's approach is generally just fine.