Original question: http://stackoverflow.com/questions/8920145/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Count the number of all words in a string
Asked by John
Is there a function to count the number of words in a string? For example:
str1 <- "How many words are in this sentence"
to return a result of 7.
Accepted answer by AVSuresh
You can use the strsplit and sapply functions:
sapply(strsplit(str1, " "), length)
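As a quick check of the line above, here is a small sketch of how splitting on a single literal space behaves when spaces repeat or lead the string. The inputs are made up for illustration and are not from the original question. Each extra space produces an empty string, which is then counted as a word.

```r
# Illustrative inputs (assumptions, not from the original question)
x <- c("How many words are in this sentence",  # single spaces
       "a  b",                                   # two spaces between the words
       " a")                                     # one leading space

# Split each element on one literal space and count the pieces
counts <- sapply(strsplit(x, " "), length)
counts
# [1] 7 3 2
```

So the one-liner is fine for cleanly spaced input, but it overcounts as soon as whitespace is irregular.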
Answer by Martin Morgan
Use the regular expression symbol \\W to match non-word characters, using + to indicate one or more in a row, along with gregexpr to find all matches in a string. The number of words is the number of word separators plus 1.
lengths(gregexpr("\\W+", str1)) + 1
This will fail with leading or trailing blanks in a string, or when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, \\S+, [[:alpha:]], etc., but there will always be edge cases with a regex approach). It is likely more efficient than strsplit solutions, which will allocate memory for each word. Regular expressions are described in ?regex.
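To make the leading/trailing blank caveat concrete, here is a minimal sketch with made-up inputs (the vector x is an assumption for illustration). A blank at either end is one more run of separators, so "separators plus 1" comes out one too high.

```r
# Illustrative inputs: all three contain exactly two words
x <- c("a b", " a b", "a b ")

# Count runs of non-word characters and add 1
sep_counts <- lengths(gregexpr("\\W+", x)) + 1
sep_counts
# [1] 2 3 3
```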
Update: As noted in the comments and in a different answer by @Andri, the approach fails with zero- and one-word strings, and with trailing punctuation:
str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3
Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think the caveat about 'notion of one word' in my original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So, taking a 'positive' approach to finding words, one might use:
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
Leading to
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3
Again the regular expression might be refined for different notions of 'word'.
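As one possible refinement, the sketch below wraps the 'positive' approach in a helper and tries a pattern that also accepts digits and apostrophes. The helper name count_matches, the second pattern, and the sample string are all assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper: number of matches of `pattern` in each element of `x`
count_matches <- function(x, pattern) {
  # gregexpr() returns -1 when there is no match, so count positive positions only
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

s <- "it's 9 o'clock"
count_matches(s, "[[:alpha:]]+")   # letters only: it, s, o, clock -> 4
count_matches(s, "[[:alnum:]']+")  # letters, digits, apostrophes: it's, 9, o'clock -> 3
```

Which count is "right" depends on what should be treated as a word in the data at hand.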
I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like @user813966, but with a regular expression to delimit words) and making use of the original notion of delimiting words is
lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3
This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.
Answer by petermeissner
The simplest way would be:
require(stringr)
str_count("one, two three 4,,,, 5 6", "\\S+")
... counting all sequences of non-space characters (\\S+).
But what about a little function that also lets us decide which kind of words we would like to count, and which works on whole vectors as well?
require(stringr)
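nwords <- function(string, pseudo = FALSE) {
  # pseudo = TRUE counts any run of non-space characters;
  # otherwise only runs of alphabetic characters count as words
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}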
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6
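To see why the two modes give 3 and 6, it can help to look at the tokens each pattern actually matches. The sketch below uses base R's regmatches() and gregexpr() rather than stringr (my substitution, so that it runs without extra packages); the variable names are made up for illustration.

```r
s <- "one, two three 4,,,, 5 6"

# Runs of non-space characters: punctuation and digits are included
tokens_nonspace <- regmatches(s, gregexpr("\\S+", s))[[1]]
tokens_nonspace
# [1] "one,"  "two"   "three" "4,,,," "5"     "6"

# Runs of alphabetic characters only
tokens_alpha <- regmatches(s, gregexpr("[[:alpha:]]+", s))[[1]]
tokens_alpha
# [1] "one"   "two"   "three"
```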
Answer by arekolek
I use the str_count function from the stringr library with the escape sequence \w, which represents:
any 'word' character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered)
Example:
> str_count("How many words are in this sentence", '\\w+')
[1] 7
Of the 9 other answers that I was able to test, only two (by Vincent Zoonekynd and by petermeissner) worked for all inputs presented here so far, but they also require stringr.
But only this solution works with all inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".
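The difference on those two inputs can be reproduced without stringr. The sketch below is a base R stand-in for str_count (the helper name count_runs is hypothetical, and I am assuming base R's regex engine treats these ASCII-only inputs the same way stringr does).

```r
# Hypothetical base R stand-in for str_count(): matches of `pattern` per element
count_runs <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

x <- c("foo+bar+baz~spam+eggs",
       "Combien de mots sont dans cette phrase ?")

count_runs(x, "\\w+")  # runs of word characters
# [1] 5 7
count_runs(x, "\\S+")  # runs of non-space characters
# [1] 1 8
```

With \S+ the first string is one unbroken token and the free-standing "?" in the second counts as a word, which is why the \w+ version scores better on these inputs.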
Benchmark:
library(stringr)
questions <-
c(
"", "x", "x y", "x y!", "x y! z",
"foo+bar+baz~spam+eggs",
"one, two three 4,,,, 5 6",
"How many words are in this sentence",
"How many words are in this sentence",
"Combien de mots sont dans cette phrase ?",
"
Day after day, day after day,
We stuck, nor breath nor motion;
"
)
answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)
score <- function(f) sum(unlist(lapply(questions, f)) == answers)
funs <-
c(
function(s) sapply(gregexpr("\\W+", s), length) + 1,
function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
function(s) length(str_match_all(s, "\\S+")[[1]]),
function(s) str_count(s, "\\S+"),
function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
function(s) length(unlist(strsplit(s," "))),
function(s) sapply(strsplit(s, " "), length),
function(s) str_count(s, '\\w+')
)
unlist(lapply(funs, score))
Output:
6 10 10 8 9 9 7 6 6 11
Answer by mathematical.coffee
str2 <- gsub(' {2,}',' ',str1)
length(strsplit(str2,' ')[[1]])
The gsub(' {2,}',' ',str1) makes sure all words are separated by one space only, by replacing all occurrences of two or more spaces with one space.
The strsplit(str,' ') splits the sentence at every space and returns the result in a list. The [[1]] grabs the vector of words out of that list. The length counts how many words there are.
> str1 <- "How many words are in this sentence"
> str2 <- gsub(' {2,}',' ',str1)
> str2
[1] "How many words are in this sentence"
> strsplit(str2,' ')
[[1]]
[1] "How" "many" "words" "are" "in" "this" "sentence"
> strsplit(str2,' ')[[1]]
[1] "How" "many" "words" "are" "in" "this" "sentence"
> length(strsplit(str2,' ')[[1]])
[1] 7
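One case the steps above do not cover is whitespace at the very start of the string: after squeezing, a single leading space remains and strsplit() turns it into an empty first element. A minimal sketch of one way around that, assuming trimws() (base R 3.2.0 or later) is available; the function name count_words_trimmed and the input are made up for illustration.

```r
# Hypothetical variant: squeeze repeated spaces, then trim both ends before splitting
count_words_trimmed <- function(s) {
  squeezed <- gsub(" {2,}", " ", s)
  length(strsplit(trimws(squeezed), " ")[[1]])
}

s <- "  How  many words "

# Without trimming, the leading space yields an empty first "word"
untrimmed <- length(strsplit(gsub(" {2,}", " ", s), " ")[[1]])
untrimmed
# [1] 4

count_words_trimmed(s)
# [1] 3
```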
Answer by Vincent Zoonekynd
You can use str_match_all, with a regular expression that would identify your words.
The following works with initial, final and duplicated spaces.
library(stringr)
s <- "
Day after day, day after day,
We stuck, nor breath nor motion;
"
m <- str_match_all( s, "\\S+" ) # Sequences of non-spaces
length(m[[1]])
Answer by bartektartanus
Try this function from the stringi package:
require(stringi)
> s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
+ "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
+ "Cras vel lorem. Etiam pellentesque aliquet tellus.",
+ "")
> stri_stats_latex(s)
    CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs
          133             0            30            24             0             0
Answer by yuqian
You can use the wc function in the qdap library:
> str1 <- "How many words are in this sentence"
> wc(str1)
[1] 7
Answer by Murali Menon
You can remove double spaces and count the number of " " in the string to get the count of words. Use stringr and rm_white {qdapRegex}:
str_count(rm_white(s), " ") +1
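The same idea can be sketched in base R, for readers without stringr or qdapRegex installed. This is an approximation under my own assumptions: gsub() plus trimws() stand in for rm_white(), and the helper name count_by_spaces is hypothetical.

```r
# Hypothetical base R version: collapse whitespace, trim, then count spaces + 1
count_by_spaces <- function(s) {
  squished <- trimws(gsub("\\s+", " ", s))
  lengths(regmatches(squished, gregexpr(" ", squished, fixed = TRUE))) + 1
}

space_counts <- count_by_spaces(c("How  many   words", "one", ""))
space_counts
# [1] 3 1 1
```

Note the last result: because of the "+ 1", an empty string is reported as one word, so zero-word input needs a separate check with this approach.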
Answer by Sangram
Try this:
length(unlist(strsplit(str1," ")))
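One thing to watch with this form: unlist() pools the words of every element, so on a vector it returns a single total rather than one count per string. A small sketch with made-up input; lengths() (base R 3.2.0 or later) is shown as a per-element alternative.

```r
x <- c("a b", "c d e")  # illustrative input, not from the question

# Pools all words from both strings into one vector before counting
total <- length(unlist(strsplit(x, " ")))
total
# [1] 5

# One count per input string
per_element <- lengths(strsplit(x, " "))
per_element
# [1] 2 3
```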