Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/8920145/

Time: 2020-09-09 01:21:26  Source: igfitidea

Count the number of all words in a string

Tags: r, string, word-count

Asked by John

Is there a function to count the number of words in a string? For example:

str1 <- "How many words are in this sentence"

to return a result of 7.

Accepted answer by AVSuresh

You can use the strsplit and sapply functions:

sapply(strsplit(str1, " "), length)
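
The one-liner above can be tried as follows. This is a minimal sketch; the inputs other than str1 are assumptions added to show how the approach behaves on a vector and on consecutive spaces.

```r
str1 <- "How many words are in this sentence"
n <- sapply(strsplit(str1, " "), length)
n
# [1] 7

# strsplit() is vectorised, so a character vector gives one count per element
per_element <- sapply(strsplit(c("one two", "three four five"), " "), length)
per_element
# [1] 2 3

# Assumed edge case: consecutive spaces yield empty strings, which get counted
pieces <- strsplit("How  many", " ")[[1]]
pieces
# [1] "How"  ""     "many"
```

Splitting on a single literal space is the simplest option, but it only gives the expected count when words are separated by exactly one space.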

Answer by Martin Morgan

Use the regular expression symbol \\W to match non-word characters, using + to indicate one or more in a row, along with gregexpr to find all matches in a string. The number of words is the number of word separators plus 1.

lengths(gregexpr("\\W+", str1)) + 1

This will fail with blank strings at the beginning or end of the character vector, when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, \\S+, [[:alpha:]], etc., but there will always be edge cases with a regex approach), and so on. It is likely more efficient than strsplit solutions, which allocate memory for each word. Regular expressions are described in ?regex.

Update: As noted in the comments and in a different answer by @Andri, the approach fails with zero- and one-word strings, and with trailing punctuation:

str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3

Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think the caveat about the 'notion of one word' in my original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So, taking a 'positive' approach to finding words, one might use

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))

leading to:

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3

Again the regular expression might be refined for different notions of 'word'.
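
Why sum(x > 0) is used instead of length() may not be obvious, so here is a small sketch of what gregexpr returns when nothing matches. It only uses base R; the variable names m and n_words are made up for illustration.

```r
str1 <- c("", "x", "x y", "x y!", "x y! z")

# With no match, gregexpr() returns -1 (a vector of length 1, not length 0)
m <- gregexpr("[[:alpha:]]+", "")[[1]]
as.integer(m)
# [1] -1

# So counting elements treats the "no match" marker as one hit ...
lengths(gregexpr("[[:alpha:]]+", str1))
# [1] 1 1 2 2 3

# ... while sum(x > 0) counts only real match positions
n_words <- sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
n_words
# [1] 0 1 2 2 3
```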

I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like @user813966, but with a regular expression to delimit words) and making use of the original notion of delimiting words is

lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3

This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.
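
One edge case of the strsplit() variant is worth seeing directly: a separator at the start of a string leaves an empty first piece. The sketch below shows this, plus one possible workaround; count_nonempty is a hypothetical helper name, not something from the answer.

```r
# A leading separator leaves an empty first element, which is counted as a word
lead <- strsplit(" x y", "\\W+")[[1]]
lead
# [1] ""  "x" "y"
length(lead)
# [1] 3

# A trailing separator does not add an empty element
trail <- strsplit("x y ", "\\W+")[[1]]
length(trail)
# [1] 2

# One possible workaround (an assumption, not the author's code):
# drop empty pieces before counting
count_nonempty <- function(s) {
  vapply(strsplit(s, "\\W+"), function(w) sum(nzchar(w)), integer(1))
}
count_nonempty(c("", " x y", "x y! z"))
# [1] 0 2 3
```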

Answer by petermeissner

The simplest way would be:

require(stringr)
str_count("one,   two three 4,,,, 5 6", "\\S+")

... counting all sequences of non-space characters (\\S+).

But what about a little function that also lets us decide which kind of words we would like to count, and which works on whole vectors as well?

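require(stringr)
nwords <- function(string, pseudo = FALSE) {
  # pseudo = TRUE counts every run of non-space characters;
  # pseudo = FALSE counts only runs of alphabetic characters
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}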

nwords("one,   two three 4,,,, 5 6")
# 3

nwords("one,   two three 4,,,, 5 6", pseudo=T)
# 6
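
If stringr is not available, roughly the same function can be sketched in base R. This is an assumption-laden variant rather than the answer's code: the name nwords_base is made up, and it relies on regmatches() returning an empty vector when nothing matches.

```r
# Hypothetical base-R variant of nwords(); the name nwords_base is made up here
nwords_base <- function(string, pseudo = FALSE) {
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  # regmatches() returns character(0) when nothing matches, so lengths() gives 0
  lengths(regmatches(string, gregexpr(pattern, string)))
}

x <- c("one,   two three 4,,,, 5 6", "", "x y!")
nwords_base(x)
# [1] 3 0 2
nwords_base(x, pseudo = TRUE)
# [1] 6 0 2
```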

Answer by arekolek

I use the str_count function from the stringr library with the escape sequence \w, which represents:

any 'word' character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered)

Example:

> str_count("How many words are in this sentence", '\\w+')
[1] 7
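
To see what \w+ treats as a word, here is a base-R sketch using the same pattern. The helper name count_w is hypothetical, and base R's regex engine is used instead of stringr, so results may differ from str_count for non-ASCII text.

```r
# Hypothetical helper: count runs of word characters with base R only
count_w <- function(s) lengths(regmatches(s, gregexpr("\\w+", s)))

count_w("How many words are in this sentence")
# [1] 7

# Digits and underscores are word characters, so "foo_bar" is one word
count_w("foo_bar 42")
# [1] 2

# An apostrophe is not a word character, so "don't" is split in two
count_w("don't stop")
# [1] 3
```

Whether splitting on apostrophes is acceptable depends on the notion of "word" that is needed.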


Of all other 9 answers that I was able to test, only two (by Vincent Zoonekynd, and by petermeissner) worked for all inputs presented here so far, but they also require stringr.

But only this solution works with all inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".

Benchmark:

library(stringr)

questions <-
  c(
    "", "x", "x y", "x y!", "x y! z",
    "foo+bar+baz~spam+eggs",
    "one,   two three 4,,,, 5 6",
    "How many words are in this sentence",
    "How  many words    are in this   sentence",
    "Combien de mots sont dans cette phrase ?",
    "
    Day after day, day after day,
    We stuck, nor breath nor motion;
    "
  )

answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)

score <- function(f) sum(unlist(lapply(questions, f)) == answers)

funs <-
  c(
    function(s) sapply(gregexpr("\\W+", s), length) + 1,
    function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
    function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
    function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
    function(s) length(str_match_all(s, "\\S+")[[1]]),
    function(s) str_count(s, "\\S+"),
    function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
    function(s) length(unlist(strsplit(s," "))),
    function(s) sapply(strsplit(s, " "), length),
    function(s) str_count(s, '\\w+')
  )

unlist(lapply(funs, score))

Output:

6 10 10  8  9  9  7  6  6 11
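
The scores alone do not show which inputs trip each function. A per-input breakdown could be sketched as below for two of the base-R candidates; failures, split_on_space and alpha_runs are hypothetical names, and the multi-line input is written with explicit \n escapes on the assumption that it matches the string used in the benchmark.

```r
questions <- c(
  "", "x", "x y", "x y!", "x y! z",
  "foo+bar+baz~spam+eggs",
  "one,   two three 4,,,, 5 6",
  "How many words are in this sentence",
  "How  many words    are in this   sentence",
  "Combien de mots sont dans cette phrase ?",
  "\n    Day after day, day after day,\n    We stuck, nor breath nor motion;\n    "
)
answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)

# Hypothetical helper: indices of the inputs a function gets wrong
failures <- function(f) which(unlist(lapply(questions, f)) != answers)

split_on_space <- function(s) sapply(strsplit(s, " "), length)
alpha_runs <- function(s) {
  sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0))
}

failures(split_on_space)
# [1]  6  7  9 10 11
failures(alpha_runs)
# [1] 7
```

Under these assumptions, the alphabetic-run approach only misses the input where digits are meant to count as words.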

Answer by mathematical.coffee

str2 <- gsub(' {2,}',' ',str1)
length(strsplit(str2,' ')[[1]])

The gsub(' {2,}',' ',str1) makes sure all words are separated by one space only, by replacing all occurrences of two or more spaces with one space.

The strsplit(str2,' ') splits the sentence at every space and returns the result in a list. The [[1]] grabs the vector of words out of that list. The length counts how many words there are.

> str1 <- "How many words are in this     sentence"
> str2 <- gsub(' {2,}',' ',str1)
> str2
[1] "How many words are in this sentence"
> strsplit(str2,' ')
[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> strsplit(str2,' ')[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> length(strsplit(str2,' ')[[1]])
[1] 7
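
The gsub step handles repeated spaces inside the sentence but not leading or trailing ones. A possible variant, sketched here under that assumption, trims the string and splits on any run of whitespace; count_ws_words is a hypothetical name.

```r
# Hypothetical helper, not part of the answer above
count_ws_words <- function(s) {
  # trimws() strips leading/trailing whitespace;
  # "\\s+" then splits on any run of whitespace
  lengths(strsplit(trimws(s), "\\s+"))
}

count_ws_words("  How many   words are in this     sentence ")
# [1] 7
count_ws_words(c("", "x", " x  y "))
# [1] 0 1 2
```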

Answer by Vincent Zoonekynd

You can use str_match_all, with a regular expression that would identify your words. The following works with initial, final and duplicated spaces.

library(stringr)
s <-  "
  Day after day, day after day,
  We stuck, nor breath nor motion;
"
m <- str_match_all( s, "\\S+" )  # Sequences of non-spaces
length(m[[1]])

Answer by bartektartanus

Try this function from the stringi package:

> require(stringi)
> s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
+        "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
+        "Cras vel lorem. Etiam pellentesque aliquet tellus.",
+        "")
> stri_stats_latex(s)
    CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs 
          133             0            30            24             0             0 

Answer by yuqian

You can use the wc function from the qdap library:

> str1 <- "How many words are in this sentence"
> wc(str1)
[1] 7

Answer by Murali Menon

You can remove double spaces, count the number of " " in the string, and add 1 to get the word count. Use stringr and rm_white {qdapRegex}:

str_count(rm_white(s), " ") +1

Answer by Sangram

Try this

length(unlist(strsplit(str1," ")))
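
Note that unlist() flattens the result, so for a character vector this gives one total rather than a count per element. A small sketch with an assumed two-element input:

```r
v <- c("one two", "three four five")

# unlist() merges all pieces, so the result is a single total
total <- length(unlist(strsplit(v, " ")))
total
# [1] 5

# lengths() keeps one count per input string
per_string <- lengths(strsplit(v, " "))
per_string
# [1] 2 3
```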