string: Count the number of words in a string

Question

Asked by John

Is there a function to count the number of words in a string? For example:

str1 <- "How many words are in this sentence"

to return a result of 7.

Answer 1

Accepted answer by AVSuresh

You can use the strsplit and sapply functions:

sapply(strsplit(str1, " "), length)
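
A quick check of how this behaves on a whole vector, and with repeated spaces, may be useful. The sketch below uses only base R; the input strings are made up for illustration and are not part of the answer.

```r
# Base R only; the input strings are illustrative, not from the answer.
x <- c("How many words are in this sentence", "two  spaces")

# strsplit() returns one character vector per input string,
# and sapply(..., length) counts the pieces in each.
counts <- sapply(strsplit(x, " "), length)
print(counts)

# Two consecutive spaces produce an empty piece between them,
# so the second string is split into "two", "" and "spaces".
print(strsplit("two  spaces", " ")[[1]])
```

So splitting on a single literal space over-counts whenever words are separated by more than one space.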

Answer 2

Answered by Martin Morgan

Use the regular expression symbol \\W to match non-word characters, using + to indicate one or more in a row, along with gregexpr to find all matches in a string. The number of words is the number of word separators plus 1.

lengths(gregexpr("\\W+", str1)) + 1

This will fail with blank strings at the beginning or end of the character vector, when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, such as \\S+ or [[:alpha:]], but there will always be edge cases with a regex approach), etc. It is likely more efficient than strsplit solutions, which will allocate memory for each word. Regular expressions are described in ?regex.

Update: As noted in the comments and in a different answer by @Andri, the approach fails with zero- and one-word strings, and with trailing punctuation:

str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3

Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think the caveat about the 'notion of one word' in my original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So, taking a 'positive' approach to finding words, one might use

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))

Leading to

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3

Again the regular expression might be refined for different notions of 'word'.

I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like @user813966, but with a regular expression to delimit words) and making use of the original notion of delimiting words is

lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3

This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.
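
The 'positive' gregexpr() approach above could be wrapped in a small reusable function. This is only a sketch in base R: the name count_words and its pattern argument are hypothetical, chosen here for illustration, and the default pattern is the [[:alpha:]]+ expression used above.

```r
# Hypothetical helper: the name count_words and its arguments are
# illustrative and do not come from the answer.
count_words <- function(x, pattern = "[[:alpha:]]+") {
  matches <- gregexpr(pattern, x)
  # gregexpr() reports -1 when there is no match, so only
  # positive match positions are counted as words.
  vapply(matches, function(m) sum(m > 0), integer(1))
}

str1 <- c("", "x", "x y", "x y!", "x y! z")
print(count_words(str1))
print(count_words(str1, pattern = "\\S+"))
```

Passing a different pattern is one way to try out another notion of 'word' without rewriting the counting logic.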

Answer 3

Answered by petermeissner

The simplest way would be:

library(stringr)
str_count("one,   two three 4,,,, 5 6", "\\S+")

... counting all sequences of non-space characters (\\S+).

But what about a little function that also lets us decide which kind of words we would like to count, and which works on whole vectors as well?

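library(stringr)
nwords <- function(string, pseudo = FALSE) {
  # pseudo = TRUE counts any run of non-space characters;
  # otherwise only runs of alphabetic characters count as words
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}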

nwords("one,   two three 4,,,, 5 6")
# 3

nwords("one,   two three 4,,,, 5 6", pseudo=T)
# 6

Answer 4

Answered by arekolek

I use the str_count function from the stringr library with the escape sequence \w, which represents:

any 'word' character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered)

Example:

> str_count("How many words are in this sentence", '\\w+')
[1] 7

Of the 9 other answers that I was able to test, only two (by Vincent Zoonekynd and by petermeissner) worked for all inputs presented here so far, but they also require stringr.

But only this solution works with all inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".
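
To see why those two inputs separate \w+ from \S+, the comparison can be reproduced without stringr. The following is a base R sketch rather than the code used in this answer; count_matches is a hypothetical helper name.

```r
# Base R sketch; count_matches is a hypothetical helper name.
count_matches <- function(x, pattern) {
  # Count positive match positions; gregexpr() gives -1 for no match.
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

inputs <- c("foo+bar+baz~spam+eggs",
            "Combien de mots sont dans cette phrase ?")

word_runs  <- count_matches(inputs, "\\w+")  # runs of word characters
space_runs <- count_matches(inputs, "\\S+")  # runs of non-space characters
print(word_runs)
print(space_runs)
```

The first input contains no spaces at all, so \S+ sees a single run, and in the second the detached "?" counts as an extra non-space run.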

Benchmark:

library(stringr)

questions <-
  c(
    "", "x", "x y", "x y!", "x y! z",
    "foo+bar+baz~spam+eggs",
    "one,   two three 4,,,, 5 6",
    "How many words are in this sentence",
    "How  many words    are in this   sentence",
    "Combien de mots sont dans cette phrase ?",
    "
    Day after day, day after day,
    We stuck, nor breath nor motion;
    "
  )

answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)

score <- function(f) sum(unlist(lapply(questions, f)) == answers)

funs <-
  c(
    function(s) sapply(gregexpr("\\W+", s), length) + 1,
    function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
    function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
    function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
    function(s) length(str_match_all(s, "\\S+")[[1]]),
    function(s) str_count(s, "\\S+"),
    function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
    function(s) length(unlist(strsplit(s," "))),
    function(s) sapply(strsplit(s, " "), length),
    function(s) str_count(s, '\\w+')
  )

unlist(lapply(funs, score))

Output:

6 10 10  8  9  9  7  6  6 11

Answer 5

Answered by mathematical.coffee

str2 <- gsub(' {2,}',' ',str1)
length(strsplit(str2,' ')[[1]])

The gsub(' {2,}',' ',str1) makes sure all words are separated by one space only, by replacing all occurrences of two or more spaces with one space.

The strsplit(str2,' ') splits the sentence at every space and returns the result in a list. The [[1]] grabs the vector of words out of that list. The length counts how many words there are.

> str1 <- "How many words are in this     sentence"
> str2 <- gsub(' {2,}',' ',str1)
> str2
[1] "How many words are in this sentence"
> strsplit(str2,' ')
[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> strsplit(str2,' ')[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> length(strsplit(str2,' ')[[1]])
[1] 7
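
Because of the [[1]], the code above handles a single string. A vectorised variant might look like the sketch below; count_by_space is a hypothetical name, and the trimws() call is an addition not present in the answer, included so that leading spaces do not create an empty first piece. It assumes R 3.2.0 or later for trimws() and lengths().

```r
# Sketch of a vectorised variant; count_by_space is a hypothetical name.
count_by_space <- function(x) {
  # Drop leading/trailing whitespace, then collapse runs of spaces to one.
  squeezed <- gsub(" {2,}", " ", trimws(x))
  # lengths() gives the number of pieces for every element of the list.
  lengths(strsplit(squeezed, " "))
}

x <- c("How many words are in this     sentence", "  x y ", "")
print(count_by_space(x))
```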

Answer 6

Answered by Vincent Zoonekynd

You can use str_match_all, with a regular expression that would identify your words. The following works with initial, final and duplicated spaces.

library(stringr)
s <-  "
  Day after day, day after day,
  We stuck, nor breath nor motion;
"
m <- str_match_all( s, "\\S+" )  # Sequences of non-spaces
length(m[[1]])

Answer 7

Answered by bartektartanus

Try this function from the stringi package:

> require(stringi)
> s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
+        "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
+        "Cras vel lorem. Etiam pellentesque aliquet tellus.",
+        "")
> stri_stats_latex(s)
    CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs 
          133             0            30            24             0             0

Answer 8

Answered by yuqian

You can use the wc function from the qdap library:

> str1 <- "How many words are in this sentence"
> wc(str1)
[1] 7

Answer 9

Answered by Murali Menon

You can remove double spaces and count the number of " " in the string to get the count of words. Use stringr and rm_white {qdapRegex}:

str_count(rm_white(s), " ") +1

Answer 10

Answered by Sangram

Try this:

length(unlist(strsplit(str1," ")))
