string: How to get the number of characters in a string?

Question

Asked by Ammar

How can I get the number of characters of a string in Go?

For example, if I have a string "hello", the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters, so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.

Answer 1

Answered by VonC

You can try RuneCountInString from the utf8 package.

returns the number of runes in p

As illustrated in this script, the byte length of "World" might be 6 (when written in Chinese: "世界"), but its rune count is 2:

package main

import "fmt"
import "unicode/utf8"

func main() {
    fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}

Phrozen adds in the comments:

Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.

And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)

The compiler detects the len([]rune(string)) pattern automatically and replaces it with a for r := range s loop.

Adds a new runtime function to count runes in a string. Modifies the compiler to detect the pattern len([]rune(string)) and replaces it with the new rune counting runtime function.

RuneCount/lenruneslice/ASCII                  27.8ns ± 2%  14.5ns ± 3%  -47.70%  (p=0.000 n=10+10)
RuneCount/lenruneslice/Japanese                126ns ± 2%    60ns ± 2%  -52.03%  (p=0.000 n=10+10)
RuneCount/lenruneslice/MixedLength             104ns ± 2%    50ns ± 1%  -51.71%  (p=0.000 n=10+9)

Stefan Steiger points to the blog post "Text normalization in Go":

What is a character?

As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.
The definition of a character may vary depending on the application.
For normalization we will define it as:
a sequence of runes that starts with a starter,
a rune that does not modify or combine backwards with any other rune,
followed by possibly empty sequence of non-starters, that is, runes that do (typically accents).
The normalization algorithm processes one character at a time.

Using that package and its Iter type, the actual number of "characters" would be:

package main

import "fmt"
import "golang.org/x/text/unicode/norm"

func main() {
    var ia norm.Iter
    ia.InitString(norm.NFKD, "école")
    nc := 0
    for !ia.Done() {
        nc = nc + 1
        ia.Next()
    }
    fmt.Printf("Number of chars: %d\n", nc)
}

This uses the Unicode Normalization form NFKD ("Compatibility Decomposition").

Oliver's answer points to Unicode Text Segmentation as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.

For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.

That will actually count "grapheme clusters", where multiple code points may be combined into one user-perceived character.

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    gr := uniseg.NewGraphemes("👍🏼!")
    for gr.Next() {
        fmt.Printf("%x ", gr.Runes())
    }
    // Output: [1f44d 1f3fc] [21]
}

Two graphemes, even though there are three runes (Unicode code points).

Answer 2

Answered by Denis Kreshikhin

There is a way to get the count of runes without any packages: convert the string to []rune, as in len([]rune(YOUR_STRING)):

package main

import "fmt"

func main() {
    russian := "Спутник и погром"
    english := "Sputnik & pogrom"

    fmt.Println("count of bytes:",
        len(russian),
        len(english))

    fmt.Println("count of runes:",
        len([]rune(russian)),
        len([]rune(english)))

}

count of bytes: 30 16
count of runes: 16 16

Answer 3

Answered by zzzz

It depends a lot on your definition of what a "character" is. If "rune equals a character" is OK for your task (generally it isn't), then the answer by VonC is perfect for you. Otherwise, it should probably be noted that there are few situations where the number of runes in a Unicode string is an interesting value. And even in those situations it's better, if possible, to infer the count while "traversing" the string as the runes are processed, to avoid doubling the UTF-8 decode effort.
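
A minimal sketch of that last point, assuming the per-rune work is upper-casing: the count is accumulated in the same loop that processes the runes, so the string is decoded only once. upperAndCount is a hypothetical name used for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// upperAndCount is a hypothetical example of doing real work per rune
// (upper-casing here) while counting runes in the same pass.
func upperAndCount(s string) (string, int) {
	var b strings.Builder
	b.Grow(len(s))
	n := 0
	for _, r := range s {
		b.WriteRune(unicode.ToUpper(r))
		n++
	}
	return b.String(), n
}

func main() {
	out, n := upperAndCount("спутник")
	fmt.Println(out, n) // СПУТНИК 7
}
```

Calling utf8.RuneCountInString separately would give the same count but decode the string a second time.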

Answer 4

Answered by masakielastic

If you need to take grapheme clusters into account, use the regexp or unicode package. Counting the number of code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unlimited. If you want to eliminate extremely long sequences, check whether the sequences conform to the stream-safe text format.

package main

import (
    "regexp"
    "unicode"
    "strings"
)

func main() {

    str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
    str2 := "a" + strings.Repeat("\u0308", 1000)

    println(4 == GraphemeCountInString(str))
    println(4 == GraphemeCountInString2(str))

    println(1 == GraphemeCountInString(str2))
    println(1 == GraphemeCountInString2(str2))

    println(true == IsStreamSafeString(str))
    println(false == IsStreamSafeString(str2))
}


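// GraphemeCountInString approximates the grapheme cluster count with a regexp:
// a non-mark rune followed by any number of marks, or any single rune.
func GraphemeCountInString(str string) int {
    // A raw string literal is required: "\P" is not a valid escape
    // in an interpreted Go string.
    re := regexp.MustCompile(`\PM\pM*|.`)
    return len(re.FindAllString(str, -1))
}

// GraphemeCountInString2 does the same with the unicode package.
func GraphemeCountInString2(str string) int {
    length := 0
    seenBase := false // set once the first non-mark rune has been seen

    for _, c := range str {
        if !unicode.Is(unicode.M, c) {
            // A non-mark rune starts a new cluster.
            length++
            seenBase = true
        } else if !seenBase {
            // Marks that precede any base rune are counted individually.
            length++
        }
    }

    return length
}

// IsStreamSafeString reports whether no base rune is followed by
// 30 or more marks.
func IsStreamSafeString(str string) bool {
    re := regexp.MustCompile(`\PM\pM{30,}`)
    return !re.MatchString(str)
}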

Answer 5

Answered by Oliver

I should point out that none of the answers provided so far give you the number of characters as you would expect, especially when you're dealing with emojis (but also some languages like Thai, Korean, or Arabic). VonC's suggestions will output the following:

fmt.Println(utf8.RuneCountInString("🏳️‍🌈🇩🇪")) // Outputs "6".
fmt.Println(len([]rune("🏳️‍🌈🇩🇪"))) // Outputs "6".

That's because these methods only count Unicode code points. There are many characters which can be composed of multiple code points.

Same for using the Normalization package:

var ia norm.Iter
ia.InitString(norm.NFKD, "🏳️‍🌈🇩🇪")
nc := 0
for !ia.Done() {
    nc = nc + 1
    ia.Next()
}
fmt.Println(nc) // Outputs "6".

Normalization is not really the same as counting characters and many characters cannot be normalized into a one-code-point equivalent.

masakielastic's answer comes close but only handles modifiers (the rainbow flag contains a modifier which is thus not counted as its own code point):

fmt.Println(GraphemeCountInString("🏳️‍🌈🇩🇪"))  // Outputs "5".
fmt.Println(GraphemeCountInString2("🏳️‍🌈🇩🇪")) // Outputs "5".

The correct way to split Unicode strings into (user-perceived) characters, i.e. grapheme clusters, is defined in the Unicode Standard Annex #29. The rules can be found in Section 3.1.1. The github.com/rivo/uniseg package implements these rules so you can determine the correct number of characters in a string:

fmt.Println(uniseg.GraphemeClusterCount("🏳️‍🌈🇩🇪")) // Outputs "2".

Answer 6

Answered by pigletfly

There are several ways to get a string length:

package main

import (
    "bytes"
    "fmt"
    "strings"
    "unicode/utf8"
)

func main() {
    b := "这是个测试"
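    len1 := len([]rune(b))                  // convert to a rune slice and take its length
    len2 := bytes.Count([]byte(b), nil) - 1 // an empty separator yields the rune count + 1
    len3 := strings.Count(b, "") - 1        // same rule for strings
    len4 := utf8.RuneCountInString(b)       // counts runes without converting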
    fmt.Println(len1)
    fmt.Println(len2)
    fmt.Println(len3)
    fmt.Println(len4)

}
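
As a quick check that these four approaches agree, here is a minimal sketch that also feeds them a string containing an invalid UTF-8 byte. runeCounts is a hypothetical helper introduced for this comparison.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeCounts is a hypothetical helper that returns the result of the four
// approaches above, in the same order, for a single input string.
func runeCounts(s string) [4]int {
	return [4]int{
		len([]rune(s)),
		bytes.Count([]byte(s), nil) - 1,
		strings.Count(s, "") - 1,
		utf8.RuneCountInString(s),
	}
}

func main() {
	fmt.Println(runeCounts("这是个测试")) // [5 5 5 5]
	fmt.Println(runeCounts("a\xffb"))   // [3 3 3 3]: the invalid byte counts as one rune
}
```

All four count code points, so none of them accounts for combining marks or grapheme clusters.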

Answer 7

Answered by Marcelloh

I tried to make the normalization a bit faster:

    en, _ = glyphSmart(data)

    // glyphSmart counts the runes in text by ranging over it. The second
    // return value is always 0 and is kept only to preserve the signature.
    func glyphSmart(text string) (int, int) {
        gc := 0
        for range text {
            gc++
        }
        return gc, 0
    }
