在 Python 中生成随机的 UTF-8 字符串

Question

提问by l0b0

I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.

我想测试我的代码的 Unicode 处理。我可以在 random.choice() 中放入任何东西来从整个 Unicode 范围中进行选择，最好不是外部模块？谷歌和 StackOverflow 似乎都没有答案。

Edit: It looks like this is more complex than expected, so I'll rephrase the question - Is the following code sufficient to generate all valid non-control characters in Unicode?

编辑：看起来这比预期的更复杂，所以我将重新表述这个问题 - 以下代码是否足以在 Unicode 中生成所有有效的非控制字符？

unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112) # 0x10ffff + 1
    if unicodedata.category(unichr(char))[0] in ('LMNPSZ')
    )

Answer 1

采纳答案by Gumbo

There is a UTF-8 stress test from Markus Kuhnyou could use.

您可以使用Markus Kuhn提供的UTF-8 压力测试。

See also Really Good, Bad UTF-8 example test data.

另请参阅“非常好，坏的 UTF-8 示例测试数据”。

Answer 2

回答by Jacob Wan

People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, just extend that part of the example with the code point ranges that you want.

人们可能会主要根据问题标题找到他们的方法，因此这里有一种生成包含各种 Unicode 字符的随机字符串的方法。要包含更多（或更少）可能的字符，只需使用您想要的代码点范围扩展示例的那部分。

import random

def get_random_unicode(length):

    try:
        get_char = unichr
    except NameError:
        get_char = chr

    # Update this to include code point ranges to be sampled
    include_ranges = [
        ( 0x0021, 0x0021 ),
        ( 0x0023, 0x0026 ),
        ( 0x0028, 0x007E ),
        ( 0x00A1, 0x00AC ),
        ( 0x00AE, 0x00FF ),
        ( 0x0100, 0x017F ),
        ( 0x0180, 0x024F ),
        ( 0x2C60, 0x2C7F ),
        ( 0x16A0, 0x16F0 ),
        ( 0x0370, 0x0377 ),
        ( 0x037A, 0x037E ),
        ( 0x0384, 0x038A ),
        ( 0x038C, 0x038C ),
    ]

    alphabet = [
        get_char(code_point) for current_range in include_ranges
            for code_point in range(current_range[0], current_range[1] + 1)
    ]
    return ''.join(random.choice(alphabet) for i in range(length))

if __name__ == '__main__':
    print('A random string: ' + get_random_unicode(10))

Answer 3

回答by Philipp

Here is an example function that probably creates a random well-formed UTF-8 sequence, as defined in Table 3–7 of Unicode 5.0.0:

下面是一个示例函数，它可能会创建一个随机的格式良好的 UTF-8 序列，如 Unicode 5.0.0 的表 3–7 中所定义：

#!/usr/bin/env python3.1

# From Table 3–7 of the Unicode Standard 5.0.0

import random

def byte_range(first, last):
    return list(range(first, last+1))

first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
trailing_values = byte_range(0x80, 0xBF)

def random_utf8_seq():
    first = random.choice(first_values)
    if first <= 0x7F:
        return bytes([first])
    elif first <= 0xDF:
        return bytes([first, random.choice(trailing_values)])
    elif first == 0xE0:
        return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
    elif first == 0xED:
        return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
    elif first <= 0xEF:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF0:
        return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
    elif first <= 0xF3:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF4:
        return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])

print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))

Because of the vastness of the Unicode standard I cannot test this thoroughly. Also note that the characters are not equally distributed (but each byte in the sequence is).

由于 Unicode 标准的广泛性，我无法对此进行彻底测试。还要注意，字符不是均匀分布的（但序列中的每个字节都是）。

Answer 4

回答by aluriak

Follows a code that print any printable character of UTF-8:

遵循打印任何可打印 UTF-8 字符的代码：

print(''.join(tuple(chr(i) for i in range(32, 0x110000) if chr(i).isprintable())))

All printable characters are included above, even those that are not printed by the current font. The clause and not chr(i).isspace()can be added to filter out whitespace characters.

所有可打印的字符都包含在上面，即使是当前字体未打印的字符。and not chr(i).isspace()可以添加该子句以过滤掉空白字符。

Answer 5

回答by Jonathan Leffler

It depends how thoroughly you want to do the testing and how accurately you want to do the generation. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for custom characters. Do you want to worry about generating combining characters at the start of a string (because they should only appear after another character)?

这取决于您希望进行测试的彻底程度以及您希望进行生成的准确程度。总的来说，Unicode 是一个 21 位代码集 (U+0000 .. U+10FFFF)。但是，该范围中的一些相当大的块被留出用于自定义字符。您是否要担心在字符串的开头生成组合字符（因为它们应该只出现在另一个字符之后）？

The basic approach I'd adopt is randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character; can it appear here in the string) and encode valid code points in UTF-8.

我采用的基本方法是随机生成一个 Unicode 代码点（比如 U+2397 或 U+31232），在上下文中验证它（它是否是合法字符；它是否可以出现在字符串中）并在UTF-8。

If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes.

如果您只想检查您的代码是否正确处理格式错误的 UTF-8，您可以使用更简单的生成方案。

Note that you need to know what to expect given the input - otherwise you are not testing; you are experimenting.

请注意，您需要知道在给定输入的情况下会发生什么 - 否则您不会进行测试；你在试验。

Answer 6

回答by Joril

Since Unicode is just a range of - well - codes, what about using unichr() to get the unicode string corresponding to a random number between 0 and 0xFFFF?
(Of course that would give just one codepoint, so iterate as required)

由于 Unicode 只是一系列 - 好吧 - 代码，那么如何使用 unichr() 获取与 0 和 0xFFFF 之间的随机数对应的 unicode 字符串呢？
（当然，这只会给出一个代码点，因此请根据需要进行迭代）

Answer 7

回答by Esteban Küber

You could download a website written in greek or german that uses unicode and feed that to your code.

您可以下载一个使用 unicode 的希腊语或德语网站，并将其提供给您的代码。

Answer 8

回答by John Machin

Answering revised question:

回答修改后的问题：

Yes, on a strict definition of "control characters" -- note that you won't include CR, LF, and TAB; is that what you want?

是的，对于“控制字符”的严格定义——请注意，您不会包含 CR、LF 和 TAB；那是你要的吗？

Please consider responding to my earlier invitation to tell us what you are really trying to do.

请考虑回应我之前的邀请，告诉我们您真正想要做什么。

在 Python 中生成随机的 UTF-8 字符串

提问by l0b0

采纳答案by Gumbo

回答by Jacob Wan

回答by Philipp

回答by aluriak

回答by Jonathan Leffler

回答by Joril

回答by Esteban Küber

回答by John Machin

相关推荐

最近更新

标签

在 Python 中生成随机的 UTF-8 字符串

提问by l0b0

采纳答案by Gumbo

回答by Jacob Wan

回答by Philipp

回答by aluriak

回答by Jonathan Leffler

回答by Joril

回答by Esteban Küber

回答by John Machin

相关推荐

如何使用 Python 识别二进制和文本文件？

%s 的可变长度与 python 中的 % 运算符

颠覆 python 绑定文档？

python 简单的物体识别

相关推荐

最近更新

标签