Detecting strings with non-English characters in Python

Question

Asked by TJ1

I have some strings that contain a mix of English and non-English letters. For example:

w='_1991_??_??2'

How can I recognize these types of string using Regex or any other fast method in Python?

I prefer not to compare letters of the string one by one with a list of letters, but to do this in one shot and quickly.

Answer 1

Accepted answer by Salvador Dali

You can check whether the string can be encoded using only ASCII characters (the Latin alphabet plus digits, punctuation, and control characters). If it cannot, it contains characters from some other alphabet.

Note the # -*- coding: ... -*- comment. In Python 2 it must be at the top of the file, otherwise the non-ASCII literals in the source cause an encoding error. Python 3 reads source files as UTF-8 by default, so the comment is optional there.

# -*- coding: utf-8 -*-
def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

assert not isEnglish('slabiky, ale liší se podle významu')
assert isEnglish('English')
assert not isEnglish('Текст на русском')
assert not isEnglish('how about this one : 通 asf?')
assert isEnglish('?fd4))45s&')
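
A slightly shorter variant of the same idea, as a sketch: encode straight to ASCII and catch UnicodeEncodeError instead of round-tripping through UTF-8. The function name is_ascii_encodable is hypothetical and does not come from the answer above.

```python
def is_ascii_encodable(s):
    """Return True if every character of s is in the 7-bit ASCII range."""
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        # At least one character has no ASCII representation.
        return False
    return True


print(is_ascii_encodable('English'))           # True
print(is_ascii_encodable('Текст на русском'))  # False
```

Like the original, this treats digits and ASCII punctuation as "English"; it only flags characters outside ASCII.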

Answer 2

Answered by Katerina

If you work with byte strings (Python 2 str, not unicode objects), you can strip punctuation with translate() and check the rest with isalnum(), which avoids raising and catching exceptions. Note that any remaining whitespace also makes isalnum() return False:

import string

def isEnglish(s):
    return s.translate(None, string.punctuation).isalnum()


print isEnglish('slabiky, ale liší se podle významu')
print isEnglish('English')
print isEnglish('?? ???????? ?? ?????? ??')
print isEnglish('how about this one : 通 asf?')
print isEnglish('?fd4))45s&')
print isEnglish('Текст на русском')

> False
> True
> False
> False
> True
> False

You can also filter non-ASCII characters out of a string with this function:

ascii = set(string.printable)   

def remove_non_ascii(s):
    return filter(lambda x: x in ascii, s)


remove_non_ascii('slabiky, ale liší se podle významu')
> slabiky, ale li se podle vznamu
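
The code above relies on the Python 2 form of str.translate() and on filter() returning a string. A rough Python 3 equivalent might look like the sketch below. It assumes Python 3.7+ for str.isascii(), and the name is_ascii_alnum is hypothetical. The explicit ASCII check is needed because in Python 3 isalnum() is also True for non-ASCII letters.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)
_PRINTABLE = set(string.printable)


def is_ascii_alnum(s):
    """True if s, minus ASCII punctuation, consists only of ASCII letters/digits."""
    cleaned = s.translate(_STRIP_PUNCT)
    return cleaned.isascii() and cleaned.isalnum()


def remove_non_ascii(s):
    """Drop every character that is not in string.printable."""
    return ''.join(ch for ch in s if ch in _PRINTABLE)


print(is_ascii_alnum('?fd4))45s&'))  # True
print(is_ascii_alnum('Текст'))       # False
print(remove_non_ascii('liší se'))   # li se
```

As with the Python 2 version, a string containing spaces is rejected by is_ascii_alnum, because a space is not alphanumeric.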

Answer 3

Answered by PemaGrg

import re

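# match() anchors at the start of the string, so this pattern only tests
# whether the first character is a lowercase ASCII letter.
english_check = re.compile(r'[a-z]')

if english_check.match(w):
    print("english", w)
else:
    print("other:", w)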
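
Since the question asks for a regex that handles the whole string in one shot, here is a minimal sketch that searches for any character outside the ASCII range. The names NON_ASCII and has_non_ascii are illustrative, not part of the answer above.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def has_non_ascii(s):
    """Return True if s contains at least one non-ASCII character."""
    return NON_ASCII.search(s) is not None


print(has_non_ascii('_1991_abc_2'))         # False
print(has_non_ascii('how about this: 通'))  # True
```

search() stops at the first offending character, so the whole string is scanned only when it is pure ASCII.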

Answer 4

Answered by Furkan

w.isidentifier()

You can easily see the method in docs:

Return true if the string is a valid identifier according to the language definition, section Identifiers and keywords.
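
One caveat may be worth checking before relying on this: Python 3 identifiers can contain non-ASCII letters, and they cannot contain spaces or start with a digit. The short sketch below, using sample strings chosen for illustration, shows how isidentifier() behaves on a few inputs.

```python
samples = ['english_word', 'café', 'Текст', 'two words', '1991']

# Map each sample to whether Python accepts it as an identifier.
results = {s: s.isidentifier() for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```

Here 'café' and 'Текст' are reported as valid identifiers, while 'two words' and '1991' are not, so the method answers a different question than "is this string English?".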

Answer 5

Answered by roi3363

I believe this one has a minimal runtime, since all() stops as soon as it finds a character outside the allowed set. It also uses a generator expression, so no intermediate list is built.

import string

def has_only_latin_letters(name):
    # string.printable covers ASCII letters, digits, punctuation and whitespace,
    # which the examples with spaces and numbers require
    char_set = set(string.printable)
    return all(x in char_set for x in name)

>>> has_only_latin_letters('_1991_??_??2')
False
>>> has_only_latin_letters('bla bla')
True
>>> has_only_latin_letters('bl? bl?')
False
>>> has_only_latin_letters('????????')
False
>>> has_only_latin_letters('also a string with numbers and punctuation 1, 2, 4')
True

You can also use a different set of characters:

>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'

>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

>>> string.digits
'0123456789'

>>> string.digits + string.ascii_lowercase
'0123456789abcdefghijklmnopqrstuvwxyz'

>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

To add Latin accented letters, you can refer to this post.
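
As one possible way to accept accented Latin letters without listing them by hand, the sketch below uses the standard unicodedata module and treats a letter as Latin when its Unicode name starts with "LATIN". This is a heuristic rather than a true script check, and the name is_latin_text is hypothetical.

```python
import unicodedata


def is_latin_text(s):
    """True if every alphabetic character in s has a Unicode name starting with LATIN.

    Digits, whitespace and punctuation are ignored.
    """
    for ch in s:
        if ch.isalpha() and not unicodedata.name(ch, '').startswith('LATIN'):
            return False
    return True


print(is_latin_text('slabiky, ale liší se podle významu'))  # True
print(is_latin_text('Текст на русском'))                    # False
```

Some edge cases, such as fullwidth Latin letters whose names start with "FULLWIDTH", are rejected by this prefix test.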

Answer 6

Answered by Torello

IMHO this is the simplest solution (str.isascii() requires Python 3.7 or later):

def isEnglish(s):
  return s.isascii()

print(isEnglish("Test"))
print(isEnglish("_1991_??_??2"))

Output:
True
False
