Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/27084617/

Date: 2020-08-19 01:25:18  Source: igfitidea

Detect strings with non English characters in Python

Tags: python, regex, non-english

Asked by TJ1

I have some strings that contain a mix of English and non-English letters. For example:

w='_1991_??_??2'

How can I recognize these types of string using Regex or any other fast method in Python?

I would prefer not to compare the letters of the string one by one against a list of letters, but to do this quickly and in one shot.

Accepted answer by Salvador Dali

You can just check whether the string can be encoded using only ASCII characters (the Latin alphabet plus some other characters). If it cannot, it contains characters from some other alphabet.

Note the comment # -*- coding: ... -*-. It should be at the top of the Python file when the source contains non-ASCII literals (otherwise Python 2 reports an encoding error; Python 3 assumes UTF-8 source by default).

# -*- coding: utf-8 -*-
def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

assert not isEnglish('slabiky, ale li?í se podle vyznamu')
assert isEnglish('English')
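# The non-ASCII text of the next example was lost in extraction (it now reads as
# plain "?" characters), so the assertion is commented out; as written it would fail.
# assert not isEnglish('?? ???????? ?? ?????? ??')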
assert not isEnglish('how about this one : 通 asf?')
assert isEnglish('?fd4))45s&')
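
The same idea can probably be written by encoding straight to ASCII, which skips the UTF-8 round trip. This is only a sketch for Python 3, not the answer's own code, and the name is_ascii_encodable is made up for illustration:

```python
def is_ascii_encodable(s):
    """Return True if every character of s fits in 7-bit ASCII."""
    try:
        # str.encode raises UnicodeEncodeError on the first non-ASCII character
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii_encodable('English'))
print(is_ascii_encodable('how about this one : 通'))
```

Note that this treats digits and punctuation as "English" too, exactly like the function above.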

Answer by Katerina

If you work with byte strings (Python 2 str, not unicode objects), you can strip the punctuation with translate() and check the result with isalnum(), which is better than throwing exceptions:

import string

def isEnglish(s):
    return s.translate(None, string.punctuation).isalnum()


print isEnglish('slabiky, ale li?í se podle vyznamu')
print isEnglish('English')
print isEnglish('?? ???????? ?? ?????? ??')
print isEnglish('how about this one : 通 asf?')
print isEnglish('?fd4))45s&')
print isEnglish('Текст на русском')

> False
> True
> False
> False
> True
> False
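
In Python 3, str.translate() takes a single translation table and isalnum() also accepts non-ASCII letters, so the code above does not run unchanged there. A rough Python 3 equivalent might look like this, assuming Python 3.7+ for str.isascii(); is_english_py3 is a hypothetical name, not part of the answer:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans('', '', string.punctuation)


def is_english_py3(s):
    """True if s, minus ASCII punctuation, is non-empty ASCII letters/digits only."""
    cleaned = s.translate(_STRIP_PUNCTUATION)
    # isalnum() alone would accept Cyrillic or CJK letters, hence the isascii() guard.
    return cleaned.isascii() and cleaned.isalnum()


print(is_english_py3('English'))
print(is_english_py3('Текст на русском'))
```

As in the original, whitespace is not stripped, so any string containing a space is rejected.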

You can also filter non-ASCII characters out of a string with this function:

ascii = set(string.printable)   

def remove_non_ascii(s):
    return filter(lambda x: x in ascii, s)


remove_non_ascii('slabiky, ale li?í se podle vyznamu')
> slabiky, ale li se podle vznamu
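
In Python 3, filter() returns an iterator rather than a string, so the function above would need a join. A possible Python 3 version, offered as a sketch (the constant name PRINTABLE_ASCII is an assumption, chosen to avoid shadowing the built-in ascii()):

```python
import string

# All printable ASCII characters, as a set for fast membership tests.
PRINTABLE_ASCII = set(string.printable)


def remove_non_ascii(s):
    """Return s with every character outside string.printable dropped."""
    return ''.join(ch for ch in s if ch in PRINTABLE_ASCII)


print(remove_non_ascii('liší se'))  # prints: li se
```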

Answer by PemaGrg

import re

english_check = re.compile(r'[a-z]')

if english_check.match(w):
    print "english",w
else:
    print "other:",w
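
Note that match() with the pattern [a-z] only tests whether the first character is a lowercase ASCII letter. If the goal is to test the whole string, something along these lines may be closer; it is a sketch for Python 3.4+ (re.fullmatch), and the names ASCII_ONLY and is_ascii_only are illustrative, not from the answer:

```python
import re

# Matches only if every character lies in the 7-bit ASCII range.
ASCII_ONLY = re.compile(r'[\x00-\x7F]*')


def is_ascii_only(s):
    """True if the entire string consists of ASCII characters."""
    return ASCII_ONLY.fullmatch(s) is not None


w = 'English text, 1991'  # sample input, assumed for the demo
if is_ascii_only(w):
    print("english:", w)
else:
    print("other:", w)
```

To accept letters only, the character class could be narrowed to [A-Za-z].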

Answer by Furkan

w.isidentifier()

The docs describe the method as follows:

Return true if the string is a valid identifier according to the language definition, section Identifiers and keywords.
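
One caveat worth checking before relying on this: Python 3 identifiers may contain non-ASCII letters, and spaces or punctuation are never allowed in them, so isidentifier() alone does not answer the question. A small sketch of the behaviour, assuming Python 3.7+ for str.isascii(); is_ascii_identifier is a hypothetical helper:

```python
def is_ascii_identifier(s):
    """True only for identifiers made of ASCII letters, digits and underscores."""
    return s.isascii() and s.isidentifier()


for sample in ['English', 'naïve', 'two words', '1abc']:
    # Show the raw isidentifier() result next to the ASCII-restricted one.
    print(repr(sample), sample.isidentifier(), is_ascii_identifier(sample))
```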

Answer by roi3363

I believe this one has a minimal runtime, since it stops as soon as it finds a character that is not a Latin letter. It also uses a generator for better memory usage.

import string

def has_only_latin_letters(name):
    char_set = string.ascii_letters
    return all(x in char_set for x in name)

>>> has_only_latin_letters('_1991_??_??2')
False
>>> has_only_latin_letters('bla bla')  # the space is not in string.ascii_letters
False
>>> has_only_latin_letters('bl? bl?')
False
>>> has_only_latin_letters('????????')
False
>>> has_only_latin_letters('also a string with numbers and punctuation 1, 2, 4')  # spaces, digits and commas are not letters
False

You can also use a different set of characters:

>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'

>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

>>> string.digits
'0123456789'

>>> string.digits + string.ascii_lowercase
'0123456789abcdefghijklmnopqrstuvwxyz'

>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
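
One way to make the character set swappable is to pass it as a parameter. This is just a sketch; has_only_allowed and its allowed argument are names invented here, and the default of string.printable is an assumption that spaces, digits and punctuation should count as acceptable:

```python
import string


def has_only_allowed(text, allowed=string.printable):
    """True if every character of text is in the allowed set."""
    allowed_set = frozenset(allowed)  # set lookup is O(1) per character
    # all() short-circuits on the first character outside the set.
    return all(ch in allowed_set for ch in text)


print(has_only_allowed('bla bla'))                        # True
print(has_only_allowed('bla bla', string.ascii_letters))  # False: space is not a letter
```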

To add accented Latin letters, you can refer to this post.

Answer by Torello

IMHO this is the simplest solution (str.isascii() requires Python 3.7 or later):

def isEnglish(s):
  return s.isascii()

print(isEnglish("Test"))
print(isEnglish("_1991_??_??2"))

Output:
True
False