How can I guess the encoding of a string in Perl?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1970660/

Date: 2020-09-09 00:35:49  Source: igfitidea

perl unicode string

Asked by Maulin

I have a Unicode string and don't know what its encoding is. When this string is read by a Perl program, is there a default encoding that Perl will use? If so, how can I find out what it is?

I am trying to get rid of non-ASCII characters from the input. I found this on some forum that will do it:

my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});

How will the above work when no input encoding is specified? Should it be specified like the following?

my $line = encode('ascii', normalize('KD', decode($myutf, 'input-encoding'), sub {$_[0] = ''});

Answered by daxim

To find out which encoding some unknown data uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)

use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
              "\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
              "\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
              "\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030

use Encode;
my $string = decode($encoding_name, $unknown);
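
Encode::Guess, mentioned above, ships with Perl and works from a list of suspect encodings that you supply. The following is a minimal sketch rather than part of the original answer: the helper name guess_and_decode is hypothetical, and the suspect euc-cn is an assumption chosen to match the sample bytes.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;   # tries ascii, utf8 and BOM-marked UTF-16/32 by default

# Hypothetical helper: returns (encoding name, decoded string), or dies
# when the guess fails or is ambiguous.
sub guess_and_decode {
    my ($octets, @suspects) = @_;
    my $decoder = guess_encoding($octets, @suspects);
    # On failure guess_encoding returns an error message instead of an object.
    ref $decoder or die "Cannot guess encoding: $decoder\n";
    return ($decoder->name, $decoder->decode($octets));
}

# Same bytes as in the Encode::Detect example; euc-cn is an assumed suspect.
my $unknown = "This year I went to \xb1\xb1\xbe\xa9 Perl workshop.";
my ($name, $string) = guess_and_decode($unknown, 'euc-cn');
print "$name\n";
```

Unlike Encode::Detect, this approach only chooses among the suspects it is given, so it suits cases where the set of plausible encodings is already small.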

I find encode 'ascii' a lame solution for getting rid of non-ASCII characters. Everything will be substituted with question marks; this is too lossy to be useful.

# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.

If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as the plain encode above.

use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing  Perl workshop.

However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either PERLQQ or XMLCREF.

use utf8;
use Encode qw(encode PERLQQ XMLCREF);
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string, PERLQQ);  # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF); # This year I went to &#x5317;&#x4eac; Perl workshop.
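
For the pipeline in the question, note that decode takes the encoding name first and the octets second. The sketch below is not from the original answer: the helper name to_plain_ascii is hypothetical, and it assumes the input encoding is already known or has been guessed. It keeps the question's own (lossy) strategy of dropping whatever ASCII cannot represent.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(normalize);

# Hypothetical helper; $encoding must be known or guessed beforehand.
sub to_plain_ascii {
    my ($octets, $encoding) = @_;
    my $chars      = decode($encoding, $octets);   # encoding name comes first
    my $decomposed = normalize('KD', $chars);      # e.g. U+00E9 becomes "e" + U+0301
    # The callback's return value replaces each character ASCII cannot hold.
    return encode('ascii', $decomposed, sub { '' });
}

print to_plain_ascii("caf\xc3\xa9", 'UTF-8'), "\n";   # cafe
```

Characters without an ASCII-compatible decomposition, such as 北京, disappear entirely with this approach, which is why the reversible PERLQQ and XMLCREF fallbacks are usually the safer choice.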

Answered by brian d foy

The Encode module offers a way to try this. You decode the raw octets with what you think the encoding is. If the octets are not valid in that encoding, decode croaks (when called with FB_CROAK) and you catch the error with an eval. Otherwise, you get back a properly decoded string. For example:

 use Encode;

 my $a_with_ring =
   eval { decode( 'UTF-8', "\x6b\xc5", Encode::FB_CROAK ) }
     or die "Could not decode string: $@";

This has the drawback that the same octet sequence can be valid in multiple encodings, so a successful decode does not prove that the guess was right.
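
That drawback is easy to observe by trying several candidates on the same octets. The following is a minimal sketch under stated assumptions: the helper name valid_encodings and the candidate list are illustrative only and do not come from the original answer.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: returns the candidate encodings under which the
# octets decode without error.
sub valid_encodings {
    my ($octets, @candidates) = @_;
    my @valid;
    for my $name (@candidates) {
        my $copy = $octets;   # decode() with a CHECK value may modify its argument
        # The trailing 1 keeps eval true even if the decoded string is "0" or "".
        push @valid, $name if eval { decode($name, $copy, Encode::FB_CROAK); 1 };
    }
    return @valid;
}

my $octets = "\xe5\x8c\x97";   # U+5317 encoded as UTF-8
print join(', ', valid_encodings($octets, qw(ascii UTF-8 iso-8859-1))), "\n";
# prints: UTF-8, iso-8859-1
```

The octets pass as both UTF-8 and ISO-8859-1 because every byte sequence is valid ISO-8859-1, so the test can only rule encodings out, never confirm one.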

I have more to say about this in the upcoming Effective Perl Programming, 2nd Edition, which has an entire chapter on dealing with Unicode. I think my publisher would get mad if I posted the whole thing though. :)

You might also want to see Juerd's Unicode Advice, as well as some of the Unicode docs that come with Perl.

Answered by muruga

You can also use the following code to encrypt and decrypt a string:

# XOR each character with a key derived from the message length and the
# character position. Applying the function twice restores the input.
# This is simple obfuscation, not real encryption.
sub ENCRYPT_DECRYPT {
    my ($Str_Message) = @_;
    my $Len_Str_Message = length($Str_Message);

    my $Str_Encrypted_Message = "";
    for (my $Position = 0; $Position < $Len_Str_Message; $Position++) {
        my $Key_To_Use = ($Len_Str_Message + $Position + 1) % 255;
        my $Ascii_Num_Byte_To_Encrypt = ord(substr($Str_Message, $Position, 1));
        my $Xored_Byte = $Ascii_Num_Byte_To_Encrypt ^ $Key_To_Use;
        $Str_Encrypted_Message .= chr($Xored_Byte);
    }
    return $Str_Encrypted_Message;
}

my $var = ENCRYPT_DECRYPT("hai");
print ENCRYPT_DECRYPT($var); # hai