ruby 1.9, force_encoding, but check

Question

Asked by jrochkind

I have a string I have read from some kind of input.

To the best of my knowledge, it is UTF8. Okay:

string.force_encoding("UTF-8")

But if this string has bytes in it that are not in fact legal UTF8, I want to know now and take action.

Ordinarily, will force_encoding("UTF-8") raise if it encounters such bytes? I believe it will not.

If I were doing an #encode, I could choose from the handy options for what to do with characters that are invalid in the source encoding (or destination encoding).

But I'm not doing an #encode, I'm doing a #force_encoding. It has no such options.

Would it make sense to

string.force_encoding("UTF-8").encode("UTF-8")

to get an exception right away? Normally, encoding from utf8 to utf8 doesn't make any sense. But maybe this is the way to get it to raise right away if there are invalid bytes? Or to use the :replace option etc. to do something different with invalid bytes?

But no, can't seem to make that work either.

Anyone know?

1.9.3-p0 :032 > a = "bad: \xc3\x28 okay".force_encoding("utf-8")
=> "bad: \xC3( okay"
1.9.3-p0 :033 > a.valid_encoding?
=> false
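
Since force_encoding itself never raises, one way to "know now" is to follow it with the valid_encoding? check shown above and raise by hand. A minimal sketch, assuming a hypothetical helper named force_utf8_or_raise (it is not part of Ruby, and the choice of ArgumentError is arbitrary):

```ruby
# Hypothetical helper, not a core method: tag the bytes as UTF-8 and
# fail immediately if they are not a legal UTF-8 sequence.
def force_utf8_or_raise(input)
  str = input.dup.force_encoding("UTF-8")
  unless str.valid_encoding?
    raise ArgumentError, "input is not valid UTF-8"
  end
  str
end

# A valid two-byte sequence is returned unchanged.
ok = force_utf8_or_raise("caf\xC3\xA9")

# The invalid pair from the transcript above is rejected.
begin
  force_utf8_or_raise("bad: \xC3\x28 okay")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

This only detects the problem; it does not say which bytes are bad or repair them.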

Okay, but how do I find and eliminate those bad bytes? Oddly, this does NOT raise:

1.9.3-p0 :035 > a.encode("utf-8")
 => "bad: \xC3( okay"

If I was converting to a different encoding, it would!

1.9.3-p0 :039 > a.encode("ISO-8859-1")
Encoding::InvalidByteSequenceError: "\xC3" followed by "(" on UTF-8

Or, if I told it to, it would replace the bad byte with a "?":

1.9.3-p0 :040 > a.encode("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"

So ruby's got the smarts to know what the bad bytes in utf-8 are, and to replace them with something else -- when converting to a different encoding. But I don't want to convert to a different encoding, I want to stay utf8 -- but I might want to raise if there's an invalid byte in there, or I might want to replace invalid bytes with replacement chars.

Isn't there some way to get ruby to do this?

Update: I believe this has finally been added to ruby in 2.1, with String#scrub present in the 2.1 preview release to do this. So look for that!

Answer 1

Accepted answer by jrochkind

(update: see https://github.com/jrochkind/scrub_rb)

So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb

But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":

a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"

Yep, that's exactly what I wanted. So turns out this IS built into 1.9 stdlib, it's just undocumented and few people know it (or maybe few people that speak English know it?). Although I saw these arguments used this way on a blog somewhere, so someone else knew it!
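
One thing worth checking before relying on this: with 'binary' as the source encoding, every byte above 0x7F counts as undefined, so valid multi-byte UTF-8 characters appear to be replaced along with the bad bytes. A small sketch to verify this on your own Ruby (the sample string is made up for illustration):

```ruby
# "caf\xC3\xA9" is valid UTF-8 ("café"); "\xC3\x28" is the invalid pair
# from the question.
mixed = "caf\xC3\xA9 bad: \xC3\x28".dup.force_encoding("UTF-8")

cleaned = mixed.encode("UTF-8", "binary", :undef => :replace)

puts cleaned.valid_encoding?      # is the result valid UTF-8?
puts cleaned.include?("\u00E9")   # did the valid "é" survive?
puts cleaned.count("\uFFFD")      # how many bytes were replaced?
```

If the input is ASCII plus occasional garbage, the trick does what the answer describes. If the input has real non-ASCII text, String#scrub (Ruby 2.1+) or a per-character valid_encoding? check keeps the valid characters.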

Answer 2

Answer by jrochkind

In ruby 2.1, the stdlib finally supports this with scrub.

http://ruby-doc.org/core-2.1.0/String.html#method-i-scrub
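
A short sketch of how String#scrub can be used on the string from the question (requires Ruby 2.1 or later):

```ruby
a = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")

a.scrub        # => "bad: \uFFFD( okay"  (default: U+FFFD replacement character)
a.scrub("?")   # => "bad: ?( okay"
a.scrub("")    # => "bad: ( okay"        (drop the bad bytes)

# With a block, the invalid bytes are passed in; here they are shown as hex.
a.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }   # => "bad: <c3>( okay"

# scrub returns a new string; the original is unchanged and still invalid.
a.valid_encoding?   # => false
```

To raise instead of replacing, check a.valid_encoding? first and raise yourself; scrub has no raising mode.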

Answer 3

Answer by peter

Make sure that your script file itself is saved as UTF-8, and try the following:

# encoding: UTF-8
p [a = "bad: \xc3\x28 okay", a.valid_encoding?]
p [a.force_encoding("utf-8"), a.valid_encoding?]
p [a.encode!("ISO-8859-1", :invalid => :replace), a.valid_encoding?]

On my Windows 7 system this gives the following:

["bad: \xC3( okay", false]
["bad: \xC3( okay", false]
["bad: ?( okay", true]

So your bad char is replaced. You can do it in one step as follows:

a = "bad: \xc3\x28 okay".encode!("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"

EDIT: here is a solution that works on any arbitrary encoding. The first method re-encodes only the bad chars; the second just replaces them with a "?".

def validate_encoding(str)
  str.chars.collect do |c|
    c.valid_encoding? ? c : c.encode!(Encoding.locale_charmap, :invalid => :replace)
  end.join
end

def validate_encoding2(str)
  str.chars.collect do |c|
    c.valid_encoding? ? c : '?'
  end.join
end

a = "bad: \xc3\x28 okay"

puts validate_encoding(a)                  #=>bad: ?( okay
puts validate_encoding(a).valid_encoding?  #=>true


puts validate_encoding2(a)                  #=>bad: ?( okay
puts validate_encoding2(a).valid_encoding?  #=>true

Answer 4

Answer by Wayne Conrad

To check that a string has no invalid sequences, try to convert it to the binary encoding:

# Returns true if the string has only valid sequences
def valid_encoding?(string)
  string.encode('binary', :undef => :replace)
  true
rescue Encoding::InvalidByteSequenceError => e
  false
end

p valid_encoding?("\xc0".force_encoding('iso-8859-1'))    # true
p valid_encoding?("\u1111")                               # true
p valid_encoding?("\xc0".force_encoding('utf-8'))         # false

This code replaces undefined characters, because we don't care if there are valid sequences that cannot be represented in binary. We only care if there are invalid sequences.

A slight modification to this code returns the actual error, which has valuable information about the improper encoding:

# Returns the encoding error, or nil if there isn't one.

def encoding_error(string)
  string.encode('binary', :undef => :replace)
  nil
rescue Encoding::InvalidByteSequenceError => e
  e.to_s
end

# Returns truthy if the string has only valid sequences

def valid_encoding?(string)
  !encoding_error(string)
end

puts encoding_error("\xc0".force_encoding('iso-8859-1'))    # nil
puts encoding_error("\u1111")                               # nil
puts encoding_error("\xc0".force_encoding('utf-8'))         # "\xC0" on UTF-8
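
The rescued exception also exposes the offending bytes directly, which may be more convenient than parsing the message. A sketch using Encoding::InvalidByteSequenceError#error_bytes, #readagain_bytes and #source_encoding_name; the helper name invalid_sequence_details and the hash layout are illustrative assumptions, not part of the answer above:

```ruby
# Returns nil for a clean string, or a hash describing the first
# invalid sequence (hypothetical helper, same rescue as encoding_error).
def invalid_sequence_details(string)
  string.encode('binary', :undef => :replace)
  nil
rescue Encoding::InvalidByteSequenceError => e
  {
    :encoding    => e.source_encoding_name,        # encoding the bytes were tagged with
    :bad_bytes   => e.error_bytes.bytes.to_a,      # the bytes that were rejected
    :followed_by => e.readagain_bytes.bytes.to_a   # bytes read while detecting the error
  }
end

p invalid_sequence_details("bad: \xC3\x28 okay".dup.force_encoding('utf-8'))
p invalid_sequence_details("fine")
```

Only the first invalid sequence is reported, because encode stops at the first error.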

Answer 5

Answer by jj_

If you are doing this for a "real-life" use case - for example for parsing different strings entered by users, and not just for the sake of being able to "decode" a totally random file which could be made of as many encodings as you wish, then I guess you could at least assume that all characters for each string have the same encoding.

Then, in this case, what would you think about this?

strings = [ "UTF-8 string with some utf8 chars \xC3\xB2 \xC3\x93", 
             "ISO-8859-1 string with some iso-8859-1 chars \xE0 \xE8", "..." ]

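# Fallback encodings to try, in order, when the bytes are not valid UTF-8.
# (Illustrative list; add further candidates to suit your input. Note that
# ISO-8859-1 accepts any byte sequence, so it always matches.)
fallbacks = ["ISO-8859-1"]

strings.each do |s|
  s.force_encoding("UTF-8")
  next if s.valid_encoding?

  # Re-tag the bytes with the first fallback in which they are valid,
  # then transcode the string to UTF-8.
  match = fallbacks.find { |enc| s.force_encoding(enc).valid_encoding? }
  s.encode!("UTF-8") if match
end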

I am not a Ruby "pro" in any way, so please forgive if my solution is wrong or even a bit naive..

I just try to give back what I can, and this is what I've come to, while I was (I still am) working on this little parser for arbitrarily encoded strings, which I am doing for a study-project.

While I'm posting this, I must admit that I've not even fully tested it.. I.. just got a couple of "positive" results, but I felt so excited of possibly having found what I was struggling to find (and for all the time I spent reading about this on SO..) that I just felt the need to share it as quick as possible, hoping that it could help save some time to anyone who has been looking for this for as long as I've been... .. if it works as expected :)

Answer 6

Answer by Mark Reed

About the only thing I can think of is to transcode to something and back that won't damage the string in the round-trip:

string.force_encoding("UTF-8").encode("UTF-32LE").encode("UTF-8")

Seems rather wasteful, though.
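
A sketch of how the round trip could be wrapped so it either raises or replaces, depending on the options passed. The name utf8_roundtrip is hypothetical, and the `**options` syntax assumes Ruby 2.0 or later:

```ruby
# Hypothetical wrapper around the round trip above. Any String#encode
# options (for example :invalid => :replace) are applied on the first hop.
def utf8_roundtrip(input, **options)
  input.dup.force_encoding("UTF-8").encode("UTF-32LE", **options).encode("UTF-8")
end

bad = "bad: \xC3\x28 okay"

# Without options, the first hop raises on the invalid byte.
begin
  utf8_roundtrip(bad)
rescue Encoding::InvalidByteSequenceError => e
  puts "raised: #{e.message}"
end

# With :invalid => :replace, the bad byte becomes U+FFFD.
puts utf8_roundtrip(bad, :invalid => :replace)
```

The cost is two full transcodes of the string, which is the wastefulness mentioned above.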

Answer 7

Answer by jrochkind

Okay, here's a really lame pure ruby way to do it I figured out myself. It probably performs for crap. what the heck, ruby? Not selecting my own answer for now, hoping someone else will show up and give us something better.

 # Pass in a string, will raise an Encoding::InvalidByteSequenceError
 # if it contains an invalid byte for its encoding; otherwise
 # returns an equivalent string.
 #
 # OR, like String#encode, pass in option `:invalid => :replace`
 # to replace invalid bytes with a replacement string in the
 # returned string.  Pass in the
 # char you'd like with option `:replace`, or will, like String#encode
 # use the unicode replacement char if it thinks it's a unicode encoding,
 # else ascii '?'.
 #
 # in any case, method will raise, or return a new string
 # that is #valid_encoding?
 def validate_encoding(str, options = {})
   str.chars.collect do |c|
     if c.valid_encoding?
       c
     else
       unless options[:invalid] == :replace
         # it ought to be filled out with all the metadata
         # this exception usually has, but what a pain!
         raise  Encoding::InvalidByteSequenceError.new
       else
         options[:replace] || (
          # surely there's a better way to tell if
          # an encoding is a 'Unicode encoding form'
          # than this? What's wrong with you ruby 1.9?
          str.encoding.name.start_with?('UTF') ?
             "\uFFFD" :
             "?" )
       end
     end 
   end.join
 end

More ranting at http://bibwild.wordpress.com/2012/04/17/checkingfixing-bad-bytes-in-ruby-1-9-char-encoding/

Answer 8

Answer by Andreas Rayo Kniep

Here are 2 common situations and how to deal with them in Ruby 2.1+. I know, the question refers to Ruby v1.9, but maybe this is helpful for others finding this question via Google.

Situation 1

You have a UTF-8 string with possibly a few invalid bytes.
Remove the invalid bytes:

str = "Partly valid\xE4 UTF-8 encoding: äöüß"

str.scrub('')
 # => "Partly valid UTF-8 encoding: äöüß"

Situation 2

You have a string that could be in either UTF-8 or ISO-8859-1 encoding
Check which encoding it is and convert to UTF-8 (if necessary):

str = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"

unless str.valid_encoding?
  str.encode!( 'UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '?' )
end #unless
 # => "String in ISO-8859-1 encoding: äöüß"

Notes

The above code snippets assume that Ruby encodes all your strings in UTF-8 by default. Even though this is almost always the case, you can make sure of it by starting your scripts with # encoding: UTF-8.
It is programmatically possible to detect invalid byte sequences in most multi-byte encodings like UTF-8 (in Ruby, see #valid_encoding?). However, it is NOT (easily) possible to programmatically detect invalidity of single-byte encodings like ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. detecting whether a String is validly encoded ISO-8859-1.
Even though UTF-8 has become increasingly popular as the default encoding on the web, ISO-8859-1 and other Latin1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings out there that are very similar to, but slightly different from, ISO-8859-1. Examples: CP1252 (a.k.a. Windows-1252), ISO-8859-15.
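
To illustrate the last note, here is a sketch of Situation 2 with the fallback encoding made a parameter. The helper name to_utf8 and its default are assumptions for illustration; the point is that the same byte can decode differently under ISO-8859-1 and Windows-1252:

```ruby
# Hypothetical helper: return a UTF-8 string, assuming the input is either
# UTF-8 already or in the given single-byte fallback encoding.
def to_utf8(input, fallback = "ISO-8859-1")
  str = input.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?
  str.encode("UTF-8", fallback, :invalid => :replace, :undef => :replace)
end

to_utf8("\xE4\xF6\xFC\xDF")        # => "äöüß"
to_utf8("\x80", "ISO-8859-1")      # => "\u0080" (a C1 control character)
to_utf8("\x80", "Windows-1252")    # => "€"
```

Because any byte sequence is valid ISO-8859-1, the fallback never fails; a wrong guess shows up only as wrong characters in the output.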

Answer 9

Answer by Tallak Tveide

A simple way to provoke an exception seems to be:

untrusted_string.match /./
