How do I replace accented Latin characters in Ruby?

Question

Asked by James A. Rosen

I have an ActiveRecord model, Foo, which has a name field. I'd like users to be able to search by name, but I'd like the search to ignore case and any accents. Thus, I'm also storing a canonical_name field against which to search:

class Foo < ActiveRecord::Base
  validates_presence_of :name

  before_validation :set_canonical_name

  private

  def set_canonical_name
    self.canonical_name ||= canonicalize(self.name) if self.name
  end

  def canonicalize(x)
    x.downcase.  # something here
  end
end

I need to fill in the "something here" to replace the accented characters. Is there anything better than

x.downcase.gsub(/[àáâãäå]/,'a').gsub(/æ/,'ae').gsub(/ç/, 'c').gsub(/[èéêë]/,'e')....

And, for that matter, since I'm not on Ruby 1.9, I can't put those Unicode literals in my code. The actual regular expressions will look much uglier.

Answer 1

Accepted answer by unexist

Rails already has a built-in for normalization: normalize your string to form KD, then remove the remaining non-ASCII characters (i.e. the accent marks), like this:

>> "àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"

Answer 2

Answered by Mark Wilden

ActiveSupport::Inflector.transliterate (requires Rails 2.2.1+ and Ruby 1.9 or 1.8.7)

Example:

>> ActiveSupport::Inflector.transliterate("àáâãäå").to_s
=> "aaaaaa"

Answer 3

Answered by Diego Moreira

Better yet is to use I18n:

1.9.3-p392 :001 > require "i18n"
 => false
1.9.3-p392 :002 > I18n.transliterate("Olá Mundo!")
 => "Ola Mundo!"

Answer 4

Answered by fguillen

I have tried a lot of these approaches, but each failed one or more of these requirements:

Respect spaces
Respect the 'ñ' character
Respect case (I know this is not a requirement of the original question, but it is not difficult to convert a string to lowercase)

The one that met them all was this:

# coding: utf-8
string.tr(
  "ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
  "AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz"
)

– http://blog.slashpoundbang.com/post/12938588984/remove-all-accents-and-diacritics-from-string-in-ruby

You have to modify the character list a little to respect the 'ñ' character, but that is an easy job.
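As a rough illustration of that modification, here is a minimal sketch of the same String#tr idea with a much shorter table that simply leaves Ñ and ñ out, so they pass through unchanged. The names ACCENTED, PLAIN and strip_accents are made up for this example, and the table is deliberately incomplete. It assumes a UTF-8 source file (the default since Ruby 2.0).

```ruby
# Hypothetical, shortened mapping tables. Both strings must stay the same
# length and in the same order, because tr maps character N of the first
# string to character N of the second.
ACCENTED = "ÀÁÂÃÄàáâãäÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖòóôõöÙÚÛÜùúûü"
PLAIN    = "AAAAAaaaaaEEEEeeeeIIIIiiiiOOOOOoooooUUUUuuuu"

# Replaces the listed accented vowels and leaves everything else alone,
# including spaces, case, and the characters Ñ and ñ (not in the table).
def strip_accents(string)
  string.tr(ACCENTED, PLAIN)
end

puts strip_accents("El Niño comió ñandú")  # prints "El Niño comio ñandu"
```

Because tr maps one character to exactly one character, this approach cannot expand a ligature such as "æ" into "ae"; that needs a separate gsub.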

Answer 5

Answered by Dorian

My answer: the String#parameterize method:

"Le cœur de la crémiére".parameterize
=> "le-coeur-de-la-cremiere"

For non-Rails programs:

Install activesupport with gem install activesupport, then:

require 'active_support/inflector'
"a&]'s--34\xC2àáa?3D".parameterize
# => "a-s-3-3d"

Answer 6

Answered by Cheng

Decompose the string and remove non-spacing marks from it.

irb -ractive_support/all
> "àáâãäå".mb_chars.normalize(:kd).gsub(/\p{Mn}/, '')
aaaaaa

You may also need this if used in a .rb file.

# coding: utf-8

The normalize(:kd) part here splits out diacriticals where possible (e.g. the single character "n with tilde" is split into an n followed by a combining tilde character), and the gsub part then removes all the diacritical characters.
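To make the decomposition step visible, here is a small sketch that inspects the code points before and after stripping. It uses Ruby's built-in String#unicode_normalize (available since Ruby 2.2) rather than ActiveSupport's mb_chars, which is a substitution made for this example; the helper names are illustrative only.

```ruby
# Returns the code points of the compatibility-decomposed (NFKD) string.
def decomposed_codepoints(text)
  text.unicode_normalize(:nfkd).codepoints
end

# Decomposes, then removes every non-spacing mark (Unicode category Mn).
def remove_marks(text)
  text.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "")
end

p decomposed_codepoints("ñ")  # => [110, 771]  ("n" plus U+0303 COMBINING TILDE)
p remove_marks("ñ")           # => "n"
```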

Answer 7

Answered by Jonke

I think that you maybe don't really want to go down that path. If you are developing for a market that has these kinds of letters, your users will probably think you are a sort of ...pip. Because 'å' isn't even close to 'a' in any meaning to a user. Take a different road and read up about searching in a non-ASCII way. This is just one of those cases for which someone invented Unicode and collation.

A very late PS:

http://www.w3.org/International/wiki/Case_folding http://www.w3.org/TR/charmod-norm/#sec-WhyNormalization

Besides that, I have no idea why the link to collation goes to an MSDN page, but I leave it there. It should have been http://www.unicode.org/reports/tr10/

Answer 8

Answered by Sudhir Jonathan

This assumes you use Rails.

"anything".parameterize.underscore.humanize.downcase

Given your requirements, this is probably what I'd do... I think it's neat, simple and will stay up to date in future versions of Rails and Ruby.

Update: dgilperez pointed out that parameterize takes a separator argument, so "anything".parameterize(" ") (deprecated) or "anything".parameterize(separator: " ") is shorter and cleaner.

Answer 9

Answered by CesarB

Convert the text to normalization form D, remove all code points with the Unicode category non-spacing mark (Mn), and convert it back to normalization form C. This will strip all diacritics, and your problem is reduced to a case-insensitive search.

See http://www.siao2.com/2005/02/19/376617.aspx and http://www.siao2.com/2007/05/14/2629747.aspx for details.
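The three steps described above might look like the following in plain Ruby. This is only a sketch, assuming Ruby 2.4 or later, where String#unicode_normalize and String#casecmp? are built in; the method names strip_diacritics and same_ignoring_accents_and_case? are invented for this example.

```ruby
# NFD -> drop non-spacing marks (Mn) -> NFC, as described in the answer.
def strip_diacritics(text)
  text.unicode_normalize(:nfd)   # decompose: "é" becomes "e" + U+0301
      .gsub(/\p{Mn}/, "")        # remove the combining marks
      .unicode_normalize(:nfc)   # recompose whatever is left
end

# With diacritics gone, the comparison reduces to a case-insensitive one.
def same_ignoring_accents_and_case?(left, right)
  strip_diacritics(left).casecmp?(strip_diacritics(right))
end

puts strip_diacritics("Visual Café")                                   # prints "Visual Cafe"
puts same_ignoring_accents_and_case?("VISUAL CAFE", "Visual Café")     # prints "true"
```

Characters that have no canonical decomposition, such as "æ" or "ø", are left untouched by this approach.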

Answer 10

Answered by James A. Rosen

The key is to use two columns in your database: canonical_text and original_text. Use original_text for display and canonical_text for searches. That way, if a user searches for "Visual Cafe," she sees the "Visual Café" result. If she really wants a different item called "Visual Cafe," it can be saved separately.

To get the canonical_text characters in a Ruby 1.8 source file, do something like this:

register_replacement([0x008A].pack('U'), 'S')
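
The two-column idea can be sketched without a database. The example below is an illustration only: it uses an in-memory Struct in place of an ActiveRecord model, and canonicalizes with Ruby's built-in String#unicode_normalize (Ruby 2.2+) rather than a replacement table like the one registered above. Item, build_item, canonicalize and search are hypothetical names.

```ruby
# Stand-in for a database row with the two columns from the answer.
Item = Struct.new(:original_text, :canonical_text)

# Assumed canonical form: diacritics stripped, then lowercased.
def canonicalize(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

# Stores the text as entered plus its canonical form.
def build_item(text)
  Item.new(text, canonicalize(text))
end

# Matches on canonical_text; callers display original_text.
def search(items, query)
  wanted = canonicalize(query)
  items.select { |item| item.canonical_text == wanted }
end

items = [build_item("Visual Café"), build_item("Other Thing")]
p search(items, "visual cafe").map(&:original_text)  # => ["Visual Café"]
```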
