Original question: http://stackoverflow.com/questions/61672829/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
How does Chrome decide what to highlight when you double-click Japanese text?
Asked by polm23
If you double-click English text in Chrome, the whitespace-delimited word you clicked on is highlighted. This is not surprising. However, the other day I was clicking while reading some text in Japanese and noticed that some words were highlighted at word boundaries, even though Japanese doesn't have spaces. Here's some example text:
どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。
For example, if you click on 薄暗い, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.
How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem to be active in my version of Chrome.
Answered by polm23
So it turns out V8 has a non-standard multi-language word segmenter, and it handles Japanese.
    // Intl.v8BreakIterator is non-standard and only exists in V8-based runtimes.
    function tokenizeJA(text) {
      var it = new Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
      it.adoptText(text)
      var words = []
      var cur = 0, prev = 0
      // Each call to next() returns the offset of the next word boundary.
      while (cur < text.length) {
        prev = cur
        cur = it.next()
        words.push(text.substring(prev, cur))
      }
      return words
    }

    console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
    // ["どこ", "で", "生れ", "たか", "とんと", "見当", "が", "つ", "か", "ぬ", "。", "何でも", "薄暗い", "じめじめ", "した", "所", "で", "ニャーニャー", "泣", "い", "て", "いた事", "だけ", "は", "記憶", "し", "て", "いる", "。"]
I also made a jsfiddle that shows this.
The quality is not amazing but I'm surprised this is supported at all.
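For comparison, here is a minimal sketch of the same idea using the standardized Intl.Segmenter API rather than the non-standard iterator above. It assumes a runtime that ships Intl.Segmenter with ICU data for Japanese (recent Chrome and Node.js versions do); the function name segmentWords is made up for illustration, and the exact segments may differ between ICU versions.

```javascript
// Sketch: word segmentation with the standard Intl.Segmenter API.
// Assumption: the runtime provides Intl.Segmenter with Japanese ICU data.
function segmentWords(text, locale = 'ja-JP') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // Each item yielded by segment() looks like { segment, index, input, isWordLike }.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const sample = 'どこで生れたかとんと見当がつかぬ。';
const words = segmentWords(sample);
console.log(words);
```

The segments always concatenate back to the input, so this can be used to map a click offset to the surrounding word.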
Answered by erjiang
Based on links posted by JonathonW, the answer basically boils down to: "There's a big list of Japanese words and Chrome checks to see if you double-clicked in a word."
Specifically, V8 uses ICU to do a bunch of Unicode-related text processing, including breaking text up into words. The ICU boundary-detection code includes a "Dictionary-Based BreakIterator" for languages that don't have spaces, including Japanese, Chinese, Thai, etc.
And for your specific example of "薄暗い", you can find that word in the combined Chinese-Japanese dictionary shipped by ICU (line 255431). There are currently 315,671 total Chinese/Japanese words in the list. Presumably, if you find a word that Chrome doesn't split properly, you could send ICU a patch to add that word.
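To make the "big list of words" idea concrete, here is a rough sketch of dictionary-based segmentation using greedy longest-match. This is not ICU's actual algorithm, which the answer does not describe; the tiny dictionary and the function names segmentWithDictionary and wordAt are hypothetical and exist only for illustration.

```javascript
// Hypothetical toy dictionary; ICU's real list has hundreds of thousands of entries.
const TOY_DICTIONARY = new Set(['何でも', '薄暗い', 'じめじめ', 'した']);

// Greedy longest-match: at each position take the longest dictionary word,
// falling back to a single character when nothing matches.
function segmentWithDictionary(text, dictionary = TOY_DICTIONARY) {
  const maxLen = Math.max(1, ...Array.from(dictionary, (w) => w.length));
  const words = [];
  let pos = 0;
  while (pos < text.length) {
    let end = pos + 1; // fallback: one character
    const limit = Math.min(text.length, pos + maxLen);
    for (let candidate = limit; candidate > pos + 1; candidate--) {
      if (dictionary.has(text.slice(pos, candidate))) {
        end = candidate;
        break;
      }
    }
    words.push(text.slice(pos, end));
    pos = end;
  }
  return words;
}

// Return the word that contains the given character offset,
// roughly what a double-click at that offset would select.
function wordAt(text, offset, dictionary = TOY_DICTIONARY) {
  let start = 0;
  for (const word of segmentWithDictionary(text, dictionary)) {
    if (offset < start + word.length) return word;
    start += word.length;
  }
  return '';
}

const text = '何でも薄暗いじめじめした所で';
console.log(segmentWithDictionary(text));
// → ["何でも", "薄暗い", "じめじめ", "した", "所", "で"]
console.log(wordAt(text, 4)); // → "薄暗い"
```

A greedy matcher like this makes mistakes whenever a shorter word followed by another word would be the better reading, which is one plausible reason a real segmenter needs more than a plain lookup.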

