Java: How to detect the language of user-entered text?
Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/3227524/
How to detect language of user entered text?
Asked by ManBugra
I am dealing with an application that accepts user input in different languages (currently a fixed set of 3 languages). The requirement is that users can enter text without having to select the language via a checkbox provided in the UI.
Is there an existing Java library to detect the language of a text?
I want something like this:
text = "To be or not to be thats the question."
// returns ISO 639 Alpha-2 code
language = detect(text);
print(language);
result:
EN
I don't want to know how to create a language detector myself (I have seen plenty of blogs trying to do that). The library should provide a simple API and also work completely offline. Open-source or commercial closed-source doesn't matter.
I also found these questions on SO (and a few more):
Accepted answer by Jay Askren
Answer by Carl Smotricz
Google offers an API that can do this for you. I just stumbled across this yesterday and didn't keep a link, but if you, umm, Google for it you should manage to find it.
This was somewhere near the description of their translation API, which will translate text for you into any language you like. There's another call just for guessing the input language.
Google is among the world's leaders in machine translation; they base their stuff on extremely large corpora of text (most of the Internet, kinda) and a statistical approach that usually "gets" it right simply by virtue of having a huge sample space.
EDIT: Here's the link: http://code.google.com/apis/ajaxlanguage/
EDIT 2: If you insist on "offline": a well-upvoted answer suggested Guess-Language. It's a C++ library and handles about 60 languages.
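For a rough feel of what an offline, statistical detector does, here is a minimal toy sketch based on character trigrams, which is one common technique. It is not how Google's API or any library named in this thread is actually implemented, and the class name TrigramLanguageGuesser, its methods, and the tiny training sentences are all made up for illustration. Real libraries train on far larger samples, which is why they are much more reliable.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy class: guesses a language by counting shared character trigrams.
class TrigramLanguageGuesser {
    // Language code -> set of trigrams seen in that language's sample text.
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    // Lowercases the text, collapses every run of non-letters into one space,
    // and returns the distinct 3-character substrings.
    static Set<String> trigrams(String text) {
        String normalized = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ")
                .trim();
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            result.add(normalized.substring(i, i + 3));
        }
        return result;
    }

    // Adds the trigrams of a sample text to the profile of the given language.
    void train(String languageCode, String sampleText) {
        profiles.computeIfAbsent(languageCode, k -> new HashSet<>())
                .addAll(trigrams(sampleText));
    }

    // Returns the language whose profile shares the most distinct trigrams
    // with the text, or "unknown" if no profile shares any.
    String detect(String text) {
        Set<String> grams = trigrams(text);
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> profile : profiles.entrySet()) {
            Set<String> common = new HashSet<>(grams);
            common.retainAll(profile.getValue());
            if (common.size() > bestScore) {
                bestScore = common.size();
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TrigramLanguageGuesser guesser = new TrigramLanguageGuesser();
        // Tiny made-up samples; far too small for real use.
        guesser.train("en", "the quick brown fox jumps over the lazy dog and the cat that is the question");
        guesser.train("de", "der schnelle braune fuchs springt über den faulen hund und die katze das ist die frage");
        System.out.println(guesser.detect("To be or not to be, that is the question."));
    }
}
```

With samples this small the guess is only meaningful for text that resembles the training sentences; the point is the shape of the technique, not its accuracy.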
Answer by Manny
An alternative is JLangDetect, but it's not very robust and has a limited language base. The good thing is that it's under an Apache license; if it satisfies your requirements, you can use it. Version 0.2 has been released here.
In version 0.4 it is very robust. I have been using this in many projects of my own and never had any major problems. Also, when it comes to speed, it is comparable to very specialized language detectors (e.g., ones that support only a few languages).
Answer by Omar Jaafor
Here is another option: Language Detection Library for Java
It is a library written in Java.
Answer by Laurynas
Detect Language API also provides a Java client.
Example:
List<Result> results = DetectLanguage.detect("Hello world");
Result result = results.get(0);
System.out.println("Language: " + result.language);
System.out.println("Is reliable: " + result.reliable);
System.out.println("Confidence: " + result.confidence);
Answer by yvespeirsman
This Language Detection Library for Java should give more than 99% accuracy for 53 languages.
Alternatively, there is Apache Tika, a library for content analysis that offers much more than just language detection.
Answer by Anand Kumar
Here is working code that uses the already available solution from Cybozu Labs:
package com.et.generate;

import java.util.ArrayList;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

public class LanguageCodeDetection {

    // Loads the language profiles from the given directory; call this once before detecting.
    public void init(String profileDirectory) throws LangDetectException {
        DetectorFactory.loadProfile(profileDirectory);
    }

    // Returns the code of the most probable language for the text.
    public String detect(String text) throws LangDetectException {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        return detector.detect();
    }

    // Returns the candidate languages together with their probabilities.
    public ArrayList<Language> detectLangs(String text) throws LangDetectException {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        return detector.getProbabilities();
    }

    public static void main(String[] args) {
        try {
            LanguageCodeDetection ld = new LanguageCodeDetection();
            // Directory containing the unpacked language profiles.
            String profileDirectory = "C:/profiles/";
            ld.init(profileDirectory);
            String text = "Кремль россий";
            System.out.println(ld.detectLangs(text));
            System.out.println(ld.detect(text));
        } catch (LangDetectException e) {
            e.printStackTrace();
        }
    }
}
Output:
[ru:0.9999983255911719]
ru
Profiles can be downloaded from: https://language-detection.googlecode.com/files/langdetect-09-13-2011.zip
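Since the question fixes the application to three languages and wants an upper-case code such as EN, detector output like [ru:0.9999983255911719] could be narrowed down afterwards. The following is only a sketch: SupportedLanguagePicker, pick, and the minProbability threshold are hypothetical names and not part of any library mentioned here. It assumes the detector's candidates have already been copied into a plain map of language code to probability.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: narrows detector candidates to the languages an application supports.
class SupportedLanguagePicker {

    // Returns the upper-case code of the most probable supported language,
    // or an empty Optional if no supported language reaches minProbability.
    static Optional<String> pick(Map<String, Double> probabilities,
                                 Set<String> supported,
                                 double minProbability) {
        return probabilities.entrySet().stream()
                .filter(e -> supported.contains(e.getKey()))
                .filter(e -> e.getValue() >= minProbability)
                .max(Map.Entry.comparingByValue())
                .map(e -> e.getKey().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // The three supported languages here are an assumption for the example.
        Set<String> supported = Set.of("en", "de", "ru");
        Map<String, Double> candidates = Map.of("ru", 0.9999983255911719);
        System.out.println(pick(candidates, supported, 0.5).orElse("unknown"));
    }
}
```

Returning an empty Optional rather than a default language leaves it to the caller to decide what to do with text that matches none of the supported languages, for example by asking the user.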