Detect language from string in PHP

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1441562/

Date: 2020-08-25 02:35:30  Source: igfitidea

php language-detection

Asked by Beier

In PHP, is there a way to detect the language of a string? Suppose the string is in UTF-8 format.

Accepted answer by Ólafur Waage

You cannot detect the language from the character type, and there is no foolproof way to do this.

With any method, you're just making an educated guess. There are some math-related articles available out there.

Answer by scott

I've used the Text_LanguageDetect PEAR package with some reasonable results. It's dead simple to use, and it has a modest 52-language database. The downside is that it does not detect East Asian languages.

require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();
$result = $l->detect($text, 4);
if (PEAR::isError($result)) {
    echo $result->getMessage();
} else {
    print_r($result);
}

results in:

Array
(
    [german] => 0.407037037037
    [dutch] => 0.288065843621
    [english] => 0.283333333333
    [danish] => 0.234526748971
)
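
The scores above are fairly close together, so it may be worth accepting the top guess only when it clearly beats the runner-up. A minimal sketch of that idea, assuming the array shape printed above; the function name `pickConfidentLanguage` and the 0.1 margin are hypothetical choices and not part of Text_LanguageDetect:

```php
<?php
/**
 * Hypothetical helper: returns the best-scoring language, or $default when
 * its lead over the runner-up is smaller than $minMargin.
 * $scores is assumed to look like the print_r() output above
 * (language name => similarity score).
 */
function pickConfidentLanguage(array $scores, $default = null, $minMargin = 0.1)
{
    if (count($scores) === 0) {
        return $default;
    }
    arsort($scores);                    // highest score first, keys preserved
    $languages = array_keys($scores);
    $values    = array_values($scores);
    // With a single candidate there is no runner-up to compare against.
    $second = isset($values[1]) ? $values[1] : 0.0;
    if ($values[0] - $second < $minMargin) {
        return $default;
    }
    return $languages[0];
}

$result = array(
    'german'  => 0.407037037037,
    'dutch'   => 0.288065843621,
    'english' => 0.283333333333,
    'danish'  => 0.234526748971,
);
echo pickConfidentLanguage($result, 'unknown'), "\n";       // german
echo pickConfidentLanguage($result, 'unknown', 0.2), "\n";  // unknown
```

With the sample scores, German leads Dutch by about 0.12, so it passes the default margin but not a stricter one of 0.2.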

Answer by Swiss Mister

I know this is an old post, but here is what I developed after not finding any viable solution.

  • The other suggestions are all too heavy and too cumbersome for my situation.
  • I support a finite number of languages on my website (at the moment two: 'en' and 'de', but the solution is generalised for more).
  • I need a plausible guess about the language of a user-generated string, and I have a fallback (the language setting of the user).
  • So I want a solution with minimal false positives, but I don't care so much about false negatives.

The solution takes the 20 most common words of each language and counts their occurrences in the haystack. Then it just compares the counts of the first and second most counted languages. If the runner-up's count is less than 10% of the winner's, the winner takes it all; otherwise the default is returned.

Code - Any suggestions for speed improvement are more than welcome!

    function getTextLanguage($text, $default) {
      $supported_languages = array(
          'en',
          'de',
      );
      // German word list
      // from http://wortschatz.uni-leipzig.de/Papers/top100de.txt
      $wordList['de'] = array ('der', 'die', 'und', 'in', 'den', 'von', 
          'zu', 'das', 'mit', 'sich', 'des', 'auf', 'für', 'ist', 'im', 
          'dem', 'nicht', 'ein', 'Die', 'eine');
      // English word list
      // from http://en.wikipedia.org/wiki/Most_common_words_in_English
      $wordList['en'] = array ('the', 'be', 'to', 'of', 'and', 'a', 'in', 
          'that', 'have', 'I', 'it', 'for', 'not', 'on', 'with', 'he', 
          'as', 'you', 'do', 'at');
      // French word list
      // from https://1000mostcommonwords.com/1000-most-common-french-words/
      $wordList['fr'] = array ('comme', 'que',  'tait',  'pour',  'sur',  'sont',  'avec',
                         'tre',  'un',  'ce',  'par',  'mais',  'que',  'est',
                         'il',  'eu',  'la', 'et', 'dans');

      // Spanish word list
      // from https://spanishforyourjob.com/commonwords/
      $wordList['es'] = array ('que', 'no', 'a', 'la', 'el', 'es', 'y',
                         'en', 'lo', 'un', 'por', 'qu', 'si', 'una',
                         'los', 'con', 'para', 'est', 'eso', 'las');
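      // clean out the input string: every run of characters other than ASCII
      // letters becomes two spaces, so accented words survive only as ASCII
      // fragments (hence 'tait' and 'tre' in the French list; 'für' in the
      // German list can never match as written). The padding and the doubled
      // spaces let the first, last and adjacent words be counted too.
      $text = ' ' . preg_replace("/[^A-Za-z]+/", '  ', $text) . ' ';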
      // count the occurrences of the most frequent words
      $counter = array();
      foreach ($supported_languages as $language) {
        $counter[$language] = 0;
        // loop over the list itself, so a list need not hold exactly 20 words
        foreach ($wordList[$language] as $word) {
          // I believe this is way faster than fancy RegEx solutions
          $counter[$language] += substr_count($text, ' ' . $word . ' ');
        }
      }
      // get max counter value
      // from http://stackoverflow.com/a/1461363
      $max = max($counter);
      $maxs = array_keys($counter, $max);
      // if there are two winners - fall back to default!
      if (count($maxs) == 1) {
        $winner = $maxs[0];
        $second = 0;
        // get runner-up (second place)
        foreach ($supported_languages as $language) {
          if ($language <> $winner) {
            if ($counter[$language]>$second) {
              $second = $counter[$language];
            }
          }
        }
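        // apply arbitrary threshold of 10%; the $max > 0 check avoids a
        // division by zero when only one language is supported and nothing matched
        if ($max > 0 && ($second / $max) < 0.1) {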
          return $winner;
        } 
      }
      return $default;
    }
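
On the request for speed suggestions: one option is to tokenise the text once and look each token up in a table of counts, instead of scanning the whole string once per word. The following is only a sketch under assumptions: the function name `guessLanguageByWords`, the lowercasing, and the shortened word lists are illustrative and are not the author's code. It keeps the same 10% rule, and whether it is actually faster would need measuring.

```php
<?php
/**
 * Hypothetical variant of the word-count approach: split the text once,
 * then sum the counts of the tokens listed for each language.
 * $wordLists maps a language code to a list of lowercase common words.
 */
function guessLanguageByWords($text, array $wordLists, $default, $threshold = 0.1)
{
    // strtolower() folds ASCII letters only; uppercase accented letters stay as they are.
    // Split on anything that is not a Unicode letter; the 'u' flag enables UTF-8 mode.
    $tokens = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false || count($tokens) === 0) {
        return $default;                    // invalid UTF-8 or no words at all
    }
    $tokenCounts = array_count_values($tokens);

    $counter = array();
    foreach ($wordLists as $language => $words) {
        $counter[$language] = 0;
        // array_unique() so a word listed twice is not counted twice.
        foreach (array_unique($words) as $word) {
            if (isset($tokenCounts[$word])) {
                $counter[$language] += $tokenCounts[$word];
            }
        }
    }

    arsort($counter);                       // best language first
    $languages = array_keys($counter);
    $counts    = array_values($counter);
    if (count($counts) === 0 || $counts[0] === 0) {
        return $default;                    // no language, or nothing matched
    }
    $second = isset($counts[1]) ? $counts[1] : 0;
    // Same rule as above: the runner-up must stay below 10% of the winner.
    return ($second / $counts[0]) < $threshold ? $languages[0] : $default;
}

// Shortened, illustrative word lists (not the full top-20 lists).
$wordLists = array(
    'de' => array('der', 'die', 'und', 'in', 'den', 'von', 'zu', 'das', 'mit', 'für', 'ist', 'nicht'),
    'en' => array('the', 'be', 'to', 'of', 'and', 'a', 'that', 'have', 'it', 'for', 'not', 'with'),
);
echo guessLanguageByWords('The cat sat on the mat and looked at the dog.', $wordLists, 'xx'), "\n";          // en
echo guessLanguageByWords('Der Hund ist nicht mit der Katze in das Haus gegangen.', $wordLists, 'xx'), "\n"; // de
```

Because the text is split on Unicode letters, words such as 'für' can be listed with their accents in this variant.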

Answer by Esteban Küber

You could do this entirely client side with Google's AJAX Language API (now defunct).

With the AJAX Language API, you can translate and detect the language of blocks of text within a webpage using only Javascript. In addition, you can enable transliteration on any textfield or textarea in your web page. For example, if you were transliterating to Hindi, this API will allow users to phonetically spell out Hindi words using English and have them appear in the Hindi script.

You can automatically detect a string's language:

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    var language = 'unknown';
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language + "";
  }
});

And translate any string written in one of the supported languages (also defunct):

google.language.translate("Hello world", "en", "es", function(result) {
  if (!result.error) {
    var container = document.getElementById("translation");
    container.innerHTML = result.translation;
  }
});

Answer by Laurynas

As the Google Translate API is closing down as a free service, you can try this free alternative, which is a replacement for the Google Translate API:

http://detectlanguage.com

Answer by Muzikant

I tried the Text_LanguageDetect library and the results I got were not very good (for instance, the text "test" was identified as Estonian and not English).

I can recommend you try the Yandex Translate API, which is FREE for 1 million characters per 24 hours and up to 10 million characters a month. It supports (according to the documentation) over 60 languages.

<?php
function identifyLanguage($text)
{
    $baseUrl = "https://translate.yandex.net/api/v1.5/tr.json/detect?key=YOUR_API_KEY";
    $url = $baseUrl . "&text=" . urlencode($text);

    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_CAINFO, YOUR_CERT_PEM_FILE_LOCATION);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $output = curl_exec($ch);
    if ($output)
    {
        $outputJson = json_decode($output);
        if ($outputJson->code == 200)
        {
            if (strlen($outputJson->lang) > 0)
            {
                return $outputJson->lang;
            }
        }
    }

    return "unknown";
}

function translateText($text, $targetLang)
{
    $baseUrl = "https://translate.yandex.net/api/v1.5/tr.json/translate?key=YOUR_API_KEY";
    $url = $baseUrl . "&text=" . urlencode($text) . "&lang=" . urlencode($targetLang);

    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_CAINFO, YOUR_CERT_PEM_FILE_LOCATION);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $output = curl_exec($ch);
    if ($output)
    {
        $outputJson = json_decode($output);
        if ($outputJson->code == 200)
        {
            if (count($outputJson->text) > 0 && strlen($outputJson->text[0]) > 0)
            {
                return $outputJson->text[0];
            }
        }
    }

    return $text;
}

header("content-type: text/html; charset=UTF-8");

echo identifyLanguage("エクスペリエンス");
echo "<br>";
echo translateText("エクスペリエンス", "en");
echo "<br>";
echo translateText("エクスペリエンス", "es");
echo "<br>";
echo translateText("エクスペリエンス", "zh");
echo "<br>";
echo translateText("エクスペリエンス", "he");
echo "<br>";
echo translateText("エクスペリエンス", "ja");
echo "<br>";
?>

Answer by Robert Sinclair

The Text_LanguageDetect PEAR package produced terrible results: "luxury apartments downtown" is detected as Portuguese...

The Google API is still the best solution; they give $300 of free credit and warn you before charging anything.

Below is a super simple function that uses file_get_contents to download the lang detected by the API, so no need to download or install libraries etc.

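function guess_lang($str) {

    // rawurlencode() escapes every reserved character, not just spaces
    $str = rawurlencode($str);

    $content = file_get_contents("https://translation.googleapis.com/language/translate/v2/detect?key=YOUR_API_KEY&q=".$str);

    // file_get_contents() returns false when the request fails
    if ($content === false)
        return null;

    $lang = json_decode($content, true);

    // only read the language if the response has the expected shape
    if (isset($lang["data"]["detections"][0][0]["language"]))
        return $lang["data"]["detections"][0][0]["language"];

    return null;
}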

Execute:

echo guess_lang("luxury apartments downtown montreal"); // returns "en"

You can get your Google Translate API key here: https://console.cloud.google.com/apis/library/translate.googleapis.com/

This is a simple example for short phrases to get you going. For more complex applications you'll obviously want to restrict your API key and use the client library.
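
If you stay with plain HTTP calls for a while, it may help to keep the response parsing in its own function so it can be checked without a network request. A rough sketch, not Google's client library: the name `parse_detect_response`, the `confidence` field, and the sample body are assumptions made for illustration, while the `data.detections[0][0].language` path is the one used in the function above.

```php
<?php
/**
 * Hypothetical helper: extracts the detected language from a detect
 * response body. Returns null when the body is not the expected JSON or
 * when the (assumed) confidence value is below $minConfidence.
 */
function parse_detect_response($body, $minConfidence = 0.0)
{
    if (!is_string($body)) {
        return null;                        // e.g. file_get_contents() returned false
    }
    $json = json_decode($body, true);
    if (!isset($json['data']['detections'][0][0]['language'])) {
        return null;                        // error payload or unexpected shape
    }
    $detection = $json['data']['detections'][0][0];
    // Treat a missing confidence as 1.0 so the language is still returned.
    $confidence = isset($detection['confidence']) ? (float) $detection['confidence'] : 1.0;
    return $confidence >= $minConfidence ? $detection['language'] : null;
}

// Illustrative response body, written by hand for this example.
$sample = '{"data":{"detections":[[{"language":"en","confidence":0.92}]]}}';
var_dump(parse_detect_response($sample));          // string(2) "en"
var_dump(parse_detect_response($sample, 0.95));    // NULL
var_dump(parse_detect_response('{"error":{}}'));   // NULL
var_dump(parse_detect_response(false));            // NULL
```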

Answer by strager

You can probably use the Google Translate API to detect the language and translate it if necessary.

Answer by adiian

You can see how to detect the language of a string in PHP using the Text_LanguageDetect PEAR package, or download it and use it separately like a regular PHP library.

Answer by Andy

Perhaps submit the string to this language guesser:

http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser
