javascript: Extract relevant tags/keywords from a block of text

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/4828154/

Time: 2020-10-25 15:00:10  Source: igfitidea

Extract Relevant Tag/Keywords from Text block

php javascript tags stop-words

Asked by sgomez

I want an implementation in which the user provides a block of text like:

"Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable."

What I want to do is automatically select the relevant keywords and create tags from them. For the above piece of text, the relevant tags should be: mysql, php, json, jquery, version control, oop, web2.0, javascript

How can I go about doing this in PHP, JavaScript, etc.? A head start would be really helpful.

Answered by Darren Newton

A very naive method is to remove common stopwords from the text, leaving you with more meaningful words like 'Standards', 'JSON', etc. You will still get a lot of noise, however, so you may consider a service like OpenCalais, which can do a rather sophisticated analysis of your text.

Update:

Okay, the link in my previous answer pointed to implementations, but you asked for one so a simple one is here:

function stopWords($text, $stopwords) {
  // Trim line breaks and spaces from the stop words and lowercase them
  $stopwords = array_map(function($x){return trim(strtolower($x));}, $stopwords);

  // Replace all digits and non-word chars with a comma
  $pattern = '/[0-9\W]/';
  $text = preg_replace($pattern, ',', $text);

  // Create an array from $text
  $text_array = explode(",", $text);

  // Remove whitespace and lowercase the words in $text
  $text_array = array_map(function($x){return trim(strtolower($x));}, $text_array);

  // Collect every term that is not a stop word
  $keywords = array();
  foreach ($text_array as $term) {
    if (!in_array($term, $stopwords)) {
      $keywords[] = $term;
    }
  }

  // Drop empty strings; the original array keys are preserved
  return array_filter($keywords);
}

$stopwords = file('stop_words.txt');
$text = "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable.";

print_r(stopWords($text, $stopwords));

You can see this, and the contents of stop_words.txt, in this Gist.

Running the above on your example text produces the following array:

Array
(
    [0] => requirements
    [4] => linux
    [6] => apache
    [10] => mysql
    [13] => php
    [25] => json
    [28] => frameworks
    [30] => zend
    [34] => browser
    [35] => javascripting
    [37] => jquery
    [38] => etc
    [42] => software
    [43] => preferable
)

So, like I said, this is somewhat naive and could use more optimization (plus it's slow), but it does pull out the more relevant keywords from your text. You would need to do some fine tuning on the stop words as well. Capturing terms like Web 2.0 will be very difficult, so again I think you would be better off using a serious service like OpenCalais, which can understand a text and return a list of entities and references. DocumentCloud relies on this very service to gather information from documents.

Also, for a client-side implementation you could do pretty much the same thing with JavaScript, and probably much more cleanly (although it could be slow for the client).
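A minimal JavaScript sketch of the same stop-word approach might look like the following. It assumes the stop words are already loaded into an array; the short `STOP_WORDS` list and the `extractKeywords` name are illustrative and not taken from the Gist. A `Set` is used for the lookups so each term is checked without scanning the whole stop-word list, and duplicate terms are dropped.

```javascript
// Illustrative stop-word list (an assumption); a real list would be much longer.
const STOP_WORDS = ['a', 'and', 'as', 'be', 'of', 'on', 'such', 'the', 'using', 'will', 'with'];

function extractKeywords(text, stopWords) {
  // Normalize the stop words and store them in a Set for fast lookups.
  const stopSet = new Set(stopWords.map((w) => w.trim().toLowerCase()));

  // Split on digits and non-word characters, mirroring the PHP pattern /[0-9\W]/.
  const terms = text.toLowerCase().split(/[0-9\W]+/);

  // Keep non-empty terms that are not stop words, without duplicates.
  const keywords = new Set();
  for (const term of terms) {
    if (term !== '' && !stopSet.has(term)) {
      keywords.add(term);
    }
  }
  return Array.from(keywords);
}

const sample = 'Knowledge of Version Control Software such as sub-version will be preferable.';
console.log(extractKeywords(sample, STOP_WORDS));
```

As with the PHP version, multi-word terms such as "version control" are split into separate words, so the result is only as good as the stop-word list.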

Answered by user2412642

I did a quick review of these this morning and, to my surprise, the one that performed best with my test phrase was written in PHP.

What looked like the most professional one performed abysmally: viewer.opencalais.com

Others that were OK (not sure what language they're written in):

  • www.nactem.ac.uk/software/termine/#form
  • www.alchemyapi.com/api/keyword/

Answered by Dmitri

This is not easy to do because it requires some type of fuzzy logic. You should use the Yahoo Term Extractor via YQL.

Check it out: link

Answered by Raynos

It depends on whether you only want to show the keywords/tags to the client, or whether you want to extract the keywords/tags from the block of text and then do further computation with them.

If you only need to show them then clientside handling is fine. If you need them for further computation then use serverside handling for it.

I can recommend a JavaScript client-side implementation if you can supply some more details. If you want to generically "know" the keywords, then some kind of clever solution is necessary.

If you have a list of keywords, then you can use regular expressions to extract the data.
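As a rough sketch of the regular-expression approach, assuming you maintain your own keyword list: the `KNOWN_KEYWORDS` list and the `findKnownKeywords` and `escapeRegExp` helpers below are hypothetical names for illustration. Each keyword is escaped and matched case-insensitively at the start of a word. This also handles multi-word terms such as "version control" or "Web 2.0", which splitting on single words cannot capture.

```javascript
// Hypothetical keyword list (an assumption); replace it with your own tags.
const KNOWN_KEYWORDS = ['mysql', 'php', 'json', 'jquery', 'version control', 'web 2.0', 'javascript', 'oop', 'python'];

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function findKnownKeywords(text, keywords) {
  return keywords.filter((keyword) => {
    // Allow any run of whitespace between the words of a multi-word keyword.
    const source = keyword.trim().split(/\s+/).map(escapeRegExp).join('\\s+');
    // Require a word start before the keyword but no boundary after it,
    // so "javascript" also matches "Javascripting" and "oop" matches "OOPs".
    const pattern = new RegExp('(^|[^A-Za-z0-9])' + source, 'i');
    return pattern.test(text);
  });
}

const jobText = 'Comfortable with JSON, Web 2.0 Standards, OOPs, Cross Browser Javascripting and Version Control Software.';
console.log(findKnownKeywords(jobText, KNOWN_KEYWORDS));
```

Matching only on the start of a word is a deliberately loose choice: it catches inflected forms, but it also means a short keyword like "php" would match any word that merely begins with it, so a stricter trailing boundary may suit some keyword lists better.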
