postgresql Postgres中字符串的词频?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/5226202/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-20 22:53:05  来源:igfitidea点击:

Word frequencies from strings in Postgres?

postgresqltextnlpword-frequency

提问by Marty

Is it possible to identify distinct words and a count for each, from fields containing text strings in Postgres?

是否可以从 Postgres 中包含文本字符串的字段中识别不同的单词和每个单词的计数?

回答by a_horse_with_no_name

Something like this?

像这样的东西?

SELECT some_pk, 
       regexp_split_to_table(some_column, '\s') as word
FROM some_table

Getting the distinct words is easy then:

那么获得不同的词很容易:

SELECT DISTINCT word
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t

or getting the count for each word:

或获取每个单词的计数:

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t
GROUP BY word

回答by Charl Botha

You could also use the PostgreSQL text-searching functionality for this, for example:

您还可以为此使用 PostgreSQL 文本搜索功能,例如:

SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');

will yield:

将产生:

  word   | ndoc | nentry 
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)

(PostgreSQL applies language-dependent stemming and stop-word removal, which could be what you want, or maybe not. Stop-word removal and stemming can be disabled by using the simpleinstead of the englishdictionary, see below.)

(PostgreSQL 应用依赖于语言的词干提取和停用词去除,这可能是您想要的,也可能不是。停用词去除和词干去除可以通过使用simple而不是english字典来禁用,见下文。)

The nested SELECTstatement can be any select statement that yields a tsvector column, so you could substitute a function that applies the to_tsvectorfunction to any number of text fields, and concatenates them into a single tsvector, over any subset of your documents, for example:

嵌套SELECT语句可以是生成 tsvector 列的任何选择语句,因此您可以替换一个函数,该to_tsvector函数将该函数应用于任意数量的文本字段,并将它们连接成单个tsvector, 覆盖文档的任何子集,例如:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) from my_documents id < 500') ORDER BY nentry DESC;

Would yield a matrix of total word counts taken from the titleand bodyfields of the first 500 documents, sorted by descending number of occurrences. For each word, you'll also get the number of documents it occurs in (the ndoccolumn).

将生成从前500 个文档的titlebody字段中获取的总字数矩阵,按出现次数降序排序。对于每个单词,您还将获得它出现在(ndoc列)中的文档数。

See the documentation for more details: http://www.postgresql.org/docs/current/static/textsearch.html

有关更多详细信息,请参阅文档:http: //www.postgresql.org/docs/current/static/textsearch.html

回答by ycui

Should be split by a space ' ' or other delimit symbol between words; not by an 's', unless intended to do so, e.g., treating 'myWordshere' as 'myWord' and 'here'.

单词之间应由空格“ ”或其他分隔符分隔;不使用 's',除非有意这样做,例如,将 'myWordshere' 视为 'myWord' 和 'here'。

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, ' ') as word
  FROM some_table
) t
GROUP BY word