postgresql: Word frequencies from strings in Postgres?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/5226202/
Word frequencies from strings in Postgres?
Asked by Marty
Is it possible to identify distinct words and a count for each, from fields containing text strings in Postgres?
Answered by a_horse_with_no_name
Something like this?
SELECT some_pk, regexp_split_to_table(some_column, '\s') as word FROM some_table
Getting the distinct words is easy then:
SELECT DISTINCT word FROM ( SELECT regexp_split_to_table(some_column, '\s') as word FROM some_table ) t
or getting the count for each word:
SELECT word, count(*) FROM ( SELECT regexp_split_to_table(some_column, '\s') as word FROM some_table ) t GROUP BY word
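A slightly more robust variant (a sketch, reusing the placeholder some_table and some_column, and assuming standard_conforming_strings is on so the pattern reaches the regex engine as \s+) that splits on runs of whitespace, lower-cases the words, skips empty tokens, and sorts by frequency:

SELECT lower(word) AS word, count(*) AS cnt
FROM (
    SELECT regexp_split_to_table(some_column, '\s+') AS word
    FROM some_table
) t
WHERE word <> ''                 -- skip empty tokens produced by leading/trailing whitespace
GROUP BY lower(word)
ORDER BY cnt DESC;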
Answered by Charl Botha
You could also use the PostgreSQL text-searching functionality for this, for example:
SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');
will yield:
  word   | ndoc | nentry
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)
(PostgreSQL applies language-dependent stemming and stop-word removal, which could be what you want, or maybe not. Stop-word removal and stemming can be disabled by using the simple dictionary instead of the english one; see below.)
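For instance, a minimal sketch of the same statistics without stemming or stop-word removal, using the simple configuration on the same sample string:

SELECT * FROM ts_stat('SELECT to_tsvector(''simple'', ''hello dere hello hello ridiculous'')');

Here the words are counted exactly as written, so ridiculous stays ridiculous instead of being stemmed to ridicul, and stop words would be kept as well.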
The nested SELECT statement can be any select statement that yields a tsvector column, so you could substitute a function that applies to_tsvector to any number of text fields and concatenates them into a single tsvector, over any subset of your documents, for example:
SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) FROM my_documents WHERE id < 500') ORDER BY nentry DESC;
This would yield a matrix of total word counts taken from the title and body fields of the first 500 documents, sorted by descending number of occurrences. For each word, you also get the number of documents it occurs in (the ndoc column).
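To keep only the most frequent terms, the same call can simply be limited, for example (a sketch reusing the my_documents, title, body and id names from above; 20 is an arbitrary cut-off):

SELECT word, nentry, ndoc
FROM ts_stat('SELECT to_tsvector(''english'', title) || to_tsvector(''english'', body) FROM my_documents WHERE id < 500')
ORDER BY nentry DESC
LIMIT 20;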
See the documentation for more details: http://www.postgresql.org/docs/current/static/textsearch.html
Answered by ycui
The string should be split on a space ' ' or another delimiter between words, not on the letter 's', unless that is intended, e.g. treating 'myWordshere' as 'myWord' and 'here'.
SELECT word, count(*)
FROM (
    SELECT regexp_split_to_table(some_column, ' ') AS word
    FROM some_table
) t
GROUP BY word;
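If there is any doubt about how the backslash in '\s' is handled (with standard_conforming_strings off, unrecognized escapes are consumed and the pattern can degrade to a plain 's'), the escape-string form can be used instead; it always delivers \s+ to the regex engine and splits on runs of whitespace. A sketch with the same placeholder names:

SELECT regexp_split_to_table(some_column, E'\\s+') AS word   -- E'\\s+' is the regex \s+
FROM some_table;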