postgresql Word frequencies from strings in Postgres?

Question

Asked by Marty

Is it possible to identify distinct words and a count for each, from fields containing text strings in Postgres?

Answer 1

Answered by a_horse_with_no_name

Something like this?

SELECT some_pk, 
       regexp_split_to_table(some_column, '\s') as word
FROM some_table

Getting the distinct words is easy then:

SELECT DISTINCT word
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t

Or, to get the count for each word:

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t
GROUP BY word
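
The query above splits on every single whitespace character, so runs of spaces produce empty "words", and 'Hello' and 'hello' are counted separately. A minimal sketch of one way to tighten that, assuming case-insensitive counting is wanted; the inline sample rows and the occurrences alias are made up for illustration, and the CTE only stands in for a real some_table:

```sql
-- Hypothetical sample data standing in for some_table
WITH some_table(some_pk, some_column) AS (
  VALUES (1, 'Hello  world hello'),
         (2, 'world of words')
)
SELECT word, count(*) AS occurrences
FROM (
  -- lower() folds case; '\s+' treats a run of whitespace as one separator
  SELECT regexp_split_to_table(lower(some_column), '\s+') AS word
  FROM some_table
) t
WHERE word <> ''          -- drop empty strings from leading/trailing whitespace
GROUP BY word
ORDER BY occurrences DESC, word;
```

With these sample rows, hello and world should each be counted twice, and of and words once. Punctuation is not stripped here; a pattern such as '[^[:alnum:]]+' could be used as the separator instead if that matters.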

Answer 2

Answered by Charl Botha

You could also use the PostgreSQL text-searching functionality for this, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');

will yield:

  word   | ndoc | nentry 
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)

(PostgreSQL applies language-dependent stemming and stop-word removal, which may or may not be what you want. Both can be disabled by using the simple text search configuration instead of english; the configuration name is passed as the first argument to to_tsvector, as in the example below.)
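
As a sketch of the difference, the same sample sentence can be run through the simple configuration. This assumes the stock simple configuration, which lowercases tokens but does not stem them or remove stop words; the ORDER BY is added only to make the row order predictable:

```sql
-- Same input as above, but with the 'simple' configuration named explicitly
SELECT word, ndoc, nentry
FROM ts_stat('SELECT to_tsvector(''simple'', ''hello dere hello hello ridiculous'')')
ORDER BY nentry DESC, word;
```

Here ridiculous should come back unstemmed rather than as ridicul, while hello still has nentry = 3.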

The nested SELECT statement can be any select statement that yields a tsvector column, so you could substitute a query that applies the to_tsvector function to any number of text fields and concatenates the results into a single tsvector, over any subset of your documents, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) FROM my_documents WHERE id < 500') ORDER BY nentry DESC;

This yields a table of total word counts taken from the title and body fields of the documents with id below 500, sorted by descending number of occurrences. For each word, you also get the number of documents it occurs in (the ndoc column).
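
To see how ndoc and nentry differ, the query can be tried against a small throwaway table. This is only a sketch: the temporary my_documents table and its rows are invented for illustration, the simple configuration is used so that no words are stemmed or dropped, and the coalesce calls are an added precaution because concatenating a NULL tsvector would make the whole row NULL:

```sql
-- Hypothetical two-document table, visible only in the current session
CREATE TEMP TABLE my_documents (id int PRIMARY KEY, title text, body text);

INSERT INTO my_documents VALUES
  (1, 'Hello world', 'hello again'),
  (2, 'Goodbye',     'hello and goodbye');

-- Dollar quoting avoids doubling every single quote inside the inner query
SELECT word, ndoc, nentry
FROM ts_stat($$
  SELECT to_tsvector('simple', coalesce(title, ''))
      || to_tsvector('simple', coalesce(body, ''))
  FROM my_documents
  WHERE id < 500
$$)
ORDER BY nentry DESC, word;
```

With these rows, hello should show ndoc = 2 and nentry = 3 (it appears in both documents, three times in total), while goodbye should show ndoc = 1 and nentry = 2.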

See the documentation for more details: http://www.postgresql.org/docs/current/static/textsearch.html

Answer 3

Answered by ycui

The text should be split on a space ' ' or another delimiter between words, not on an 's', unless that is what you intend, e.g. treating 'myWordshere' as 'myWord' and 'here'.

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, ' ') as word
  FROM some_table
) t
GROUP BY word
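
When the delimiter is a fixed string such as a single space, a regular expression is not strictly needed. As an alternative sketch (not part of the answer above), string_to_array combined with unnest splits on the literal delimiter; the sample rows and the occurrences alias are made up for illustration, and LATERAL assumes PostgreSQL 9.3 or later:

```sql
-- Hypothetical sample data standing in for some_table
WITH some_table(some_column) AS (
  VALUES ('one two two'), ('two three')
)
SELECT word, count(*) AS occurrences
FROM some_table
-- string_to_array splits on the literal ' '; unnest turns the array into rows
CROSS JOIN LATERAL unnest(string_to_array(some_column, ' ')) AS word
GROUP BY word
ORDER BY occurrences DESC, word;
```

With these rows, two should be counted three times and one and three once each.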
