How to extract the nth word and count word occurrences in a MySQL string?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/4021507/

Date: 2020-08-31 17:33:30  Source: igfitidea

How to extract the nth word and count word occurrences in a MySQL string?

Tags: mysql, regex, word-count

Asked by Noam

I would like to have a mysql query like this:

select <second word in text> word, count(*) from table group by word;

All the regex examples in mysql are used to query if the text matches the expression, but not to extract text out of an expression. Is there such a syntax?

Answered by Brendan Bullen

The following is a proposed solution for the OP's specific problem (extracting the 2nd word of a string), but it should be noted that, as mc0e's answer states, actually extracting regex matches is not supported out-of-the-box in MySQL. If you really need this, then your choices are basically to 1) do it in post-processing on the client, or 2) install a MySQL extension to support it.

BenWells has it very nearly correct. Working from his code, here's a slightly adjusted version:

SUBSTRING(
  sentence,
  LOCATE(' ', sentence) + CHAR_LENGTH(' '),
  LOCATE(' ', sentence, ( LOCATE(' ', sentence) + 1 ) )
    - ( LOCATE(' ', sentence) + CHAR_LENGTH(' ') )
)

As a working example, I used:

SELECT SUBSTRING(
  sentence,
  LOCATE(' ', sentence) + CHAR_LENGTH(' '),
  LOCATE(' ', sentence, ( LOCATE(' ', sentence) + 1 ) )
    - ( LOCATE(' ', sentence) + CHAR_LENGTH(' ') )
) as string
FROM (SELECT 'THIS IS A TEST' AS sentence) temp

This successfully extracts the word IS

Answered by Damien Goor

Shorter option to extract the second word in a sentence:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('THIS IS A TEST', ' ',  2), ' ', -1) as FoundText

MySQL docs for SUBSTRING_INDEX
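
To tie this back to the original question, a minimal sketch of the grouped count might look like the following. The table name tbl and the column name txt are hypothetical, and words are assumed to be separated by single spaces.

```sql
-- Hypothetical table `tbl` with a text column `txt` (names assumed for illustration).
-- The inner SUBSTRING_INDEX keeps everything before the 2nd space;
-- the outer SUBSTRING_INDEX keeps what follows the last space of that prefix.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(txt, ' ', 2), ' ', -1) AS word,
       COUNT(*) AS occurrences
FROM tbl
GROUP BY word;
```

Replacing the 2 with n should give the nth word. Note that a row with fewer than n words yields its last word rather than NULL, so such rows may need to be filtered out first.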

Answered by BenWells

According to http://dev.mysql.com/, the SUBSTRING function takes the start position and then the length, so surely the function for the second word would be:

SUBSTRING(sentence,LOCATE(' ',sentence),(LOCATE(' ',LOCATE(' ',sentence))-LOCATE(' ',sentence)))

Answered by Mark Byers

No, there isn't a syntax for extracting text using regular expressions. You have to use the ordinary string manipulation functions.

Alternatively select the entire value from the database (or the first n characters if you are worried about too much data transfer) and then use a regular expression on the client.
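
For readers on newer servers: the answers on this page were written for older MySQL versions. Assuming MySQL 8.0.4 or later (an assumption about your environment, not something stated in the question), the built-in REGEXP_SUBSTR function accepts a start position and an occurrence number, so a sketch of extracting the second word could look like this:

```sql
-- Assumes MySQL 8.0.4+; REGEXP_SUBSTR(expr, pat, pos, occurrence).
-- '[^ ]+' matches a run of non-space characters; pos = 1, occurrence = 2
-- asks for the second such run.
SELECT REGEXP_SUBSTR(sentence, '[^ ]+', 1, 2) AS second_word
FROM (SELECT 'THIS IS A TEST' AS sentence) temp;
```

On older versions this function is not available, and the string-function or client-side approaches described here still apply.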

Answered by mc0e

As others have said, mysql does not provide regex tools for extracting sub-strings. That's not to say you can't have them though if you're prepared to extend mysql with user-defined functions:

https://github.com/mysqludf/lib_mysqludf_preg

That may not be much help if you want to distribute your software, being an impediment to installing your software, but for an in-house solution it may be appropriate.

Answered by Hypolite Petovan

I used Brendan Bullen's answer as a starting point for a similar issue I had, which was to retrieve the value of a specific field in a JSON string. However, as I commented on his answer, it is not entirely accurate. If your left boundary isn't just a space like in the original question, then the discrepancy increases.

Corrected solution:

SUBSTRING(
    sentence,
    LOCATE(' ', sentence) + 1,
    LOCATE(' ', sentence, (LOCATE(' ', sentence) + 1)) - LOCATE(' ', sentence) - 1
)

The two differences are the +1 in the SUBSTRING index parameter and the -1 in the length parameter.
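
As a quick check of the corrected expression, here is a sketch that runs it against the same sample sentence used earlier on this page (the alias second_word is chosen here for illustration):

```sql
-- First space is at position 5, so the word starts at 5 + 1 = 6.
-- The next space is at position 8, so the length is 8 - 5 - 1 = 2.
SELECT SUBSTRING(
    sentence,
    LOCATE(' ', sentence) + 1,
    LOCATE(' ', sentence, LOCATE(' ', sentence) + 1) - LOCATE(' ', sentence) - 1
) AS second_word
FROM (SELECT 'THIS IS A TEST' AS sentence) temp;
```

This should return IS. If the sentence has no second space, the inner LOCATE returns 0, the length becomes negative, and SUBSTRING returns an empty string, so a two-word sentence would need separate handling.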

For a more general solution to "find the first occurrence of a string between two provided boundaries":

SUBSTRING(
    haystack,
    LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>'),
    LOCATE(
        '<rightBoundary>',
        haystack,
        LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>')
    )
    - (LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>'))
)
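
To make the template concrete, here is a sketch with the placeholders filled in for a JSON-like string. The sample value and the boundaries '"name":"' and '"' are made up for illustration.

```sql
-- Left boundary '"name":"' starts at position 9 and is 8 characters long,
-- so the value starts at position 17; the closing '"' is found at position 21.
SELECT SUBSTRING(
    haystack,
    LOCATE('"name":"', haystack) + CHAR_LENGTH('"name":"'),
    LOCATE(
        '"',
        haystack,
        LOCATE('"name":"', haystack) + CHAR_LENGTH('"name":"')
    )
    - (LOCATE('"name":"', haystack) + CHAR_LENGTH('"name":"'))
) AS extracted
FROM (SELECT '{"id":7,"name":"Noam"}' AS haystack) temp;
```

This should return Noam. Note that LOCATE returns 0 when a boundary is absent, so rows that do not contain the left boundary will produce a meaningless result unless they are filtered out beforehand.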

Answered by user483085

I don't think such a thing is possible. You can use the SUBSTRING function to extract the part you want.

Answered by Steve Chambers

My home-grown regular expression replace function can be used for this.

Demo

See this DB-Fiddle demo, which returns the second word ("I") from a famous sonnet and the number of occurrences of it (1).

SQL

Assuming MySQL 8 or later is being used (to allow use of a Common Table Expression), the following will return the second word and the number of occurrences of it:

WITH cte AS (
     SELECT digits.idx,
            SUBSTRING_INDEX(SUBSTRING_INDEX(words, '~', digits.idx + 1), '~', -1) word
     FROM
     (SELECT reg_replace(UPPER(txt),
                         '[^''a-zA-Z-]+',
                         '~',
                         TRUE,
                         1,
                         0) AS words
      FROM tbl) delimited
     INNER JOIN
     (SELECT @row := @row + 1 as idx FROM 
      (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t1,
      (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t2,
      (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t3,
      (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t4,
      (SELECT @row := -1) t5) digits
     ON LENGTH(REPLACE(words, '~' , '')) <= LENGTH(words) - digits.idx)
SELECT c.word,
       subq.occurrences
FROM cte c
LEFT JOIN (
  SELECT word,
         COUNT(*) AS occurrences
  FROM cte
  GROUP BY word
) subq
ON c.word = subq.word
WHERE idx = 1; /* idx is zero-based so 1 here gets the second word */

Explanation

A few tricks are used in the SQL above, and some credit is due. Firstly, the regular expression replacer is used to replace all continuous blocks of non-word characters, each being replaced by a single tilde (~) character. Note: a different character could be chosen instead if there is any possibility of a tilde appearing in the text.

The technique from this answer is then used for transforming a string with delimited values into separate row values. It's combined with the clever technique from this answer for generating a table consisting of a sequence of incrementing numbers: 0 to 9,999 in this case.
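
The row-splitting step can be sketched on its own, without the custom regular expression function. In this illustration the already-delimited string and the five-row numbers table are made up, and the join condition is a rearranged form of the one used in the query above.

```sql
-- Each idx from 0 up to the number of '~' delimiters selects one word.
-- The delimiter count is the length difference after removing every '~'.
SELECT n.idx,
       SUBSTRING_INDEX(SUBSTRING_INDEX(s.words, '~', n.idx + 1), '~', -1) AS word
FROM (SELECT 'SHALL~I~COMPARE~THEE' AS words) s
INNER JOIN (SELECT 0 AS idx UNION ALL SELECT 1 UNION ALL SELECT 2
            UNION ALL SELECT 3 UNION ALL SELECT 4) n
  ON n.idx <= CHAR_LENGTH(s.words) - CHAR_LENGTH(REPLACE(s.words, '~', ''))
ORDER BY n.idx;
```

With three delimiters in the sample string, only idx values 0 to 3 satisfy the join, giving one row per word; the row with idx = 1 holds the second word.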

Answered by Antonio Rivera

The field's value is:

 "- DE-HEB 20% - DTopTen 1.2%"

SELECT ....
SUBSTRING_INDEX(SUBSTRING_INDEX(DesctosAplicados, 'DE-HEB ',  -1), '-', 1) AS `DE-HEB`,
SUBSTRING_INDEX(SUBSTRING_INDEX(DesctosAplicados, 'DTopTen ',  -1), '-', 1) AS DTopTen
FROM TABLA

Result is:

  DE-HEB       DTopTen
    20%          1.2%