MySQL: Using SQL to determine word count stats of a text field

Question

Asked by Rob

I've recently been working on some database search functionality and wanted to get some information like the average words per document (e.g. text field in the database). The only thing I have found so far (without processing in language of choice outside the DB) is:

SELECT AVG(LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1)
FROM documents

This seems to work* but do you have other suggestions? I'm currently using MySQL 4 (hope to move to version 5 for this app soon), but am also interested in general solutions.

Thanks!

* I can imagine that this is a pretty rough way to determine this as it does not account for HTML in the content and the like as well. That's OK for this particular project but again are there better ways?
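
To see where the space-counting formula drifts, here is a minimal Python sketch that mirrors the arithmetic of the query above. The helper name `space_based_count` and the sample strings are illustrative assumptions, not part of the original question, and plain whitespace splitting is used only as a rough reference point.

```python
def space_based_count(content: str) -> int:
    """Mirror LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1."""
    return len(content) - len(content.replace(" ", "")) + 1


# Hypothetical sample inputs chosen to exercise the formula's weak spots.
samples = {
    "single spaces": "the quick brown fox",
    "double space": "the quick  brown fox",
    "newline separator": "the quick\nbrown fox",
    "html markup": '<p class="x">the quick</p>',
    "empty string": "",
}

for label, text in samples.items():
    # len(text.split()) splits on any run of whitespace, for comparison.
    print(f"{label:18} formula={space_based_count(text)} split={len(text.split())}")
```

Consecutive spaces inflate the result, newlines and tabs are not counted as separators, an empty field still counts as one word, and markup is counted as if it were text.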

Update: To define what I mean by "better": either more accurate, more efficient, or more "correct" (easy to maintain, good practice, etc.). For the content I have available, the query above is fast enough and accurate enough for this project, but I may need something similar in the future (hence the question).

Answer 1

Answered by outis

The text handling capabilities of MySQL aren't good enough for what you want. A stored function is an option, but will probably be slow. Your best bet to process the data within MySQL is to add a user defined function. If you're going to build a newer version of MySQL anyway, you could also add a native function.

The "correct" way is to process the data outside the DB since DBs are for storage, not processing, and any heavy processing might put too much of a load on the DBMS. Additionally, calculating the word count outside of MySQL makes it easier to change the definition of what counts as a word. How about storing the word count in the DB and updating it when a document is changed?
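
As a rough illustration of counting outside the database and storing the result, the Python sketch below recomputes a word count whenever a document is saved. It is only a sketch under stated assumptions: the `id` and `word_count` columns, the `save_document` helper, and the word pattern are hypothetical, and an in-memory SQLite database stands in for MySQL so the example runs without a server (a MySQL driver would typically use `%s` placeholders instead of `?`).

```python
import re
import sqlite3

# Assumed definition of a "word": a run of ASCII letters or digits.
# Changing this one pattern changes what counts as a word everywhere.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def count_words(text: str) -> int:
    return len(WORD_RE.findall(text))


def save_document(conn, doc_id: int, content: str) -> None:
    """Insert or overwrite a document and refresh its stored word count."""
    conn.execute(
        "REPLACE INTO documents (id, content, word_count) VALUES (?, ?, ?)",
        (doc_id, content, count_words(content)),
    )


conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT, word_count INTEGER)"
)
save_document(conn, 1, "the quick brown fox")
save_document(conn, 2, "hello world")
save_document(conn, 2, "hello again, world")  # editing recomputes the count

# The statistic becomes a cheap aggregate over a precomputed column.
avg_words = conn.execute("SELECT AVG(word_count) FROM documents").fetchone()[0]
print(avg_words)
```

With the count stored per row, the average no longer requires scanning the text of every document at query time.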

Example stored function:

DELIMITER $$
CREATE FUNCTION wordcount(str LONGTEXT)
       RETURNS INT
       DETERMINISTIC
       SQL SECURITY INVOKER
       NO SQL
  BEGIN
    DECLARE wordCnt, idx, maxIdx INT DEFAULT 0;
    -- TRUE when the current/previous character is alphanumeric
    DECLARE currChar, prevChar BOOL DEFAULT 0;
    SET maxIdx = CHAR_LENGTH(str);
    SET idx = 1;
    -- Scan the string one character at a time
    WHILE idx <= maxIdx DO
        SET currChar = SUBSTRING(str, idx, 1) RLIKE '[[:alnum:]]';
        -- A word starts at each non-alphanumeric to alphanumeric transition
        IF NOT prevChar AND currChar THEN
            SET wordCnt = wordCnt + 1;
        END IF;
        SET prevChar = currChar;
        SET idx = idx + 1;
    END WHILE;
    RETURN wordCnt;
  END
$$
DELIMITER ;
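
For checking how this logic behaves on sample strings without installing the function, the following Python port may help. It is an approximation rather than an exact equivalent: `str.isalnum()` is Unicode-aware, so results for non-ASCII text may differ from MySQL's `[[:alnum:]]` depending on the server's character set. The sample strings are illustrative.

```python
def wordcount(text: str) -> int:
    """Count transitions from a non-alphanumeric to an alphanumeric character."""
    word_cnt = 0
    prev_is_alnum = False
    for ch in text:
        curr_is_alnum = ch.isalnum()
        # A new word begins where an alphanumeric follows a non-alphanumeric.
        if curr_is_alnum and not prev_is_alnum:
            word_cnt += 1
        prev_is_alnum = curr_is_alnum
    return word_cnt


for sample in ["the quick  brown fox", "<p>the quick</p>", "teacher's book", ""]:
    print(repr(sample), wordcount(sample))
```

Repeated spaces and empty strings are handled correctly, but tag names inside markup are still counted as words, so HTML would need to be stripped first if that matters.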

Answer 2

Answered by Jules

This is quite a bit faster, though slightly less accurate. I found that it undercounts by about 4%, which is OK for "estimate" scenarios.

SELECT
    ROUND(
        (CHAR_LENGTH(content) - CHAR_LENGTH(REPLACE(content, ' ', '')))
        / CHAR_LENGTH(' ')
    ) AS count
FROM documents

Answer 3

Answered by Sasha Pachev

You can use the word_count() UDF from https://github.com/spachev/mysql_udf_bundle. I ported the logic from the accepted answer, with the difference that my code only supports the latin1 charset. The logic would need to be reworked to support other charsets. Also, both implementations always treat a non-alphanumeric character as a delimiter, which may not always be desirable: for example, "teacher's book" is counted as three words by both implementations.
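
To make the delimiter caveat concrete, here is a small Python sketch comparing two word definitions expressed as regular expressions. Neither pattern comes from the UDF or the stored function: `STRICT` imitates the "every non-alphanumeric character is a delimiter" rule for ASCII text, and `LENIENT` is a hypothetical alternative that keeps apostrophes and hyphens inside a word.

```python
import re

# Every non-alphanumeric character ends a word (ASCII-only approximation).
STRICT = re.compile(r"[A-Za-z0-9]+")
# Hypothetical alternative: allow a single ' or - between alphanumeric runs.
LENIENT = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def count(pattern: re.Pattern, text: str) -> int:
    return len(pattern.findall(text))


for text in ["teacher's book", "a well-known state-of-the-art method"]:
    print(f"{text!r}: strict={count(STRICT, text)} lenient={count(LENIENT, text)}")
```

Which definition is right depends on the corpus, which is one reason to keep the rule in a place where it is easy to change.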

The UDF version is, of course, significantly faster. For a quick test I tried both on a dataset from Project Gutenberg consisting of 9751 records totaling about 3 GB. The UDF processed all of them in 18 seconds, while the stored function took 63 seconds to process just 30 records (which the UDF does in 0.05 seconds). So the UDF is roughly 1000 times faster in this case.
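
The arithmetic behind the "roughly 1000 times" figure can be checked with a few lines of Python. The timings are the ones quoted above; the full-dataset estimate for the stored function is a linear extrapolation, which assumes the 30 sampled records are representative in size and is therefore only indicative.

```python
stored_fn_seconds = 63.0   # stored function, 30 records
stored_fn_records = 30
udf_seconds_sample = 0.05  # UDF, same 30 records
udf_seconds_all = 18.0     # UDF, all records
total_records = 9751

# Direct comparison on the same 30 records.
speedup_sample = stored_fn_seconds / udf_seconds_sample

# Linear extrapolation of the stored function to the full dataset (assumption).
est_stored_fn_all = stored_fn_seconds / stored_fn_records * total_records
est_hours = est_stored_fn_all / 3600
speedup_extrapolated = est_stored_fn_all / udf_seconds_all

print(round(speedup_sample), round(est_hours, 1), round(speedup_extrapolated))
```

Both ratios land a little above 1000, and the extrapolation puts the stored function at several hours for the full dataset versus 18 seconds for the UDF.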

A UDF will be faster than any other method that does not involve modifying the MySQL source code. This is because it has access to the string bytes in memory and can operate on them directly, without the data having to be copied around. It is also compiled to machine code and runs directly on the CPU.
