MySQL: Using SQL to determine word count stats of a text field
Original question: http://stackoverflow.com/questions/748276/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Using SQL to determine word count stats of a text field
Asked by Rob
I've recently been working on some database search functionality and wanted to get some information such as the average number of words per document (a document being, e.g., a text field in the database). The only thing I have found so far (without processing in a language of choice outside the DB) is:
SELECT AVG(LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1)
FROM documents
This seems to work* but do you have other suggestions? I'm currently using MySQL 4 (hope to move to version 5 for this app soon), but am also interested in general solutions.
Thanks!
* I can imagine that this is a pretty rough way to determine this as it does not account for HTML in the content and the like as well. That's OK for this particular project but again are there better ways?
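To see where the space-counting estimate drifts, here is a minimal Python sketch that mirrors the expression in the query and compares it with plain whitespace splitting. The function names are hypothetical. Python's `len` counts characters whereas MySQL's `LENGTH` counts bytes, but that makes no difference to the subtraction here because the removed character is a single-byte space.

```python
def space_estimate(content: str) -> int:
    """Mirror LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1."""
    return len(content) - len(content.replace(" ", "")) + 1


def split_count(content: str) -> int:
    """Count runs of non-whitespace characters, for comparison."""
    return len(content.split())


samples = [
    "one two three",      # single spaces: both counts agree
    "one  two   three",   # repeated spaces inflate the estimate
    "one\ntwo\tthree",    # newlines and tabs are not counted as separators
    "",                   # empty content still yields 1
    "<p>one two</p>",     # markup is simply counted as part of the words
]

for text in samples:
    print(repr(text), space_estimate(text), split_count(text))
```

The estimate is exact only when words are separated by exactly one space and the content is non-empty, which is probably why it is good enough for fairly clean prose.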
Update: To define what I mean by "better": either more accurate, more efficient, or more "correct" (easy to maintain, good practice, etc.). For the content I have available, the query above is fast enough and accurate enough for this project, but I may need something similar in the future (hence the question).
Answered by outis
The text-handling capabilities of MySQL aren't good enough for what you want. A stored function is an option, but will probably be slow. Your best bet for processing the data within MySQL is to add a user-defined function. If you're going to build a newer version of MySQL anyway, you could also add a native function.
The "correct" way is to process the data outside the DB since DBs are for storage, not processing, and any heavy processing might put too much of a load on the DBMS. Additionally, calculating the word count outside of MySQL makes it easier to change the definition of what counts as a word. How about storing the word count in the DB and updating it when a document is changed?
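As a rough illustration of storing the count alongside the document, here is a minimal Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in so that it runs without a server; a MySQL client library would take its place in practice. The `word_count` column, the `save_document` helper, and the regular expression that defines a "word" are all assumptions made for the example, not part of the original question's schema.

```python
import re
import sqlite3

# Assumed definition of a "word": a run of ASCII letters or digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def count_words(text: str) -> int:
    """Count words outside the database, where the rule is easy to change."""
    return len(WORD_RE.findall(text))


def save_document(conn: sqlite3.Connection, doc_id: int, content: str) -> None:
    """Insert or replace a document together with its precomputed word count."""
    conn.execute(
        "REPLACE INTO documents (id, content, word_count) VALUES (?, ?, ?)",
        (doc_id, content, count_words(content)),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, content TEXT, word_count INTEGER)"
)

save_document(conn, 1, "The quick brown fox")
save_document(conn, 2, "teacher's book")
save_document(conn, 1, "The quick brown fox jumps")  # document 1 was edited

# The statistic becomes a cheap aggregate over a stored integer column.
avg_words = conn.execute("SELECT AVG(word_count) FROM documents").fetchone()[0]
print(avg_words)
```

With the count stored, changing what counts as a word means changing `count_words` and recomputing the column once, rather than rewriting the query.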
Example stored function:
DELIMITER $$
CREATE FUNCTION wordcount(str LONGTEXT)
    RETURNS INT
    DETERMINISTIC
    SQL SECURITY INVOKER
    NO SQL
BEGIN
    DECLARE wordCnt, idx, maxIdx INT DEFAULT 0;
    -- Flags: is the current / previous character alphanumeric?
    DECLARE currChar, prevChar BOOL DEFAULT 0;
    SET maxIdx = CHAR_LENGTH(str);
    SET idx = 1;
    -- A word starts wherever a non-alphanumeric character is followed by an alphanumeric one.
    WHILE idx <= maxIdx DO
        SET currChar = SUBSTRING(str, idx, 1) RLIKE '[[:alnum:]]';
        IF NOT prevChar AND currChar THEN
            SET wordCnt = wordCnt + 1;
        END IF;
        SET prevChar = currChar;
        SET idx = idx + 1;
    END WHILE;
    RETURN wordCnt;
END
$$
DELIMITER ;
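For cross-checking results outside the database, the loop above can be sketched in Python as follows. This is a port of the same logic, not part of the original answer. It restricts "alphanumeric" to ASCII, which is an assumption: what `[[:alnum:]]` matches for non-ASCII characters in MySQL depends on the server version and character set.

```python
def is_ascii_alnum(ch: str) -> bool:
    """Assumed stand-in for RLIKE '[[:alnum:]]', limited to ASCII."""
    return ch.isascii() and ch.isalnum()


def wordcount(text: str) -> int:
    """Count transitions from a non-alphanumeric to an alphanumeric character."""
    word_cnt = 0
    prev_alnum = False
    for ch in text:
        curr_alnum = is_ascii_alnum(ch)
        if curr_alnum and not prev_alnum:
            word_cnt += 1  # a new run of alphanumerics begins here
        prev_alnum = curr_alnum
    return word_cnt


for sample in ["teacher's book", "  Hello,   world!  ", ""]:
    print(repr(sample), wordcount(sample))
```

Because any non-alphanumeric character ends a word, the apostrophe in "teacher's book" splits it into three words, and repeated spaces or punctuation do not inflate the count.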
Answered by Jules
This is quite a bit faster, though just slightly less accurate. I found it 4% light on the count, which is OK for "estimate" scenarios.
SELECT
    ROUND(
        (CHAR_LENGTH(content) - CHAR_LENGTH(REPLACE(content, " ", "")))
        / CHAR_LENGTH(" ")
    ) AS count
FROM documents
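The expression in this query is an instance of a general idiom: remove every occurrence of a substring, measure how much shorter the string became, and divide by the substring's length. A minimal Python sketch of that idiom, with a hypothetical function name, shows that it also works for delimiters longer than one character:

```python
def count_occurrences(text: str, needle: str) -> int:
    """Count non-overlapping occurrences of needle via the length difference."""
    if not needle:
        raise ValueError("needle must not be empty")
    removed = len(text) - len(text.replace(needle, ""))
    return removed // len(needle)


print(count_occurrences("one two three", " "))
print(count_occurrences("a, b, c, d", ", "))
```

For a single space the divisor is 1, so the query returns the number of spaces. On single-spaced text that is one less than the number of words in each row.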
Answered by Sasha Pachev
You can use the word_count() UDF from https://github.com/spachev/mysql_udf_bundle. I ported the logic from the accepted answer, with the difference that my code only supports the latin1 charset. The logic would need to be reworked to support other charsets. Also, both implementations always treat a non-alphanumeric character as a delimiter, which may not always be desirable: for example, "teacher's book" is counted as three words by both implementations.
The UDF version is, of course, significantly faster. For a quick test I tried both on a dataset from Project Gutenberg consisting of 9751 records totaling about 3 GB. The UDF processed all of them in 18 seconds, while the stored function took 63 seconds to process just 30 records (which the UDF does in 0.05 seconds). So the UDF is roughly 1000 times faster in this case.
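The arithmetic behind the "roughly 1000 times" figure can be checked with a few lines of Python using only the numbers quoted above. The last value is an extrapolation that assumes the 30-record sample is representative of the whole dataset, which the answer does not state.

```python
total_records = 9751
sample_records = 30
sf_sample_seconds = 63.0    # stored function on the 30-record sample
udf_sample_seconds = 0.05   # UDF on the same 30 records

# Ratio of the two timings on the same sample.
speedup = sf_sample_seconds / udf_sample_seconds

# Average stored-function cost per record in the sample.
sf_per_record = sf_sample_seconds / sample_records

# Assumption: per-record cost stays the same across the full dataset.
sf_all_estimate_hours = sf_per_record * total_records / 3600

print(f"speedup on the sample: about {speedup:.0f}x")
print(f"stored function: about {sf_per_record:.1f} s per record")
print(f"extrapolated stored-function run: about {sf_all_estimate_hours:.1f} h")
```

On the sample, the ratio works out to about 1260, consistent with "roughly 1000 times".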
A UDF will be faster than any other method that does not involve modifying the MySQL source code. This is because it has access to the string bytes in memory and can operate on them directly, without their having to be moved around. It is also compiled into machine code and runs directly on the CPU.