SQL 标记数据库设计

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/48475/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-31 23:16:47  来源:igfitidea点击:

Database Design for Tagging

sqldatabase-designtagstagging

提问by Christian Berg

How would you design a database to support the following tagging features:

您将如何设计数据库以支持以下标记功能:

  • items can have a large number of tags
  • searches for all items that are tagged with a given set of tags must be quick (the items must have ALL tags, so it's an AND-search, not an OR-search)
  • creating/writing items may be slower to enable quick lookup/reading
  • 项目可以有大量标签
  • 搜索带有一组给定标签的所有项目必须快速(项目必须具有所有标签,因此它是 AND 搜索,而不是 OR 搜索)
  • 创建/写入项目可能会较慢以启用快速查找/阅读

Ideally, the lookup of all items that are tagged with (at least) a set of n given tags should be done using a single SQL statement. Since the number of tags to search for as well as the number of tags on any item are unknown and may be high, using JOINs is impractical.

理想情况下,应该使用单个 SQL 语句来查找所有标记有(至少)n 个给定标记的项目。由于要搜索的标签数量以及任何项目上的标签数量是未知的并且可能很高,因此使用 JOIN 是不切实际的。

Any ideas?

有任何想法吗?



Thanks for all the answers so far.

感谢您到目前为止的所有答案。

If I'm not mistaken, however, the given answers show how to do an OR-search on tags. (Select all items that have one or more of n tags). I am looking for an efficient AND-search. (Select all items that have ALL n tags - and possibly more.)

但是,如果我没有记错的话,给出的答案显示了如何对标签进行 OR 搜索。(选择具有 n 个标签中的一个或多个的所有项目)。我正在寻找有效的 AND 搜索。(选择具有 ALL n 个标签的所有项目 - 可能还有更多。)

采纳答案by Troels Arvin

About ANDing: It sounds like you are looking for the "relational division" operation. This articlecovers relational division in concise and yet comprehendible way.

关于ANDing:听起来您正在寻找“关系除法”操作。本文以简洁但易于理解的方式介绍了关系划分。

About performance: A bitmap-based approach intuitively sounds like it will suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", like digiguru suggests: It sounds like a complicated situation whenever new tags are added(?) But some DBMSes (including Oracle) offer bitmap indexes which may somehow be of use, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them in a proper when when performing the query plan.

关于性能:基于位图的方法直观上听起来很适合这种情况。但是,我不相信“手动”实现位图索引是一个好主意,就像 digiguru 建议的那样:每当添加新标签时,这听起来像是一个复杂的情况(?)但是一些 DBMS(包括 Oracle)提供位图索引,这可能以某种方式有用,因为内置索引系统消除了索引维护的潜在复杂性;此外,提供位图索引的 DBMS 在执行查询计划时应该能够正确地考虑它们。

回答by Jeff Atwood

Here's a good article on tagging Database schemas:

这是一篇关于标记数据库模式的好文章:

http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/

http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/

along with performance tests:

以及性能测试:

http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/

http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/

Note that the conclusions there are very specific to MySQL, which (at least in 2005 at the time that was written) had very poor full text indexing characteristics.

请注意,这里的结论是针对 MySQL 的,它(至少在 2005 年编写时)具有非常差的全文索引特性。

回答by Slartibartfast

I don't see a problem with a straightforward solution: Table for items, table for tags, crosstable for "tagging"

我认为一个简单的解决方案没有问题:项目表、标签表、“标记”交叉表

Indices on cross table should be enough optimisation. Selecting appropriate items would be

交叉表上的索引应该足够优化。选择合适的项目将是

SELECT * FROM items WHERE id IN  
    (SELECT DISTINCT item_id FROM item_tag WHERE  
    tag_id = tag1 OR tag_id = tag2 OR ...)  

AND tagging would be

和标记将是

SELECT * FROM items WHERE  
    EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag1)  
    AND EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag2)  
    AND ...

which is admittedly, not so efficient for large number of comparing tags. If you are to maintain tag count in memory, you could make query to start with tags that are not often, so AND sequence would be evaluated quicker. Depending on expected number of tags to be matched against and expectancy of matching any single of them this could be OK solution, if you are to match 20 tags, and expect that some random item will match 15 of them, then this would still be heavy on a database.

诚然,对于大量比较标签来说效率不高。如果您要在内存中维护标签计数,您可以从不经常使用的标签开始查询,这样可以更快地评估 AND 序列。根据要匹配的标签的预期数量和匹配其中任何一个标签的预期,这可能是可以的解决方案,如果您要匹配 20 个标签,并期望某个随机项目将匹配其中的 15 个,那么这仍然很重在数据库上。

回答by Winston Fassett

I just wanted to highlight that the article that @Jeff Atwood links to (http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/) is very thorough (It discusses the merits of 3 different schema approaches) and has a good solution for the AND queries that will usually perform better than what has been mentioned here so far (i.e. it doesn't use a correlated subquery for each term). Also lots of good stuff in the comments.

我只想强调@Jeff Atwood 链接到的文章(http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/)非常详尽(它讨论了 3 种不同模式的优点)方法)并且对 AND 查询有一个很好的解决方案,该解决方案的性能通常比到目前为止提到的要好(即它不为每个术语使用相关的子查询)。评论里也有很多好东西。

ps - The approach that everyone is talking about here is referred to as the "Toxi" solution in the article.

ps——这里大家讨论的方法在文章中被称为“Toxi”解决方案。

回答by Zizzencs

You might want to experiment with a not-strictly-database solution like a Java Content Repositoryimplementation (e.g. Apache Hymanrabbit) and use a search engine built on top of that like Apache Lucene.

您可能想尝试使用非严格的数据库解决方案,例如Java 内容存储库实现(例如Apache Hymanrabbit),并使用构建在其之上的搜索引擎,例如Apache Lucene

This solution with the appropriate caching mechanisms would possibly yield better performance than a home-grown solution.

这种具有适当缓存机制的解决方案可能会产生比本土解决方案更好的性能。

However, I don't really think that in a small or medium-sized application you would require a more sophisticated implementation than the normalized database mentioned in earlier posts.

但是,我真的不认为在中小型应用程序中,您需要比之前帖子中提到的规范化数据库更复杂的实现。

EDIT: with your clarification it seems more compelling to use a JCR-like solution with a search engine. That would greatly simplify your programs in the long run.

编辑:根据您的说明,在搜索引擎中使用类似 JCR 的解决方案似乎更具吸引力。从长远来看,这将大大简化您的程序。

回答by Brad Bruce

The easiest method is to create a tagstable.
Target_Type-- in case you are tagging multiple tables
Target-- The key to the record being tagged
Tag-- The text of a tag

最简单的方法是创建一个标签表。
Target_Type-- 如果您要标记多个表
Target-- 被标记记录的键 -- 标记
Tag的文本

Querying the data would be something like:

查询数据将类似于:

Select distinct target from tags   
where tag in ([your list of tags to search for here])  
and target_type = [the table you're searching]

UPDATE
Based on your requirement to AND the conditions, the query above would turn into something like this

UPDATE
根据您对 AND 条件的要求,上面的查询将变成这样

select target
from (
  select target, count(*) cnt 
  from tags   
  where tag in ([your list of tags to search for here])
    and target_type = [the table you're searching]
)
where cnt = [number of tags being searched]

回答by chakrit

I'd second @Zizzencs suggestion that you might want something that's not totally (R)DB-centric

我第二个@Zizzencs 建议你可能想要一些不完全以 (R)DB 为中心的东西

Somehow, I believe that using plain nvarchar fields to store that tags with some proper caching/indexing might yield faster results. But that's just me.

不知何故,我相信使用普通的 nvarchar 字段来存储带有一些适当缓存/索引的标签可能会产生更快的结果。但这只是我。

I've implemented tagging systems using 3 tables to represent a Many-to-Many relationship before (Item Tags ItemTags), but I suppose you will be dealing with tags in a lot of places, I can tell you that with 3 tables having to be manipulated/queried simultaneously all the time will definitely make your code more complex.

我之前已经实现了使用 3 个表来表示多对多关系的标记系统(项目标签 ItemTags),但我想你会在很多地方处理标签,我可以告诉你,3 个表必须一直同时操作/查询肯定会使您的代码更加复杂。

You might want to consider if the added complexity is worth it.

您可能需要考虑增加的复杂性是否值得。

回答by digiguru

What I like to do is have a number of tables that represent the raw data, so in this case you'd have

我喜欢做的是有许多代表原始数据的表,所以在这种情况下你有

Items (ID pk, Name, <properties>)
Tags (ID pk, Name)
TagItems (TagID fk, ItemID fk)

This works fast for the write times, and keeps everything normalized, but you may also note that for each tag, you'll need to join tables twice for every further tag you want to AND, so it's got slow read.

这在写入时间上工作得很快,并保持一切正常化,但您可能还会注意到,对于每个标签,您需要为每个想要 AND 的其他标签连接两次表,因此读取速度很慢。

A solution to improve read is to create a caching table on command by setting up a stored procedure that essentially creates new table that represents the data in a flattened format...

改进读取的一种解决方案是通过设置存储过程来根据命令创建缓存表,该存储过程实际上创建了以扁平格式表示数据的新表......

CachedTagItems(ID, Name, <properties>, tag1, tag2, ... tagN)

Then you can consider how often the Tagged Item table needs to be kept up to date, if it's on every insert, then call the stored procedure in a cursor insert event. If it's an hourly task, then set up an hourly job to run it.

然后你可以考虑 Tagged Item 表需要保持最新的频率,如果它是在每次插入时,然后在游标插入事件中调用存储过程。如果它是每小时任务,则设置每小时作业来运行它。

Now to get really clever in data retrieval, you'll want to create a stored procedure to get data from the tags. Rather than using nested queries in a massive case statement, you want to pass in a single parameter containing a list of tags you want to select from the database, and return a record set of Items. This would be best in binary format, using bitwise operators.

现在要真正聪明地进行数据检索,您需要创建一个存储过程来从标签中获取数据。与其在大量 case 语句中使用嵌套查询,不如传入一个包含要从数据库中选择的标签列表的单个参数,并返回一个 Items 记录集。这最好是二进制格式,使用按位运算符。

In binary format, it is easy to explain. Let's say there are four tags to be assigned to an item, in binary we could represent that

在二进制格式中,很容易解释。假设有四个标签要分配给一个项目,我们可以用二进制表示

0000

If all four tags are assigned to an object, the object would look like this...

如果所有四个标签都分配给一个对象,该对象将如下所示......

1111

If just the first two...

如果只是前两个...

1100

Then it's just a case of finding the binary values with the 1s and zeros in the column you want. Using SQL Server's Bitwise operators, you can check that there is a 1 in the first of the columns using very simple queries.

那么这只是在您想要的列中找到带有 1 和 0 的二进制值的一种情况。使用 SQL Server 的按位运算符,您可以使用非常简单的查询来检查第一列中是否有 1。

Check this link to find out more.

查看此链接以了解更多信息

回答by Portman

To paraphrase what others have said: the trick isn't in the schema, it's in the query.

套用其他人所说的话:诀窍不在模式中,而在查询中

The naive schema of Entities/Labels/Tags is the right way to go. But as you've seen, it's not immediately clear how to perform an AND query with a lot of tags.

实体/标签/标签的朴素模式是正确的方法。但正如您所见,如何使用大量标签执行 AND 查询并不是很清楚。

The best way to optimize that query will be platform-dependent, so I would recommend re-tagging your question with your RDBS and changing the title to something like "Optimal way to perform AND query on a tagging database".

优化该查询的最佳方法将取决于平台,因此我建议您使用 RDBS 重新标记您的问题,并将标题更改为“在标记数据库上执行 AND 查询的最佳方法”。

I have a few suggestions for MS SQL, but will refrain in case that's not the platform you're using.

我有一些关于 MS SQL 的建议,但如果这不是您使用的平台,我会避免。

回答by FlySwat

You won't be able to avoid joins and still be somewhat normalized.

您将无法避免连接并且仍然有些标准化。

My approach is to have a Tag Table.

我的方法是有一个标签表。

 TagId (PK)| TagName (Indexed)

Then, you have a TagXREFID column in your items table.

然后,您的项目表中有一个 TagXREFID 列。

This TagXREFID column is a FK to a 3rd table, I'll call it TagXREF:

此 TagXREFID 列是第三个表的 FK,我将其称为 TagXREF:

 TagXrefID | ItemID | TagId

So, to get all tags for an item would be something like:

因此,要获取项目的所有标签将类似于:

SELECT Tags.TagId,Tags.TagName 
     FROM Tags,TagXref 
     WHERE TagXref.TagId = Tags.TagId 
         AND TagXref.ItemID = @ItemID

And to get all items for a tag, I'd use something like this:

为了获得一个标签的所有项目,我会使用这样的东西:

SELECT * FROM Items, TagXref
     WHERE TagXref.TagId IN 
          ( SELECT Tags.TagId FROM Tags
                WHERE Tags.TagName = @TagName; )
     AND Items.ItemId = TagXref.ItemId;

To AND a bunch of tags together, You would to modify the above statement slightly to add AND Tags.TagName = @TagName1 AND Tags.TagName = @TagName2 etc...and dynamically build the query.

要将一堆标签 AND 放在一起,您需要稍微修改上面的语句以添加 AND Tags.TagName = @TagName1 AND Tags.TagName = @TagName2 等...并动态构建查询。