Oracle 数据仓库设计——事实表充当维度?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1638076/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Oracle data warehouse design - fact table acting as a dimension?
提问by user158017
THANKS: Both answers here are very helpful, but I could only pick one. I really appreciate the advice!
谢谢:这里的两个答案都非常有帮助,但我只能选择一个。我真的很感谢你的建议!
our datawarehouse will be used more for workflow reports than traditional analytical reports. Our users care about "current picture" far more than history. (though history matters, too.) We are a government entity that does not have costs or related calculations. Mostly just counts of people within given locations and with related history.
我们的数据仓库将更多地用于工作流报告,而不是传统的分析报告。我们的用户关心“当前图片”远远超过历史。(尽管历史也很重要。)我们是一个没有成本或相关计算的政府实体。大多数情况下只是给定地点内和相关历史的人数。
We are using Oracle, and I have found distinct advantage in using the star join whenever possible and would like to rearchitect everything to as closely resemble the star schema as is reasonable for our business uses. Speed in this DW is vital, and a number of tests have already proven the star schema approach to me.
我们正在使用 Oracle,我发现在任何可能的情况下使用星型连接都有明显的优势,并希望重新构建所有内容,使其与我们的业务用途合理的星型模式非常相似。这个 DW 中的速度至关重要,许多测试已经向我证明了星型模式方法。
Our "person" table is key - it contains over 4 million records and will be the most frequently used source in queries.It can be seen at the center of a star with multiple dimensions (like age, gender, affiliation, location, etc.). It is a very LONG table, particularly when I join it to the address and contact information.
我们的“person”表是关键——它包含超过 400 万条记录,将成为查询中最常用的来源。它可以在具有多个维度(如年龄、性别、隶属关系、位置等)的恒星的中心看到。这是一个很长的表格,特别是当我将它加入地址和联系信息时。
However, it is more like a dimension table when we start looking at history. For example, there are two different history tables that have a person key pointing to the person table. One has over 20 million records and the other has almost 50 million and grows daily.
但是,当我们开始查看历史时,它更像是一个维度表。例如,有两个不同的历史表,它们的 person 键指向 person 表。一个有超过 2000 万条记录,另一个有近 5000 万条记录并且每天都在增长。
Is this table a fact table or a dimension table? Can one work as both? If so, is that going to be a big performance problem? Is it common to query more off of a dimension than a fact? What happens if a DIFFERENT fact table that uses the person table as a dimension is actually only 60,000 records (much smaller.).
这个表是事实表还是维度表?一个可以同时工作吗?如果是这样,这将是一个很大的性能问题吗?查询维度多于事实是否常见?如果使用 person 表作为维度的 DIFFERENT 事实表实际上只有 60,000 条记录(小得多)会发生什么。
I think my problem is that our data and use of it does not fit with the commonly use examples of star schemas.
我认为我的问题是我们的数据和它的使用与星型模式的常用示例不符。
CLARIFICATION:Some good thoughts have been added below, but perhaps I left too much out to really explain well. Here's some more info:
澄清:下面添加了一些好的想法,但也许我遗漏了太多来真正解释清楚。以下是更多信息:
We handle a voter database. We don't have any measures except voter counts by various groups: voter counts by party, by age, by location; voter counts by ballot type and election, by ballot status and election, etc. We do have a "voting history" log as well as an activity audit log (change of address, party, etc.). We have information on which voters are election workers and all that related information. I figure I'll get to the peripheral stuff later.
我们处理选民数据库。除了按不同群体统计选民人数外,我们没有任何措施:按政党、年龄、地点统计选民人数;选民按选票类型和选举、选票状态和选举等进行计数。我们确实有“投票历史”日志以及活动审核日志(地址、政党等的更改)。我们有关于哪些选民是选举工作人员的信息以及所有相关信息。我想稍后我会谈到外围的东西。
For now I'm focusing on our two major "business processes": voter registration(which IS a voter.) and election turnout. In the first, voter is a fact. In the second, voter is a dimension, along with party, election, and type of ballot. (and in case anyone is worried - no we don't know HOW people vote. Just that they do. LOL )
现在我专注于我们的两个主要“业务流程”:选民登记(即选民)和选举投票率。首先,选民是一个事实。在第二种情况下,选民是一个维度,还有政党、选举和选票类型。(万一有人担心 - 不,我们不知道人们如何投票。只是他们这样做。大声笑)
I hope that clarifies things a bit.
我希望这能澄清一些事情。
采纳答案by user158017
ok - this isn't a full "answer", but it's close.
好的 - 这不是一个完整的“答案”,但已经接近了。
Notice this blog entry describing a Kimball lecture: http://database-geek.com/2005/03/28/a-day-with-ralph-kimball-part-2/
请注意这篇描述 Kimball 讲座的博客条目:http: //database-geek.com/2005/03/28/a-day-with-ralph-kimball-part-2/
The reason I'm struggling is that this is a "degenerate" dimension. My voter regnum and associated information is one to one with my "registration" fact table. So it appears that it's even ok with Kimball to throw that into the fact table.
我挣扎的原因是这是一个“退化”的维度。我的选民登记号和相关信息与我的“登记”事实表一一对应。因此,看起来 Kimball 甚至可以将其放入事实表中。
So now I'm just looking into what happens when a fact table is used by another fact table.
所以现在我只是研究当一个事实表被另一个事实表使用时会发生什么。
EDIT: Also, I have found googling the term "monster dimension" to be very helpful. This is much like a slowly changing customer dimension. As long as I am willing to snowflake, I can achieve what I need - star joins when querying voter, and not causing problems to use voter as a dimension for various fact tables.
编辑:另外,我发现在谷歌上搜索“怪物维度”这个词非常有帮助。这很像一个缓慢变化的客户维度。只要我愿意下雪花,我就能实现我所需要的——在查询voter时加入star join,并且使用voter作为各种事实表的维度不会造成问题。
EDIT: Here was my final conclusion: As advised above, the point is to facilitate business process, not to fit the textbook diagram.
编辑:这是我的最终结论:如上所述,重点是促进业务流程,而不是适合教科书图表。
Our business is such that there is absolutely no reason to split apart the voter table (having a fact table for "registrations" and a dimension for "voters") - when querying with that table we will want all the attributes as well as all the flags and text information. I would not want to break the attributes out separately into the "fact" (like Kimball's book shows for customers and orders) because those attributes mean something different when attached to the facts as than when they are attached to the dimensions. Further, voters is used as an attribute in multiple other places, some of which DO fit a traditional star.
我们的业务是绝对没有理由拆分选民表(有一个用于“注册”的事实表和一个用于“选民”的维度)——当使用该表进行查询时,我们将需要所有属性以及所有标志和文本信息。我不想将属性单独分解为“事实”(如 Kimball 为客户和订单展示的书展),因为这些属性在附加到事实时与附加到维度时具有不同的含义。此外,选民在其他多个地方被用作属性,其中一些确实适合传统明星。
My main purpose is SPEED. So I chose a modified format - a lot like the snowflake - where voter is the center of multiple tables and oracle can use the star join when I index everything right. Then, I use voter as a dimension in all my other "stars". In every case, I set it up so that most if not all tables can be joined using the star join, even though it isn't "textbook."
我的主要目的是速度。所以我选择了一种修改后的格式——很像雪花——其中选民是多个表的中心,当我索引一切正确时,oracle 可以使用星形连接。然后,我使用选民作为我所有其他“明星”的维度。在每种情况下,我都会对其进行设置,以便即使不是“教科书”,也可以使用星形联接来联接大多数表(如果不是全部)。
Thanks again for the help!
再次感谢您的帮助!
回答by Irwin M. Fletcher
If possible, my suggestion would be to refactor these tables so they are more in alignment with a true star schema. Although 50 million records sounds like a lot (when thinking about a transactional system) we have multiple fact tables with as many as 500 million rows. Assuming that your hardware was speced for this type of work, you should not have any issues with combining your tables into one large fact table (assuming they are all within the same subject area).
如果可能,我的建议是重构这些表,使它们更符合真正的星型模式。尽管 5000 万条记录听起来很多(考虑到事务系统时),但我们有多个包含多达 5 亿行的事实表。假设您的硬件是为此类工作指定的,那么将您的表合并到一个大的事实表中应该不会有任何问题(假设它们都在同一主题区域内)。
With that said, make sure that you account for the other factors that should be considered when choosing a highly denormalized structure. The star schema is a great design for reporting on data because of the reduction in the necessary joins, however, you often pay a large price for this while updating tables and in disk space. When you say that you are considering using this schema for more of a workflow application, rather than mostly analytics, then I would make sure to account for the updates. Are updates needed in real time or near real time? If so, again you may not want to consider a star.
话虽如此,请确保您考虑了在选择高度非规范化结构时应考虑的其他因素。由于减少了必要的联接,星型模式是一种很好的数据报告设计,但是,在更新表和磁盘空间时,您通常为此付出很大的代价。当您说您正在考虑将此架构用于更多的工作流应用程序,而不是主要用于分析时,那么我会确保考虑到更新。需要实时更新还是接近实时更新?如果是这样,您可能不想再考虑明星。
Finally, yes in some cases we query only our dimension tables, often when an application needs a specific list of items (i.e. products, customers, etc), this is a valid use, however, a better solution would likely leverage an ODS rather than our star schema.
最后,是的,在某些情况下,我们只查询我们的维度表,通常当应用程序需要特定的项目列表(即产品、客户等)时,这是一个有效的用途,但是,更好的解决方案可能会利用 ODS 而不是我们的星型模式。
What I have found is as much as I try to make my schema look like something right out of an Inmon or Kimball textbook, it almost never works without some real world custimization.
我发现就像我试图让我的模式看起来像 Inmon 或 Kimball 教科书中的东西一样,如果没有一些现实世界的自定义,它几乎永远不会奏效。
EditI sure have been more specific with the reference to the ODS.
编辑我肯定已经更具体地参考了 ODS。
An operational data store (or "ODS") is a database designed to integrate data from multiple sources to make analysis and reporting easier. Because the data originates from multiple sources, the integration often involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS is usually designed to contain low level or atomic (indivisible) data (such as transactions and prices) with limited history that is captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the Data warehouse generally on a less frequent basis.
运营数据存储(或“ODS”)是一种数据库,旨在集成来自多个来源的数据,以便更轻松地进行分析和报告。由于数据来自多个来源,因此集成通常涉及清理、解决冗余和检查业务规则的完整性。ODS 通常被设计为包含具有有限历史的低级或原子(不可分割)数据(例如交易和价格),这些数据是“实时”或“近乎实时”捕获的,而不是存储在数据库中的大量数据。数据仓库通常不太频繁。
According to Bill Inmon, the originator of the concept, an ODS is "a subject-oriented, integrated, volatile, current-valued, detailed-only collection of data in support of an organization's need for up-to-the-second, operational, integrated, collective information."
根据该概念的创始人比尔·英蒙 (Bill Inmon) 的说法,ODS 是“面向主题的、集成的、易变的、当前价值的、仅详细的数据集合,以支持组织对最新、可操作的需求,综合的、集体的信息。”
ODS differ from Inmon's definition of enterprise data warehouse by having a limited history, and more frequent update than an EDW. In practice ODS tend to be more reflective of source structures in order to speed implementations and provide a truer representation of production data.
ODS 与 Inmon 对企业数据仓库的定义不同,因为它具有有限的历史记录,并且比 EDW 更新更频繁。在实践中,ODS 往往更能反映源结构,以加快实施速度并提供更真实的生产数据表示。
回答by Damir Sudarevic
Large "people" (customer) dimensions are frequent in telecom, banking, insurance etc. Kimball has a section named "Large Changing Customer Dimensions" under CRM chapter (6). It shows how to create "minidimensions". Frequently changing or frequently analyzed attributes (columns) are broken-off into separate mini-dimension tables. These mini-dimensions are connected via fact table, so fact table has a FK for each of these tables separately.
大型“人”(客户)维度在电信、银行、保险等领域很常见。Kimball 在 CRM 章节 (6) 下有一个名为“大型变化客户维度”的部分。它展示了如何创建“最小维度”。经常变化或经常分析的属性(列)被分解成单独的小维度表。这些小维度通过事实表连接,因此事实表对这些表中的每一个都有一个 FK。
It seems to me that your example is close to this.
在我看来,您的示例与此很接近。
As a general rule, dimension table is a look-up table for objects which rarely change (people, accounts, time, products, stores) and fact table captures activity (history) of interactions between these objects. Fact table contains measures that you would want to aggregate (total sales, number of hours worked, number of parts produced, etc..).
作为一般规则,维度表是很少改变的对象(人员、帐户、时间、产品、商店)的查找表,事实表捕获这些对象之间交互的活动(历史)。事实表包含您想要聚合的度量(总销售额、工作小时数、生产的零件数量等)。
AFTER CLARIFICATION:
I would say that Voter is actually a conformed dimension -- common for all data marts (business processes). Other conformed dimensions would be: Date, Party, Elections, VotingStations. Mini-dimensions would be Demographic and GeoArea.
Fact tables would be: RegistrationEvent (who when and where registered) and ElectionEvent (who when and where voted in which election, using what).
Dimension Voter and fact RegistrationEvent are loaded from operational systems which capture voter registration and other changes.
This is simplified, but I hope it captures the basic idea.
澄清之后:
我会说 Voter 实际上是一个一致的维度——对于所有数据集市(业务流程)来说都是通用的。其他一致的维度是:日期、派对、选举、投票站。迷你维度将是人口统计和地理区域。事实表将是:RegistrationEvent(谁何时何地注册)和 ElectionEvent(谁何时何地在哪个选举中投票,使用什么)。
Dimension Voter 和 fact RegistrationEvent 从捕获选民注册和其他更改的操作系统加载。
这是简化的,但我希望它抓住了基本思想。