database 何时建立单独的报告数据库?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3331637/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
When to build a separate reporting database?
提问by Adrian K
We're building an application that has a database (yeah, pretty exciting huh :). The database is mainly transactional (to support the app) and also does a bit of "reporting" as part of the app - but nothing too strenuous.
我们正在构建一个具有数据库的应用程序(是的,非常令人兴奋,呵呵:)。数据库主要是事务性的(以支持应用程序),并且作为应用程序的一部分还做了一些“报告”——但没有什么太费力的。
Above and beyond that we have some reporting requirements - but they're pretty vague and high-level at the moment. We have a standard reporting tool that we-use in-house which we'll use to do the "heavier" reporting as the requirements solidify.
除此之外,我们还有一些报告要求——但目前这些要求非常模糊和高级。我们有一个内部使用的标准报告工具,随着需求的巩固,我们将使用它来进行“更重”的报告。
My question is: how do you know when a separate database for reporting is required?
我的问题是:您如何知道何时需要单独的数据库进行报告?
What sort of questions need to be asked? What sort of things would make you decide a separate reporting database was necessary?
需要问什么样的问题?什么样的事情会让你决定一个单独的报告数据库是必要的?
采纳答案by Rob
In general, the more mission critical the transactional app and the more sophisticated the reporting requirements, the more splitting makes sense.
一般来说,事务性应用程序的任务越关键,报告要求越复杂,拆分就越有意义。
- When transaction performance is critical.
- When it's hard to get a maintenance window on the transactional app.
- If reporting needs to correlate results not only from this app, but from other application silos.
- If the reports need to support trending or other types of reporting that are best suited for a star schema/Business Intelligence environment.
- If the reports are long running.
- If the transactional app is on an expensive hardware resource (cluster, mainframe, etc.)
- If you need to do data cleansing/extract-transform-load operations on the transactional data (e.g., state names to canonical state abbreviations).
- 当事务性能至关重要时。
- 当很难在事务性应用程序上获得维护窗口时。
- 如果报告不仅需要关联来自该应用程序的结果,还需要关联来自其他应用程序孤岛的结果。
- 如果报告需要支持最适合星型模式/商业智能环境的趋势或其他类型的报告。
- 如果报告长时间运行。
- 如果事务应用程序位于昂贵的硬件资源(集群、大型机等)上
- 如果您需要对事务数据进行数据清理/提取-转换-加载操作(例如,状态名称到规范状态缩写)。
It adds non-trivial complexity, so imo, there has to be a good reason to split.
它增加了非平凡的复杂性,所以 imo,必须有一个很好的理由来拆分。
回答by Cade Roux
Typically, I would try to report off the transactional database initially.
通常,我最初会尝试报告事务数据库。
Ensure that any indexes you add to facilitate efficient reporting are all frequently used. The more indexes you add, the poorer performance is going to be on inserts and (if you alter keys) updates.
确保您为促进高效报告而添加的任何索引都经常使用。添加的索引越多,插入和(如果更改键)更新的性能就越差。
When you do go to a reporting database, remember there are only a few reasons you are going there:
当你去报告数据库时,记住你去那里只有几个原因:
Ultimately, the number one thing about reporting databases is that you are removing locking contention from the OLTP database. So if your reporting database is a straight copy of the same database, you're simply using delayed snapshots which won't interfere with production transactions.
最终,关于报告数据库的第一件事是您正在从 OLTP 数据库中消除锁定争用。因此,如果您的报告数据库是同一数据库的直接副本,则您只需使用不会干扰生产事务的延迟快照。
Next, you can have a separate indexing strategy to support the reporting usage scenarios. These extra indexes are OK to maintain in the reporting database, but would cause unnecessary overhead in the OLTP database.
接下来,您可以有一个单独的索引策略来支持报告使用场景。这些额外的索引可以在报告数据库中维护,但会在 OLTP 数据库中造成不必要的开销。
Now both the above could be done on the same server (even the same instance in a separate database or even just in a separate schema) and still see benefits. When CPU and IO are completely pegged, at that point, you definitely need to have it on a completely separate box (or upgrade your single box).
现在,以上两项都可以在同一台服务器上完成(即使是在单独的数据库中的同一实例,甚至只是在单独的模式中),并且仍然可以看到好处。当 CPU 和 IO 完全挂钩时,您肯定需要将它放在一个完全独立的盒子上(或升级您的单个盒子)。
Finally, for ultimate reporting flexibility, you denormalize the data (usually into a dimensional model or star schemas) so that the reporting database is the same data in a different model. Reporting of large amounts of data (particularly aggregates) is extremely fast in dimensional models because the star schemas are very efficient for that. It also is efficient for a larger variety of queries without a lot of re-indexing or analysis to change indexes, because the dimensional model lends itself better to unforeseen usage patterns (the old "slice and dice every which way" request). You could view this is a kind of mini-data warehouse where you use data warehousing techniques, but aren't necessarily implementing a full-blown data warehouse. Also, star schemas are particular easy for users to get to grips with, and data dictionaries are much simpler and easier to build for BI tools or reporting tools from star schemas. You could do this on the same box or different box etc, just like discussed earlier.
最后,为了获得最终的报告灵活性,您对数据进行非规范化(通常为维度模型或星型模式),以便报告数据库是不同模型中的相同数据。在维度模型中报告大量数据(特别是聚合)非常快,因为星型模式对此非常有效。对于更广泛的查询,无需大量重新索引或分析来更改索引,它也是有效的,因为维度模型更适合不可预见的使用模式(旧的“切片和切块”请求)。您可以将其视为一种小型数据仓库,您可以在其中使用数据仓库技术,但不一定实施成熟的数据仓库。此外,星型模式特别容易让用户掌握,和数据字典更简单,更容易从星型模式为 BI 工具或报告工具构建。您可以在同一个盒子或不同的盒子等上执行此操作,就像之前讨论的那样。
回答by Misa J.
This question requires experience rather than science.
这个问题需要经验而不是科学。
As a BI architect, the approach I take on designing each BI solution for my clients are very different. I don't go through a checklist. It requires a general understanding of their system, their reporting requirements, budget and man power.
作为一名 BI 架构师,我为我的客户设计每个 BI 解决方案所采用的方法是非常不同的。我不检查清单。它需要对他们的系统、他们的报告要求、预算和人力有一个大致的了解。
I personally prefer to keep the reporting processes as much as possible on the database side (Best practice in BI world). REPORTING TOOLS ARE FOR DISPLAYING PURPOSE ONLY (MAXIMUM FOR SMALL CALCULATIONS). This approach requires a lot of pre-processing of data which requires different staging tables, triggers and etc.
我个人更喜欢将报告流程尽可能多地保留在数据库端(BI 世界中的最佳实践)。报告工具仅用于展示目的(小规模计算的最大值)。这种方法需要对数据进行大量预处理,这需要不同的暂存表、触发器等。
When you said:
当你说:
I work on projects with hundreds of millions of rows with real time reporting along with hundreds of users accessing the application/database at the same time with out issue.
我处理的项目有数亿行的实时报告以及数百名用户同时访问应用程序/数据库而没有问题。
There are a few things wrong with your statement.
你的陈述有一些错误。
Hundreds of millions of rows are A LOT. even today's in memory tools like Cognos TM1 or Qlikview would struggle to get such a results. (look at SAP HANA from SAP to understand how giants in the industry handle it).
If you have Hundreds of millions of rows in database, it doesn't necessarily mean that the report needs to go through all those records. maybe the report worked on 1000s not millions. probably that's what you saw.
Transactional reports are very different than dashboards. Most dashboard tools pre-processing and cache the data.
数以亿计的行很多。即使是今天的内存工具,如 Cognos TM1 或 Qlikview 也很难获得这样的结果。(查看 SAP 的 SAP HANA,了解行业巨头如何处理)。
如果数据库中有数亿行,并不一定意味着报表需要遍历所有这些记录。也许这份报告适用于 1000 多个而不是数百万。可能这就是你所看到的。
交易报告与仪表板非常不同。大多数仪表板工具预处理和缓存数据。
My point is that it all comes to experience for deciding when to:
我的观点是,决定何时:
- design a new schema
- create a semantic database
- work on the same transactional database
- or even use a reporting tool (Sometimes handwritten dashboards with Java/JSF/Ajax/jQuery or JSP would work fine for client)
- 设计新模式
- 创建语义数据库
- 在同一个事务数据库上工作
- 甚至使用报告工具(有时使用 Java/JSF/Ajax/jQuery 或 JSP 的手写仪表板对客户端来说可以正常工作)
回答by Corith Malin
The main reason you would need a separate database for reporting issues is when the generation of the reports interferes with the transactional responsibilities of the app. E.g. if a report takes 20 minutes to generate and utilizes 100% of the CPU/Disk/etc... during a time of high activity you might think of using a separate database for reporting.
您需要单独的数据库来报告问题的主要原因是报告的生成干扰了应用程序的事务职责。例如,如果报告需要 20 分钟来生成并使用 100% 的 CPU/磁盘/等......在高活动期间,您可能会考虑使用单独的数据库进行报告。
As for questions, here are some basic one:
至于问题,这里有一些基本的问题:
- Can I do the high intensity reports during non-peak hours?
- Does it interfere with the users using the system?
- If yes to #2, what are the costs of the interference Vs the cost of another database server, refactoring code, etc...?
- 我可以在非高峰时间做高强度报告吗?
- 它会干扰用户使用系统吗?
- 如果 #2 是肯定的,干扰的成本与另一个数据库服务器、重构代码等的成本是多少?
回答by DVK
Basically, when the database load from the app becomes incompatible with the database load for reporting. This could be due to:
基本上,当来自应用程序的数据库负载与用于报告的数据库负载不兼容时。这可能是由于:
Reporting consuming inordinate amount of database server resources impacting the app's DB performance.
A part of this category would be the app DB work having to wait on a majorly slow report query due to locking, though it might be possible to resolve with less drastic methods like locking tuning.
Reporting queries being very incompatible with app queries as far as tuning (e.g. indices but not limited to that) - the most dumb example would be something like a hot spot affecting app inserts because of the reporting-purpose index.
Timing issues. E.g. the only small windows for DB maintenance available (due to application usage) are the times of heavy reporting work
Reporting data's sheer volume (e.g. logging, auditing, statistics) is so big that your primary DB server architecture is a bad solution for such reporting (see Sybase ASE vs. Sybase IQ). BTW, this is a real scenario - we moved our performance reporting to IQ because of this.
报告消耗过多影响应用程序数据库性能的数据库服务器资源。
此类别的一部分将是应用程序数据库工作必须等待由于锁定而导致的主要缓慢的报告查询,尽管可能可以使用锁定调整等不太激烈的方法来解决。
报告查询与应用程序查询在调优方面非常不兼容(例如索引但不限于此) - 最愚蠢的例子可能是由于报告目的索引而影响应用程序插入的热点。
时间问题。例如,数据库维护可用的唯一小窗口(由于应用程序使用)是繁重的报告工作时间
报告数据的庞大数量(例如日志记录、审计、统计)是如此之大,以至于您的主数据库服务器架构对于此类报告来说是一个糟糕的解决方案(请参阅 Sybase ASE 与 Sybase IQ)。顺便说一句,这是一个真实的场景——因此我们将我们的绩效报告转移到了 IQ。
回答by Deleo
I would also add another reason for which you might use a reporting database, and that is: CQRS pattern (Command Query Responsibility Separation).
我还将添加您可能使用报告数据库的另一个原因,那就是:CQRS 模式(命令查询职责分离)。
If you have a large number of users accessing and writing to a small set of data, you would do wise to consider this pattern. It basicly, in its simplest form, means that all your commands (Create, Update, Delete) are pushed to the transactional database. All of your queries (Read) are from your reporting database. This lets you freely scopy your architecture and upgrade function.
如果您有大量用户访问和写入一小组数据,那么考虑这种模式是明智之举。它基本上以最简单的形式表示您的所有命令(创建、更新、删除)都被推送到事务数据库。您的所有查询(读取)都来自您的报告数据库。这让您可以自由地复制您的架构和升级功能。
There are MUCH more to it in the pattern, I just mentioned the bit which was interesting due to your question regarding reporting database.
模式中还有更多内容,我刚刚提到了由于您关于报告数据库的问题而有趣的一点。
回答by Fratt
I would also add that transactional databases are meant to hold current state and oftentimes do so to be self-maintaining. You don't want transactional databases growing beyond their necessary means. When a workflow or transaction is complete then move that data out and into a Reporting database, which is much better designed to hold historical data.
我还要补充一点,事务数据库旨在保存当前状态,并且通常这样做是为了自我维护。您不希望事务数据库增长超出其必要的手段。当工作流或事务完成时,然后将该数据移出并移入报告数据库,该数据库设计用于保存历史数据要好得多。