Original source: http://stackoverflow.com/questions/4258593/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Using Solr search index as a database - is this "wrong"?
Asked by Michael Moussa
My team is working with a third party CMS that uses Solr as a search index. I've noticed that it seems like the authors are using Solr as a database of sorts in that each document returned contains two fields:
- The Solr document ID (basically a classname and database id)
- An XML representation of the entire object
So basically it runs a search against Solr, downloads the XML representation of the object, and then instantiates the object from the XML rather than looking it up in the database using the id.
My gut feeling tells me this is a bad practice. Solr is a search index, not a database... so it makes more sense to me to execute our complex searches against Solr, get the document ids, and then pull the corresponding rows out of the database.
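The alternative described above (run the search against Solr, keep only the ids, then load the rows from the database) could look roughly like the sketch below. `SearchIndex` and `ArticleRepository` are hypothetical interfaces invented for this illustration, not part of Solr or of the CMS in question; in-memory stand-ins are used so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoStepLookup {
    /** Hypothetical stand-in for the search index: returns only matching ids. */
    interface SearchIndex {
        List<Long> searchIds(String query);
    }

    /** Hypothetical stand-in for the database: loads the authoritative row by id. */
    interface ArticleRepository {
        String findTitleById(long id);
    }

    /** Step 1: ask the index for ids. Step 2: load each row from the database. */
    static List<String> search(SearchIndex index, ArticleRepository repo, String query) {
        List<String> results = new ArrayList<>();
        for (long id : index.searchIds(query)) {
            String title = repo.findTitleById(id);
            if (title != null) { // skip ids the index still holds but the database no longer has
                results.add(title);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<Long, String> rows = new HashMap<>();
        rows.put(1L, "Intro to Solr");
        rows.put(2L, "Solr vs RDBMS");

        SearchIndex index = q -> Arrays.asList(2L, 1L, 99L); // 99 simulates a stale index entry
        ArticleRepository repo = id -> rows.get(id);

        System.out.println(search(index, repo, "solr")); // prints [Solr vs RDBMS, Intro to Solr]
    }
}
```

The trade-off is one extra database round trip per search in exchange for the database remaining the single source of truth.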
Is the current implementation perfectly sound, or is there data to support the idea that this is ripe for refactoring?
EDIT: When I say "XML representation", I mean one stored field that contains an XML string of all of the object's properties, not multiple stored fields.
Accepted answer by jayunit100
Yes, you can use SOLR as a database, but there are some really serious caveats:
- SOLR's most common access pattern, which is over HTTP, doesn't respond particularly well to batch querying. Furthermore, SOLR does NOT stream data, so you can't lazily iterate through millions of records at a time. This means you have to be very thoughtful when you design large-scale data access patterns with SOLR.
- Although SOLR performance scales horizontally (more machines, more cores, etc.) as well as vertically (more RAM, better machines, etc.), its querying capabilities are severely limited compared to those of a mature RDBMS. That said, there are some excellent functions, like the field stats queries, which are quite convenient.
- Developers who are used to relational databases will often run into problems when they apply the same DAO design patterns in a SOLR paradigm, because of the way SOLR uses filters in queries. There will be a learning curve in developing the right approach to building an application that uses SOLR for part of its large queries or stateful modifications.
- The "enterprisey" tools for advanced session management and stateful entities that many advanced web frameworks (Ruby, Hibernate, ...) offer will have to be thrown completely out the window.
- Relational databases are meant to deal with complex data and relationships, and they are thus accompanied by state-of-the-art metrics and automated analysis tools. In SOLR, I've found myself writing such tools and manually stress-testing a lot, which can be a time sink.
- Joining: this is the big killer. Relational databases support methods for building and optimizing views and queries that join tuples based on simple predicates. In SOLR, there aren't any robust methods for joining data across indices.
- Resiliency: For high availability, SolrCloud uses a distributed file system underneath (i.e. HCFS). This model is quite different from that of a relational database, which usually achieves resiliency using slaves and masters, or RAID, and so on. So you have to be ready to provide the resiliency infrastructure SOLR requires if you want it to be cloud-scalable and resistant.
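To make the first caveat concrete: because each HTTP request returns one bounded page, walking a large result set means issuing one request per page. The sketch below only builds the request URLs, using Solr's standard `q`, `start`, `rows` and `wt` select parameters. The host, port and the core name `articles` are assumptions for illustration, and nothing is actually sent over the network.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PagedSolrQuery {
    /** Builds one select URL per page of results. */
    static List<String> pageUrls(String baseUrl, String query, int totalHits, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        List<String> urls = new ArrayList<>();
        // Each page needs its own request; start is the offset of the first row on that page.
        for (int start = 0; start < totalHits; start += pageSize) {
            urls.add(baseUrl + "/select?q=" + q
                    + "&start=" + start + "&rows=" + pageSize + "&wt=json");
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed local Solr instance with a core named "articles".
        String base = "http://localhost:8983/solr/articles";
        for (String url : pageUrls(base, "*:*", 250, 100)) {
            System.out.println(url);
        }
    }
}
```

With 250 hits and a page size of 100 this yields three requests (offsets 0, 100 and 200). Large `start` offsets tend to get more expensive, which is part of why bulk access needs to be designed deliberately.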
That said, there are plenty of obvious advantages to SOLR for certain tasks (see http://wiki.apache.org/solr/WhyUseSolr): loose queries are much easier to run and return meaningful results. Indexing is done by default, so most arbitrary queries run pretty efficiently (unlike an RDBMS, where you often have to optimize and de-normalize after the fact).
Conclusion: Even though you CAN use SOLR as an RDBMS, you may find (as I have) that there is ultimately "no free lunch": the benefits of super-cool Lucene text searches and high-performance, in-memory indexing are often paid for with less flexibility and the adoption of new data access workflows.
Answer by Joelio
This was probably done for performance reasons; if it doesn't cause any problems, I would leave it alone. There is a big grey area between what should be in a traditional database and what should be in a Solr index. I've seen people do similar things to this (usually key-value pairs or JSON instead of XML) for UI presentation, and only get the real object from the database if needed for updates/deletes. But all reads just go to Solr.
Answer by Kent Murra
I've seen similar things done because it allows for very fast lookup. We're moving data out of our Lucene indexes into a fast key-value store to follow DRY principles and also decrease the size of the index. There's not a hard-and-fast rule for this sort of thing.
Answer by Mauricio Scheffer
It's perfectly reasonable to use Solr as a database, depending on your application. In fact, that's pretty much what guardian.co.uk is doing.
It's definitely not bad practice per se. It's only bad if you use it the wrong way, just like any other tool at any level, even GOTOs.
When you say "An XML representation..." I assume you're talking about having multiple stored Solr fields and retrieving them using Solr's XML format, and not just one big XML-content field (which would be a terrible use of Solr). The fact that Solr uses XML as its default response format is largely irrelevant; you can also use a binary protocol, so it's quite comparable to traditional relational databases in that regard.
Ultimately, it's up to your application's needs. Solr is primarily a text search engine, but can also act as a NoSQL database for many applications.
Answer by codechefvaibhavkashyap
Adding to @jayunit100's response: using Solr as a database, you get availability and partition tolerance at the cost of some consistency. There is going to be a configurable lag between when you write something and when you can read it back.
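The write-then-read lag can be pictured with a toy model. The class below is not Solr code and uses no Solr API; it is a minimal in-memory sketch, assuming that added documents sit in a pending buffer and only become readable once a commit happens. In Solr itself the length of that window depends on commit settings such as `commitWithin` or auto soft commits.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class CommitLagModel {
    private final Map<String, String> pending = new HashMap<>(); // written, not yet readable
    private final Map<String, String> visible = new HashMap<>(); // what readers currently see

    /** Accepts a write; it is not visible to readers until commit() runs. */
    void add(String id, String doc) {
        pending.put(id, doc);
    }

    /** Publishes all pending writes to readers. */
    void commit() {
        visible.putAll(pending);
        pending.clear();
    }

    /** Reads only see committed documents. */
    Optional<String> get(String id) {
        return Optional.ofNullable(visible.get(id));
    }

    public static void main(String[] args) {
        CommitLagModel index = new CommitLagModel();
        index.add("article-1", "<article>...</article>");
        System.out.println(index.get("article-1").isPresent()); // prints false
        index.commit();
        System.out.println(index.get("article-1").isPresent()); // prints true
    }
}
```

Any application code that writes a document and immediately reads it back has to tolerate the first case.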