Notice: This page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1084506/

Date: 2020-09-06 12:36:15  Source: igfitidea

How to store articles or other large texts in a database

xml database

Asked by Etzeitet

I am currently designing a database-driven website for myself. The main reason is for learning purposes but, I won't lie, there is a small amount of vanity included!

While I believe that my database design is pretty good so far, I am still not entirely sure of the best way to store articles or other large texts. I know most DBMSs have a TEXT datatype or equivalent that can hold a massive amount of text. However, storing a full article as one long string makes for unhappy reading, so formatting is going to be needed.

Do I store the article text along with all of the HTML or BBcode tags, or is it better to simply create the page as an HTML or XML document and store the path to this file in the DB?

I quite like the idea of storing articles as XML documents, as I can easily mark up an article with custom tags and use PHP's XML and XSLT functions to transform the XML to HTML [or indeed, any other format]. It also allows the author to dictate where line and page breaks go. This approach would of course require extra coding [which I am not afraid of], but it does present a problem with making articles searchable.
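
To make the idea of custom tags concrete, here is a minimal sketch of turning article markup into HTML. It uses Python's standard xml.etree.ElementTree with a simple tag mapping as a stand-in for the PHP XSLT functions mentioned in the question; the tag names (article, title, para, pagebreak) and the mapping are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom markup; these tag names are assumptions for the sketch.
ARTICLE_XML = """<article>
  <title>Storing large texts</title>
  <para>First paragraph.</para>
  <pagebreak/>
  <para>Second paragraph.</para>
</article>"""

# Map each custom tag to the HTML element it should become.
TAG_MAP = {"article": "div", "title": "h1", "para": "p", "pagebreak": "hr"}


def to_html(xml_text):
    """Rename custom tags to HTML tags and serialize the tree as HTML."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Unknown tags fall back to a neutral inline element.
        element.tag = TAG_MAP.get(element.tag, "span")
    return ET.tostring(root, encoding="unicode", method="html")


html = to_html(ARTICLE_XML)
print(html)
```

A real XSLT stylesheet can do far more (reordering, conditional output, multiple target formats); the sketch only shows that the stored form and the displayed form can be kept separate.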

I know MySQL, for example, has SQL syntax for searching for specific terms/phrases inside strings held in a text field. If I were to store text in separate files, how might I approach making these articles searchable?

There is quite a lot I have written here on such a simple question, so I will break it down:

1: Is there a "best" way of storing large amounts of formatted text directly in a database or
2: is it better to hold paths to that text in the form of HTML/XML/Whatever documents.

If 2, is there an elegant way of making that text searchable?

Thank you for your time :)

Accepted answer by Byron Whitlock

Store everything in one big text field, as Alex suggested. For searching, don't hammer your database; use Lucene or htdig to create an index of your output. This way searches are very fast. A side effect is that your pages become a little more search-engine friendly: you take your keywords field (as backslash suggested) and put its contents in the meta keywords tag.

Edit

Unless you are only searching keywords, having the db do the searches will be horribly slow (ever searched a forum and had it take FOREVER?). The database cannot use an ordinary index for a query like

  select.. where FULLTEXTFIELD like '%cookies%'.  

It is frustrating to look for an article and have the search fail to return the results you are looking for because they weren't in the keyword field! htdig lets you search the full text of the article efficiently. Your searches will come back instantly, and EVERY term in the article is fully searchable. Putting the keywords in the meta tags will make matches on those terms rank higher on the results page.
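
To illustrate why a dedicated index is fast, here is a toy inverted index in Python. It is only a sketch of the general technique that tools such as Lucene and htdig build on, not their actual implementation; the sample articles and function names are made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical articles keyed by id.
ARTICLES = {
    1: "How to bake cookies at home",
    2: "Storing large texts in a database",
    3: "Cookies and sessions in web applications",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(articles):
    """Map every word to the set of article ids that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in tokenize(text):
            index[word].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(ARTICLES)
print(search(index, "cookies"))
print(search(index, "cookies sessions"))
```

Each lookup is a dictionary access per query word rather than a scan of every article body, which is the essential difference from a LIKE '%...%' query.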

Another benefit is fuzzy matching. If you search for 'activate', htdig will match pages that contain active, activation, activity, etc. (configurable). If the user misspells a word, it will still be matched. You want your users to have a Google-like experience, not an annoying one. :)
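
As a rough sketch of what fuzzy matching means in practice, the snippet below uses Python's standard difflib to match a query against a small vocabulary by string similarity. This is a stand-in technique, not the algorithm htdig uses, and the word list and the 0.6 cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical vocabulary of words found in the indexed pages.
INDEXED_WORDS = ["active", "activation", "activity", "cookies", "database"]


def fuzzy_lookup(term, vocabulary, cutoff=0.6):
    """Return vocabulary words whose similarity to term is at least cutoff."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=5, cutoff=cutoff)


# Related word forms are found even though "activate" itself is not indexed.
related = fuzzy_lookup("activate", INDEXED_WORDS)

# A misspelled query still finds the intended word.
misspelled = fuzzy_lookup("databse", INDEXED_WORDS)

print(related)
print(misspelled)
```

Raising the cutoff makes matching stricter; lowering it returns more, and less relevant, candidates.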

You do need a script to create a list of links to all your pages from your database. Have htdig crawl this automatically and you never have to think about it again.
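
A script of this kind could look roughly like the following sketch. It uses Python with an in-memory SQLite database as a stand-in for the real one; the articles table, its columns, and the /article.php?id= URL pattern are all hypothetical.

```python
import sqlite3
from html import escape

# In-memory stand-in for the site's database; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO articles (id, title) VALUES (?, ?)",
    [(1, "Baking cookies"), (2, "Tags & markup")],
)


def build_link_page(connection, base_url="/article.php?id="):
    """Return an HTML page with one link per article for the crawler to follow."""
    rows = connection.execute("SELECT id, title FROM articles ORDER BY id")
    items = [
        f'<li><a href="{escape(base_url)}{article_id}">{escape(title)}</a></li>'
        for article_id, title in rows
    ]
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"


page = build_link_page(conn)
print(page)
```

Pointing the crawler at the generated page gives it a path to every database-backed article, including ones that nothing else links to yet.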

htdig will also crawl your non-database pages, so your whole site is searchable through the same simple interface.

As for the keyword field, you should have a separate table called keywords with the id of the article and a keyword field (one keyword per row). But for simplicity, having a single field in the db isn't a terrible idea; it makes updating the keywords pretty easy if you put it in a form.
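
The separate keywords table might be laid out as in this sketch, which uses Python's sqlite3 module as a stand-in for whatever DBMS the site runs on. Column names, the index name, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    -- One row per (article, keyword) pair.
    CREATE TABLE keywords (
        article_id INTEGER NOT NULL REFERENCES articles(id),
        keyword    TEXT NOT NULL,
        PRIMARY KEY (article_id, keyword)
    );
    -- Lets lookups by keyword use an index instead of scanning the table.
    CREATE INDEX idx_keywords_keyword ON keywords (keyword);
    """
)
conn.executemany(
    "INSERT INTO articles (id, title) VALUES (?, ?)",
    [(1, "Storing XML articles"), (2, "Searching with htdig")],
)
conn.executemany(
    "INSERT INTO keywords (article_id, keyword) VALUES (?, ?)",
    [(1, "xml"), (1, "database"), (2, "search"), (2, "database")],
)


def articles_for_keyword(connection, keyword):
    """Return the titles of all articles tagged with the given keyword."""
    rows = connection.execute(
        "SELECT a.title FROM articles AS a "
        "JOIN keywords AS k ON k.article_id = a.id "
        "WHERE k.keyword = ? ORDER BY a.id",
        (keyword,),
    )
    return [row[0] for row in rows]


print(articles_for_keyword(conn, "database"))
```

Because the lookup is an exact match on an indexed column, it stays fast as the number of articles grows, unlike a LIKE search over a single comma-separated field.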

If you don't want to deal with all of that hassle, you can try Google Custom Search. It is far less work, but you have no guarantee that all your pages will get indexed.

Good luck!

Answer by backslash17

The TEXT, BIGTEXT, LONGTEXT and other such data types were created to store large amounts of text (64 KB to 4 GB, depending on the RDBMS). They just create a binary pointer to locate the text in the database; the text is not stored directly in the table. It is almost the same procedure as storing a path in a varchar field to locate the document, but keeping the text in the database makes it easier to maintain: if you delete the row, the document disappears with it, with no need to delete it in a separate step (as you would if it were stored as a file). Naturally this makes your database bigger and sometimes harder to back up and transport, but transporting the documents one by one would be tedious and slow.

As you can see, it depends on the number of documents and rows in the database.

For searching, I recommend creating a new "keywords" field to speed up your searches. You can also search within the first n characters of the documents, casting them to CHAR or VARCHAR, and find the title and subtitle within that range if they don't already have a dedicated field.
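
Searching only the first n characters could be sketched as follows, using SQLite's substr() through Python's sqlite3 module as a stand-in for the CHAR/VARCHAR cast described above. The table, the sample texts, and the default of 40 characters are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

filler = "Lorem ipsum dolor sit amet. " * 50
conn.executemany(
    "INSERT INTO articles (id, body) VALUES (?, ?)",
    [
        (1, "Cookies: a baking guide\n" + filler),
        (2, "Database design notes\n" + filler + "cookies are mentioned only here"),
    ],
)


def search_prefix(connection, term, n=40):
    """Match the term only against the first n characters of each body."""
    rows = connection.execute(
        "SELECT id FROM articles WHERE substr(body, 1, ?) LIKE ? ORDER BY id",
        (n, f"%{term}%"),
    )
    return [row[0] for row in rows]


def search_full(connection, term):
    """Match the term against the whole body, for comparison."""
    rows = connection.execute(
        "SELECT id FROM articles WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


print(search_prefix(conn, "cookies"))
print(search_full(conn, "cookies"))
```

The prefix search finds only the article whose title line mentions the term, which is the trade-off of this approach: less text to compare per row, but matches deeper in the article are missed.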

Answer by Alex Martelli

Depending on how you have arranged and installed everything, it can be hard to access outside files from remote clients that can access the DB just fine -- so why not save all of the XML into one TEXT field instead? You can refactor things to optimize that later if the DB engine can't handle that load well, but that's the easiest way to get started.
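
A minimal sketch of that starting point, assuming an articles table with a single TEXT column for the XML (the table and column names are hypothetical, and SQLite via Python's sqlite3 stands in for the real DB engine):

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body_xml TEXT NOT NULL)")

# The whole document is stored as one string in one TEXT field.
article_xml = "<article><title>Hello</title><para>Stored as one string.</para></article>"
conn.execute("INSERT INTO articles (id, body_xml) VALUES (?, ?)", (1, article_xml))

# Any client that can reach the database can fetch and parse the document.
(stored,) = conn.execute(
    "SELECT body_xml FROM articles WHERE id = ?", (1,)
).fetchone()
root = ET.fromstring(stored)
title = root.findtext("title")
print(title)
```

Nothing outside the database has to be reachable or kept in sync, which is the main convenience of this layout.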

Answer by Alex Martelli

Take a quick look at native XML databases. There are several, and some very good ones are free.

Search for eXist, Documentum xDB, and Oracle Berkeley DB XML.

If you are persisting, querying and updating semi-structured text, and if the structure has any depth at all, you are almost certainly doing it the hard way if you stick with either the RDB-of-pointers or the stuff-it-in-a-blob technique, though there are many external reasons why these architectures can be necessary and successful.

Do a little reading on XPath and XQuery before you commit to a design. Here's a good place to start: https://community.emc.com/community/edn/xmltech
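
For a first taste of XPath-style queries, here is a small sketch using the limited XPath subset supported by Python's standard xml.etree.ElementTree. A native XML database would offer full XPath and XQuery; the document structure below is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical article with nested structure.
DOC = """<article>
  <section id="intro">
    <title>Introduction</title>
    <para>Why store text?</para>
  </section>
  <section id="options">
    <title>Options</title>
    <para>TEXT columns.</para>
    <para>Files on disk.</para>
  </section>
</article>"""

root = ET.fromstring(DOC)

# Titles of all top-level sections.
section_titles = [t.text for t in root.findall("./section/title")]

# Paragraphs of the section whose id attribute is "options".
option_paras = [p.text for p in root.findall(".//section[@id='options']/para")]

print(section_titles)
print(option_paras)
```

Queries like these address parts of a document by structure, which is awkward to express when the same text sits in a relational column as an opaque string.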
