Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/260058/

Date: 2020-08-28 22:39:53  Source: igfitidea

What's the best "file format" for saving complete web pages (images, etc.) in a single archive?

html, standards, webpage, archive

Asked by Marco

I'm working on a project which stores single images and text files in one place, like a time capsule. Now, most every project can be saved as one file, like DOC, PPT, and ODF. But complete web pages can't-- they're saved as a separate HTML file and data folder. I want to save a web page in a single archive, and while there are several solutions, there's no "standard". Which is the best format for HTML archives?

  • Microsoft has MHTML-- basically a file encoded exactly as a MIME HTML email message. It's already based on an existing standard, and MHTML on its own was proposed as RFC 2557. This is a great idea and it's been around forever, except it's been a "proposed standard" since 1999. Plus, implementations other than IE's are just cumbersome. IE and Opera support it; Firefox and Safari need a cumbersome extension.

  • Mozilla has Mozilla Archive Format-- basically a ZIP file with the markup and images, with metadata saved as RDF. It's an awesome idea -- Winamp does this for skins, and ODF and OOXML for their embedded images. I love this, except, 1. Nobody except Mozilla uses it, 2. The only extension supporting it hasn't been updated since Firefox 1.5.

  • Data URIs are becoming more popular. Instead of referencing an external location a la MHTML or MAF, you encode the file straight into the HTML markup as base64. Depending on your view, it's streamlined since the files are right where the markup is. However, support is still somewhat weak. Firefox, Opera, and Safari support it without gaffes; IE, the market leader, only started supporting it at IE8, and even then with limits.

  • Then of course, there's "Save complete webpage", where the HTML markup is saved as "savedpage.html" and the files in a separate "savedpage_files" folder. Afaik, everyone does this. It's well supported. But having to handle two separate elements is not simple and streamlined at all. My project needs to have them in a single archive.
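
To make the MHTML option a bit more concrete: since it is essentially a MIME message, something like it can be sketched with Python's standard `email` package. This is only a rough illustration, not how any browser actually writes its files; the function name `build_mhtml`, the example URLs, and the assumption that every image is a PNG are all made up for the sketch.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(html: str, images: dict[str, bytes], page_url: str) -> str:
    """Pack a page and its images into one multipart/related MIME message.

    Hypothetical helper: assumes all images are PNG and are keyed by the
    URL the markup uses to reference them.
    """
    archive = MIMEMultipart("related")

    # The root part is the HTML document itself.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Each resource becomes its own part, matched to the markup by URL.
    for url, data in images.items():
        part = MIMEImage(data, _subtype="png")  # base64-encoded by default
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_string()
```

Writing the returned string to a `.mht` file gives a single text file where each resource is a base64 part, which is also why editing it by hand is awkward.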

Keeping in mind browser support and ease of editing the page, what do you think's the best way to save web pages in a single archive? What would be best as a "standard"? Or should I just buckle down and deal with the HTML file and separate folder? For the sake of my project, I could support that, but I'd best avoid it.

Accepted answer by Treb

My favourite is the ZIP format. Because:

  • It is very well suited for the purpose
  • It is well documented
  • There are a lot of implementations available for creating or reading them
  • A user can easily extract single files, change them and put them back in the archive
  • Almost every major operating system (Windows, Mac and most Linux distributions) has a ZIP program built in

The alternatives all have some flaw:

  • With MHTML, you cannot easily edit the page.
  • With data URIs, I don't know how difficult the implementation would be. (With ZIP, even I could do it in PHP, 3 years ago...)
  • The option to store things as separate files just has far too many things that could go wrong and mess up your archive.

Answer by Joel Anair

PDFs are supported on nearly all browsers on nearly all platforms and store content and images in a single file. They can be edited with the right tools. This is almost definitely not ideal, but it's an option to consider.

Answer by Espinosa

It is not only a question of file format. Another crucial question is what exactly you want to store. Is it:

  1. to store the whole page as it is, with all referenced resources (images, CSS and JavaScript)?

  2. to capture the page as it was rendered at some point in time; a static image of some rendered state of the web page DOM?

Most current "save page as" functionality in browsers, be it to MAF, MHTML or file+dir, attempts the first way. This is an ultimately flawed approach.

Don't forget that web pages these days are more like local applications than static documents you can easily store. Potential issues:

  1. One page is in fact several pages built dynamically by JS; user interaction is needed to get it to the desired state.

  2. AJAX applications can communicate with a remote service, which renders the page unusable for offline viewing.

  3. Hidden links in JavaScript code. Such resources are then not part of the stored page. Even parsing the JS code may not discover them. You need to run the code.

  4. Even the position of basic HTML elements may be computed dynamically by JS, and it is not always possible or easy to recreate it locally.

  5. You would need some sort of JS memory dump, and to load it again, to get the page to the state you hoped to store.

And many many more issues...

Check out the Chrome SingleFile extension. It stores a web page in one HTML file with images inlined using the already mentioned data URIs. I haven't tested it much, so I cannot say how well it handles "volatile" AJAX pages.
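
For a feel of what inlining with data URIs involves, here is a minimal sketch in Python. It is not how SingleFile itself works; `to_data_uri` and `inline_images` are hypothetical names, and the regex only handles simple double-quoted `src` attributes that point at local files.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html: str, base_dir: Path) -> str:
    """Replace <img src="..."> references to local files with data URIs.

    Assumption: sources are paths relative to base_dir. Anything that is
    not an existing local file (e.g. a remote URL) is left untouched.
    """
    def replace(match: re.Match) -> str:
        path = base_dir / match.group(2)
        if not path.is_file():
            return match.group(0)
        mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        return match.group(1) + to_data_uri(path.read_bytes(), mime_type) + match.group(3)

    return re.sub(r'(<img\b[^>]*\bsrc=")([^"]+)(")', replace, html)
```

A real tool would also have to deal with CSS, scripts, fonts and resources loaded at runtime, which is exactly where the issues listed above come in.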

Answer by Shadow2531

Use a zip file.

You could always make a program/script that extracts the zip file to a temp directory and loads the index.html file in your browser. You could even use an index.ini/txt file to specify the file that should be loaded when extracting.
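
Such a script could look roughly like the sketch below. The names `entry_page` and `extract_archive` are made up, and so is the convention that `index.txt` holds nothing but the name of the page to load.

```python
import tempfile
import webbrowser
import zipfile
from pathlib import Path


def entry_page(archive: zipfile.ZipFile) -> str:
    """Return the page named in index.txt, falling back to index.html."""
    if "index.txt" in archive.namelist():
        return archive.read("index.txt").decode("utf-8").strip()
    return "index.html"


def extract_archive(zip_path: str) -> Path:
    """Extract the archive to a fresh temp directory; return the entry page path."""
    target = Path(tempfile.mkdtemp(prefix="webarchive-"))
    with zipfile.ZipFile(zip_path) as archive:
        page = entry_page(archive)
        archive.extractall(target)
    return target / page


def open_in_browser(zip_path: str) -> None:
    """Extract the archive and load its entry page in the default browser."""
    webbrowser.open(extract_archive(zip_path).as_uri())
```

Calling `open_in_browser("savedpage.zip")` would then unpack the archive and show the page; cleaning up the temp directory afterwards is left out of the sketch.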

Basically, you want something like the Mozilla Archive format, but without the unnecessary rdf crap just to specify what file to load.

MHT files are good, but they usually use base64 to embed files, which will make the file size bigger than it should be (data URIs are the same way). You can add attachments as binary, but you'll have to do that manually with a hex editor or create a tool, and client support for it might not be as good.
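
The size penalty is easy to check: base64 turns every 3 bytes into 4 characters, so the payload grows by about a third (a bit more once MIME line breaks are added). A quick sketch, with `base64_ratio` being a made-up helper name:

```python
import base64


def base64_ratio(raw_size: int) -> float:
    """Return encoded size divided by raw size for a payload of raw_size bytes."""
    raw = bytes(raw_size)  # zero-filled payload; the content does not affect the size
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


ratio = base64_ratio(3000)  # 3000 raw bytes become 4000 base64 characters
```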

Of course, if you want to use what browsers generate, MHT (Opera and IE at least) might be better.

Answer by Javier

I see no excuse to use anything other than a zip file.

Answer by Vinko Vrsalovic

Well, if browser support and ease of editing are the biggest concerns I think you are stuck with the file+directory approach unless you are willing to provide an editor for the single file format and live with not very good support in browsers.

You can create a single file by compressing the contents. You can also create a parent directory to ease handling.
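
As a rough sketch of that idea in Python: take the HTML file and its resource folder as a browser's "Save complete webpage" leaves them, and put both under one parent directory inside a ZIP. The function name `pack_saved_page` is made up, and the sketch assumes the folder sits next to the HTML file.

```python
import zipfile
from pathlib import Path


def pack_saved_page(html_file: Path, files_dir: Path, archive_path: Path) -> list[str]:
    """Zip a saved page and its resource folder under one parent directory.

    Assumption: files_dir is a sibling of html_file, so relative links in
    the markup keep working after extraction.
    """
    parent = html_file.stem  # e.g. "savedpage"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(html_file, f"{parent}/{html_file.name}")
        for path in sorted(files_dir.rglob("*")):
            if path.is_file():
                relative = path.relative_to(files_dir.parent).as_posix()
                archive.write(path, f"{parent}/{relative}")
        return archive.namelist()
```

Extracting the result gives a single "savedpage" directory containing both the HTML file and "savedpage_files", so nothing gets scattered next to other saved pages.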

Answer by Devon Carter

The problem is that HTML is bottom-up, not top-down. Look at your file name, which was saved on my box as "What's the best "file format" for saving complete web pages (images, etc.) in a single archive? - Stack Overflow.html"

Just add a '|' and one has trouble doing copy-and-paste backups to a spare drive. In the end you end up chopping the file name in order to save it. Dozens, perhaps hundreds, of identical index.html or index.php files are cluttering my drives.

A partial solution is to write your own CMS and use scripts to map all relevant files to a flat-file database, then use fileName, size, mtime and md5 to get a unique ID for each file. Create a flat-file index permitting 100k or 1000k records. The goal is to write once and use many times. So you need a real CMS, and you need a unique ID based on content (e.g. index8765432.html) that goes in your files_archive. Ditto for the others. Then you can non-destructively symlink from the saved original HTML to the files_archive and just recreate the file using a PHP or alternative script if need be. I don't know if it will work, as I'm at the same point you're at; maybe in a week I'll know for sure.

The more useful approach is to have a top-down structure based on your business or personal wants and related tasks. So your own files might be organized top-down, but external ones bottom-up to preserve the original content. My interest is in Web 3.0 services, and the closer you get to machine-to-machine interaction, the greater the need to structure the information. Maybe it's time to rethink the idea of bundling everything into a single file. You have hundreds of main.css files, so why bundle when a top-down solution might let you modify one file instead of hundreds?
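
The indexing step could be sketched like this in Python (rather than PHP). The field names and the rule of appending the first eight hex digits of the md5 to the file stem are assumptions made for the illustration, not a tested scheme.

```python
import hashlib
from pathlib import Path


def file_record(path: Path, digits: int = 8) -> dict:
    """Build one index record for a file: name, size, mtime, md5, archive name.

    Hypothetical layout: the archive name is the original stem plus a prefix
    of the content hash, so identical index.html files map to the same name
    and different ones no longer collide.
    """
    stat = path.stat()
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    return {
        "name": path.name,
        "size": stat.st_size,
        "mtime": int(stat.st_mtime),
        "md5": digest,
        "archive_name": f"{path.stem}{digest[:digits]}{path.suffix}",
    }
```

Each record could then be written as one line of the flat-file index, and the original location replaced by a symlink to files_archive/<archive_name>.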
