Windows: How can I quickly create large (>1GB) text+binary files with "natural" content? (C#)

Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/1037719/

How can I quickly create large (>1gb) text+binary files with "natural" content? (C#)

Tags: c#, .net, windows, testing, filesystems

Asked by Cheeso

For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.

  • The content of the files should be neither completely random nor uniform.
    A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
  • The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
  • I'd like to keep the number of files at a manageable level, let's say o(10).

For creating binary files, I can new a large buffer and do System.Random.NextBytes followed by FileStream.Write in a loop, like this:

// Assumptions: size = total bytes to write, sz = buffer size (e.g. 512k),
// Filename = output path, zeroes = emit all-zero content, _rnd = a System.Random instance.
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        // Write either a full buffer or just the remaining tail of the file.
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    // No explicit Close() is needed - the using block disposes the stream.
}

With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.

For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.

Neither of these is quite satisfactory for me.

I have seen the answers to "Quickly create large file on a windows system?". Those approaches are very fast, but I think they just fill the file with zeroes or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.

The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.

What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.

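A quick sketch of that replication idea, assuming you pick a seed file and a target size yourself (the method and parameter names below are illustrative, not from the original post):

using System;
using System.IO;

// Replicate the bytes of an existing seed file until the output reaches
// the target size; works the same for text and binary seeds.
static void ReplicateFile(string seedPath, string outputPath, long targetSize)
{
    byte[] seed = File.ReadAllBytes(seedPath);
    using (var output = new FileStream(outputPath, FileMode.Create, FileAccess.Write))
    {
        long written = 0;
        while (written < targetSize)
        {
            int count = (int)Math.Min(seed.Length, targetSize - written);
            output.Write(seed, 0, count);
            written += count;
        }
    }
}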

Currently I have an approach that sort of works but it takes too long to run.

Has anyone else solved this?

Is there a much faster way to write a text file than via StreamWriter?

Suggestions?

EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.

Accepted answer by Noldorin

I think you might be looking for something like a Markov chain process to generate this data. It's stochastic (randomised), but also structured, in that it operates based on a finite state machine.

Indeed, Markov chains have been used for generating semi-realistic looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see the Properties of Markov chains section of the page.) Hopefully you can see how to design one; to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth the effort, if you need these enormous amounts of test data.

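As a rough illustration of what this answer describes, a word-level chain in C# might look something like the following. This is a minimal sketch, not the implementation the answer refers to; the class name and the restart-on-dead-end behaviour are my own assumptions.

using System;
using System.Collections.Generic;
using System.Text;

// Word-level Markov chain: maps each word to the list of words that
// followed it in the training text, then walks that table at random.
public class MarkovTextGenerator
{
    private readonly Dictionary<string, List<string>> _transitions = new Dictionary<string, List<string>>();
    private readonly List<string> _startWords = new List<string>();
    private readonly Random _rnd = new Random();

    public void Train(string text)
    {
        string[] words = text.Split(new[] { ' ', '\r', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length - 1; i++)
        {
            List<string> followers;
            if (!_transitions.TryGetValue(words[i], out followers))
                _transitions[words[i]] = followers = new List<string>();
            followers.Add(words[i + 1]);
        }
        if (words.Length > 0) _startWords.Add(words[0]);
    }

    // Train() must be called at least once before Generate().
    public string Generate(int wordCount)
    {
        var sb = new StringBuilder();
        string current = _startWords[_rnd.Next(_startWords.Count)];
        for (int i = 0; i < wordCount; i++)
        {
            sb.Append(current).Append(' ');
            List<string> followers;
            if (!_transitions.TryGetValue(current, out followers) || followers.Count == 0)
                current = _startWords[_rnd.Next(_startWords.Count)];  // dead end: restart the walk
            else
                current = followers[_rnd.Next(followers.Count)];
        }
        return sb.ToString();
    }
}

Trained on a few megabytes of natural language or source code, repeated calls to Generate produce pseudo-real text whose word frequencies and short-range patterns resemble the training material.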

Answered by Sam Saffron

For text, you could use the Stack Overflow community dump; there are 300 megs of data there. It will only take about 6 minutes to load into a db with the app I wrote, and probably about the same time to dump all the posts to text files; that would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source and XML mixed in).

You could also use something like the Wikipedia dump; it seems to ship in MySQL format, which would make it super easy to work with.

If you are looking for a big file that you can split up, for binary purposes, you could either use a VM vmdk or a DVD ripped locally.

Edit

Mark mentions the Project Gutenberg download; this is also a really good source for text (and audio), available for download via BitTorrent.

Answered by Benjol

You could always code yourself a little web crawler...

UPDATE: Calm down guys, this would be a good answer, if he hadn't said that he already had a solution that "takes too long".

A quick check here would appear to indicate that downloading 8GB of anything would take a relatively long time.

Answered by Kirschstein

I think the Windows directory will probably be a good enough source for your needs. If you're after text, I would recurse through each of the directories looking for .txt files and loop through them copying them to your output file as many times as needed to get the right size file.

You could then use a similar approach for binary files by looking for .exes or .dlls.

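A rough sketch of that directory-walking idea (the method name, the ".txt" filter, and the size handling are illustrative assumptions):

using System.IO;

// Concatenate every .txt file found under a root directory, cycling
// through the list until the output file reaches the requested size.
static void FillFromDirectory(string root, string outputPath, long targetSize)
{
    string[] sources = Directory.GetFiles(root, "*.txt", SearchOption.AllDirectories);
    if (sources.Length == 0) return;
    using (var output = new FileStream(outputPath, FileMode.Create, FileAccess.Write))
    {
        long written = 0;
        while (written < targetSize)
        {
            foreach (string source in sources)
            {
                byte[] content = File.ReadAllBytes(source);
                output.Write(content, 0, content.Length);
                written += content.Length;
                if (written >= targetSize) break;
            }
        }
    }
}

Swapping the "*.txt" filter for "*.dll" would give the binary variant mentioned above.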

Answered by R Ubben

Wikipedia is excellent for compression testing of mixed text and binary. If you need benchmark comparisons, the Hutter Prize site can provide a high-water mark for the first 100 MB of Wikipedia. The current record is a 6.26 ratio, 16 MB.

Answered by Hyman Ryan

For text files you might have some success taking an English word list and simply pulling words from it at random. This won't produce real English text, but I would guess it would produce a letter frequency similar to what you might find in English.

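For instance, a minimal sketch of that word-list approach (the word list path, the space separator, and the size accounting are assumptions):

using System;
using System.IO;

// Pull words at random from a word list and stream them out until the
// file reaches roughly the requested size.
static void WriteRandomWords(string wordListPath, string outputPath, long targetSize)
{
    string[] words = File.ReadAllLines(wordListPath);
    var rnd = new Random();
    using (var writer = new StreamWriter(outputPath))
    {
        long written = 0;
        while (written < targetSize)
        {
            string word = words[rnd.Next(words.Length)];
            writer.Write(word);
            writer.Write(' ');
            written += word.Length + 1;   // rough byte count; ignores encoding overhead
        }
    }
}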

For a more structured approach you could use a Markov chain trained on some large free English text.

Answered by kemiller2002

Why don't you just take Lorem Ipsum and create a long string in memory before your output? The text should expand at a rate of O(log n) if you double the amount of text you have every time. You can even calculate the total length of the data beforehand, so you don't suffer from having to copy the contents to a new string/array.

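A small sketch of that doubling trick (the seed string and target length are placeholders of your choosing):

using System;
using System.Text;

// Double the text until it reaches the target length: only O(log n)
// append operations instead of thousands of tiny Lorem Ipsum writes.
static string ExpandByDoubling(string seed, int targetLength)
{
    var sb = new StringBuilder(seed, targetLength);
    while (sb.Length < targetLength)
    {
        int remaining = targetLength - sb.Length;
        // Append the whole current content, or just enough to hit the target exactly.
        sb.Append(sb.ToString(0, Math.Min(sb.Length, remaining)));
    }
    return sb.ToString();
}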

Since your buffer is only 512k or whatever you set it to be, you only need to generate that much data before writing it, since that is only the amount you can push to the file at one time. You are going to be writing the same text over and over again, so just use the original 512k that you created the first time.

Answered by Cheeso

Thanks for all the quick input. I decided to consider the problems of speed and "naturalness" separately. For the generation of natural-ish text, I have combined a couple ideas.

  • To generate text, I start with a few text files from the Project Gutenberg catalog, as suggested by Mark Rushakoff.
  • I randomly select and download one document of that subset.
  • I then apply a Markov Process, as suggested by Noldorin, using that downloaded text as input.
  • I wrote a new Markov Chain in C# using Pike's economical Perl implementation as an example. It generates a text one word at a time.
  • For efficiency, rather than use the pure Markov Chain to generate 1gb of text one word at a time, the code generates a random text of ~1mb and then repeatedly takes random segments of that and globs them together (see the sketch after this list).
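A sketch of that last globbing step, assuming markovText already holds roughly 1 MB of generated text (the slice sizes and the method name are my own choices):

using System;
using System.IO;

// Stitch together random slices of an already-generated ~1 MB Markov text
// until roughly totalBytes of output have been written.
static void WriteGlobbedText(string markovText, TextWriter writer, long totalBytes)
{
    var rnd = new Random();
    long written = 0;
    while (written < totalBytes)
    {
        int start = rnd.Next(markovText.Length / 2);   // random start in the first half
        int length = rnd.Next(1024, 64 * 1024);        // random 1k-64k slice
        length = Math.Min(length, markovText.Length - start);
        writer.Write(markovText.Substring(start, length));
        written += length;
    }
}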

UPDATE: As for the second problem, the speed - I took the approach of eliminating as much IO as possible, since this is being done on my poor laptop with a 5400rpm mini-spindle. Which led me to redefine the problem entirely - rather than generating a FILE with random content, what I really want is the random content. Using a Stream wrapped around a Markov Chain, I can generate text in memory and stream it to the compressor, eliminating 8g of write and 8g of read. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. So the streaming approach worked well to speed things up massively. It cut 80% of the time required.

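One way to express that is a read-only Stream whose Read() pulls bytes from the generator on demand, so the compressor never sees a file at all. A minimal sketch follows; MarkovTextGenerator is the hypothetical generator sketched earlier, and the chunking details are assumptions:

using System;
using System.IO;
using System.Text;

// Read-only stream that produces generated text on demand, so the
// compressor can consume gigabytes without anything touching disk.
public class GeneratedTextStream : Stream
{
    private readonly MarkovTextGenerator _generator;   // hypothetical generator from the sketch above
    private long _remaining;                            // bytes still to produce
    private byte[] _pending = new byte[0];              // generated but not yet consumed
    private int _pendingOffset;

    public GeneratedTextStream(MarkovTextGenerator generator, long totalBytes)
    {
        _generator = generator;
        _remaining = totalBytes;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_remaining <= 0) return 0;                  // end of stream
        if (_pendingOffset >= _pending.Length)
        {
            // Refill the pending buffer with a fresh chunk of generated text.
            _pending = Encoding.UTF8.GetBytes(_generator.Generate(1000));
            _pendingOffset = 0;
        }
        int n = Math.Min(count, _pending.Length - _pendingOffset);
        n = (int)Math.Min(n, _remaining);
        Array.Copy(_pending, _pendingOffset, buffer, offset, n);
        _pendingOffset += n;
        _remaining -= n;
        return n;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position { get { throw new NotSupportedException(); } set { throw new NotSupportedException(); } }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

Something along these lines can be handed directly to the compression library as its input stream, which is what eliminates the 8g of write and 8g of read described above.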

I haven't yet figured out how to do the binary generation, but it will likely be something analogous.

Thank you all, again, for all the helpful ideas.
