Recovering a corrupted zip or gzip file?

Date: 2020-03-05 18:52:27  Source: igfitidea

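The most common way to corrupt a compressed file is to transfer it inadvertently over FTP in ASCII mode, which causes a many-to-one mangling of CR and/or LF characters.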

Obviously this loses information, and the best fix is to transfer the file again in FTP binary mode.
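To make the failure concrete, here is a minimal Python sketch of one such translation (every CRLF pair collapsed to a bare LF) applied to a gzip stream. The helper names `ascii_mode_mangle` and `can_decompress` and the sample payload are made up for illustration. The payload is stored at compression level 0 only so that its CRLF pairs are guaranteed to appear verbatim in the stream; a real transfer may translate in a different direction.

```python
import gzip
import zlib


def ascii_mode_mangle(data: bytes) -> bytes:
    """Simulate one possible ASCII-mode translation: CRLF becomes LF."""
    return data.replace(b"\r\n", b"\n")


def can_decompress(blob: bytes) -> bool:
    """Return True if blob is a gzip stream that inflates and passes its checks."""
    try:
        gzip.decompress(blob)
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Hypothetical payload. compresslevel=0 writes "stored" deflate blocks, so the
# payload bytes (including every CRLF) are copied into the stream unchanged.
payload = b"line one\r\nline two\r\n" * 100
compressed = gzip.compress(payload, compresslevel=0, mtime=0)
mangled = ascii_mode_mangle(compressed)

print(len(compressed), len(mangled))
print(can_decompress(compressed), can_decompress(mangled))
```

Each collapsed pair removes one byte, so everything after the first damaged position shifts and the stream no longer matches its own length and CRC fields. In a normally compressed file the byte pairs that look like CRLF occur at unpredictable offsets, which is why the damage cannot simply be undone by a blanket reverse replacement.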

But if the original file has been lost and it matters, how recoverable is the data?

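[Actually, I already know what I think is the best answer (it is very difficult but sometimes possible, and I will post it later) as well as the common non-answers (the many off-the-shelf programs that repair the CRC without repairing the data). Still, I thought it would be interesting to try this question during the stackoverflow beta and see whether anyone else has gone down the path of a successful recovery or found tools I don't know about.]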

Solutions

Answer

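You could try writing a small script that replaces the CRs with CRLFs (assuming the corruption went from CRLF to CR), flipping them at random within each block until the CRC comes out right. Assuming the data isn't particularly large, I suppose that might not use all of your CPU until the heat death of the universe to finish.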

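Since information has definitely been lost, I don't know of a better way. Damage in the CR-to-CRLF direction might be somewhat easier to roll back.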
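A rough sketch of that idea in Python, under stated assumptions: the damage was CRLF collapsed to CR, the file is a single-member gzip stream, and the candidates are enumerated exhaustively rather than flipped at random. The function names and the `max_suspects` cap are invented for this example. The cost is 2**n decompression attempts for n CR bytes, which is exactly why the approach only works for very small files.

```python
import gzip
import itertools
import zlib
from typing import Iterator, Optional

CR, LF = 0x0D, 0x0A


def candidates(mangled: bytes, max_suspects: int = 20) -> Iterator[bytes]:
    """Yield every stream obtained by re-inserting LF after some subset of CRs."""
    suspects = [i for i, b in enumerate(mangled) if b == CR]
    if len(suspects) > max_suspects:
        raise ValueError(f"{len(suspects)} suspects: 2**n candidates is infeasible")
    for mask in itertools.product((False, True), repeat=len(suspects)):
        out = bytearray()
        prev = 0
        for pos, reinsert in zip(suspects, mask):
            out += mangled[prev:pos + 1]  # copy up to and including this CR
            if reinsert:
                out.append(LF)            # guess that this CR was once CRLF
            prev = pos + 1
        out += mangled[prev:]
        yield bytes(out)


def brute_force_repair(mangled: bytes) -> Optional[bytes]:
    """Return the first candidate that inflates and passes the gzip CRC-32."""
    for cand in candidates(mangled):
        try:
            return gzip.decompress(cand)
        except (OSError, EOFError, zlib.error):
            continue
    return None


# Tiny hypothetical example, stored at level 0 so the damage is predictable.
payload = b"a\r\nb\r\nc\rd"
good = gzip.compress(payload, compresslevel=0, mtime=0)
mangled = good.replace(b"\r\n", b"\r")
print(brute_force_repair(mangled) == payload)
```

A candidate that passes the CRC-32 is very likely, though not strictly guaranteed, to be the original. For a file with thousands of CR bytes the search space is far too large, so some way of rejecting wrong guesses early is needed.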

Answer

From Bukys Software:

Approximately 1 in 256 bytes is known
  to be corrupted, and the corruption is
  known to occur only in bytes with the
  value '\012'. So the byte error rate
  is 1/256 (0.39% of input), and 2/256
  bytes (0.78% of input) are suspect.
  But since only three bits per smashed
  byte are affected, the bit error rate
  is only 3/(256*8): 0.15% is bad, 0.29%
  is suspect.
  
  ...
  
  An error in the compressed input
  disrupts the decompression process for
  all subsequent bytes...The fact that
  the decompressed output is
  recognizably bad so quickly is cause
  for hope -- a search for the correct
  answer can identify wrong answers
  quickly.
  
  Ultimately, several techniques were
  combined to successfully extract
  reasonable data from these files:
  
  
  Domain-specific parsing of fields and quoted strings
  Machine learning from previous data with low probability of damage
  Tolerance for file damage due to other causes (e.g. disk full while logging)
  Lookahead for guiding the search along the highest-probability paths
  
  
  These techniques identify 75% of the
  necessary repairs with certainty, and
  the remainder are explored
  highest-probability-first, so that
  plausible reconstructions are
  identified immediately.
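The quote does not include code, so the following is only a sketch of the general shape of such a search, not Bukys Software's implementation. It assumes each damaged byte is an LF that was overwritten with a CR (a one-for-one substitution, consistent with the three-bit figure above), and it uses plain depth-first backtracking with a fixed guess order as a simpler stand-in for the highest-probability-first ordering. `search_repair`, `inflate`, and the domain rule `no_bare_cr` are hypothetical names.

```python
import gzip
import zlib
from typing import Callable, List, Optional, Tuple

CR, LF = 0x0D, 0x0A
GZIP_WBITS = 16 + zlib.MAX_WBITS  # tells zlib to expect a gzip wrapper


def inflate(prefix: bytes) -> Optional[Tuple[bytes, bool]]:
    """Inflate a possibly partial gzip stream.

    Returns (output_so_far, stream_finished), or None if zlib rejects it.
    """
    d = zlib.decompressobj(GZIP_WBITS)
    try:
        out = d.decompress(prefix)
    except zlib.error:
        return None
    return out, d.eof


def search_repair(mangled: bytes,
                  plausible: Callable[[bytes], bool]) -> Optional[bytes]:
    """Backtrack over every CR: keep it, or restore it to LF."""
    suspects = [i for i, b in enumerate(mangled) if b == CR]
    stack: List[List[int]] = [[]]  # partial lists of decisions
    while stack:
        choices = stack.pop()
        cand = bytearray(mangled)
        for pos, byte in zip(suspects, choices):
            cand[pos] = byte
        k = len(choices)
        # Lookahead: all bytes before the next undecided suspect are now fixed.
        end = suspects[k] if k < len(suspects) else len(cand)
        result = inflate(bytes(cand[:end]))
        if result is None or not plausible(result[0]):
            continue               # wrong guess detected early: prune
        if k == len(suspects):
            if result[1]:          # zlib has verified CRC-32 and length
                return result[0]
            continue
        stack.append(choices + [CR])  # tried second
        stack.append(choices + [LF])  # tried first
    return None


def no_bare_cr(text: bytes) -> bool:
    """Hypothetical domain rule: CR may appear only as part of CRLF."""
    if text.endswith(b"\r"):
        text = text[:-1]           # a trailing CR may still receive its LF
    return b"\r" not in text.replace(b"\r\n", b"")


# Tiny hypothetical example, stored at level 0 so the damage is predictable.
payload = b"alpha\nbeta\r\ngamma\n"
good = gzip.compress(payload, compresslevel=0, mtime=0)
mangled = good.replace(b"\n", b"\r")
print(search_repair(mangled, no_bare_cr) == payload)
```

In this toy case only the domain rule and the final CRC prune anything, because stored blocks never make the inflater fail midway. On a normally compressed stream a wrong guess usually derails the Huffman decoding within a short distance, which is what makes the pruning effective. A fuller version would replace the fixed LF-first order with a ranking learned from undamaged data, as the quote describes.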