Original source: http://stackoverflow.com/questions/34723759/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Are there any good workarounds to the GitHub 100MB file size limit for text files?
Asked by josteinaj
I have a 190 MB plain text file that I want to track on GitHub.
The text file is a pronunciation lexicon file for our text-to-speech engine. We regularly add and modify lines in the text file, and the diffs are fairly small, so it's perfect for git in that sense.
However, GitHub has a strict 100 MB file size limit in place. I have tried the GitHub Large File Storage service, but that uploads a new version of the entire 190 MB file every time it changes - so that would quickly grow to many gigabytes if I go down that path.
I would like to keep the file as one file instead of splitting it, because that's how our workflow currently works, and it would require some coding to allow multiple text files as input/output in our tools (and we don't have many development resources).
One idea I've had is that maybe it's possible to set up some pre- and post-commit hooks to split and concatenate the big file automatically? Would that be possible?
Other ideas?
Edit: I am aware of the 100 MB file size limitation described in the similar questions here on StackOverflow, but I don't consider my question a duplicate because I'm asking about the specific case where the diffs are small and frequent (I'm not trying to upload a big ZIP file or anything). My understanding is that git-lfs is only appropriate for files that rarely change, and that normal git would be the perfect fit for the kind of file I'm describing, except that GitHub has a file size restriction.
Update: I spent yesterday experimenting with a small cross-platform program that uses git hooks to split the big file into smaller files and join them again. It kind of works, but it is not really satisfactory. The big text file has to be excluded by .gitignore, which leaves git unaware of whether or not it has changed. The split files are not initially detected by git status or git commit, which leads to the same issue as described in this SO question, which is quite annoying: Pre-commit script creates mysqldump file, but "nothing to commit (working directory clean)"? Setting up a cron job (Linux) and a scheduled task (Windows) to regenerate the split files regularly might fix that, but it's not easy to set up automatically, might cause performance issues on the user's computer, and is just not a very elegant solution. Some hacky solutions like dynamically modifying .gitignore might also be needed, and in no way would you get a diff of the actual text file, only of the split files (although that might be acceptable, as they would be very similar).
So, having slept on it, today I think the git hook approach is not a good option after all, as it has too many quirks. As has been suggested by @PyRulez, I think I'll have to look at services other than GitHub (unfortunately, since I love GitHub). A hosted solution would be preferable, to avoid having to manage our own server. I'd also like it to be publicly available...
Update 2: I've looked at some alternatives to GitHub and currently I'm leaning towards using GitLab. I've contacted GitHub support about the possibility of raising the 100MB limit, but if they won't do that I'll just switch to GitLab for this particular project.
Answered by PyRulez
Clean and Smudge
You can use clean and smudge filters to compress your file. Normally, this isn't necessary, since git will compress it internally, but since GitHub is acting weird, it may help. The main commands would be like:
git config filter.compress.clean gzip
git config filter.compress.smudge "gzip -d"
GitHub will see this as a compressed file, but on each computer, it will appear to be a text file.
See https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes for more details.
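As a rough sketch of how this filter could be wired up end to end: besides the two config commands, git also needs a .gitattributes entry telling it which files to route through the filter. The throwaway repository, the file name lexicon.txt, the sample lines, and the -n flag for gzip are my assumptions for illustration, not part of the answer. The -n flag keeps gzip from embedding a name or timestamp, so the same input is more likely to give the same output.

```shell
#!/bin/sh
set -eu

# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "example@example.com"

# Register the filter. The smudge command is quoted so that "-d" is stored
# as part of the value instead of being parsed as an option to git config.
git config filter.compress.clean "gzip -n"
git config filter.compress.smudge "gzip -d"

# Route the (hypothetical) lexicon file through the filter.
echo 'lexicon.txt filter=compress' > .gitattributes

# A tiny stand-in for the real 190 MB lexicon.
printf 'hello HH AH L OW\nworld W ER L D\n' > lexicon.txt

git add .gitattributes lexicon.txt
git commit -q -m "Add lexicon through the compress filter"

# git cat-file prints the blob exactly as stored, without running filters,
# so it has to be decompressed by hand to get the text back.
stored=$(git cat-file blob HEAD:lexicon.txt | gzip -d)
working=$(cat lexicon.txt)
```

One thing to keep in mind: gzip output tends to change throughout the file even after a small edit, so git probably cannot store compact deltas between compressed versions the way it does for plain text.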
Alternatively, you could have the clean filter post the file to an online pastebin, such as http://pastebin.com/, and have the smudge filter fetch it from there. Many other combinations are possible with clean and smudge.
Answered by CodeWizard
A very good solution will be to use:
It's an open-source tool designed to work with large files.
Answered by Mayuso
You can create a script/program in any language to divide or unite files.
Here is an example, written in Java, that divides a file (I used Java because I am more comfortable with it than with any other language, but any other language would work, and some would be better suited than Java).
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// The class name is arbitrary; the original snippet did not include one.
public class FileSplitter {

    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
        long numSplits = 10; // from user input, extract it from args
        long sourceSize = raf.length();
        long bytesPerSplit = sourceSize / numSplits;
        long remainingBytes = sourceSize % numSplits;

        int maxReadBufferSize = 8 * 1024; // 8 KB
        for (int destIx = 1; destIx <= numSplits; destIx++) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + destIx));
            if (bytesPerSplit > maxReadBufferSize) {
                // Copy the part in buffer-sized chunks, then the leftover bytes.
                long numReads = bytesPerSplit / maxReadBufferSize;
                long numRemainingRead = bytesPerSplit % maxReadBufferSize;
                for (long i = 0; i < numReads; i++) {
                    readWrite(raf, bw, maxReadBufferSize);
                }
                if (numRemainingRead > 0) {
                    readWrite(raf, bw, numRemainingRead);
                }
            } else {
                readWrite(raf, bw, bytesPerSplit);
            }
            bw.close();
        }
        // Bytes left over from the integer division go into one extra part.
        if (remainingBytes > 0) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + (numSplits + 1)));
            readWrite(raf, bw, remainingBytes);
            bw.close();
        }
        raf.close();
    }

    static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
        byte[] buf = new byte[(int) numBytes];
        // readFully fills the whole buffer; a plain read() may return fewer bytes.
        raf.readFully(buf);
        bw.write(buf);
    }
}
This will cost almost nothing (Time/Money).
Edit: You can create a Java executable and add it to your repository, or even easier, create a Python (or any other language) script to do this, and save it as plain text in your repository.
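For comparison, here is a minimal shell sketch of the same divide/unite idea, using the standard split and cat utilities. The file names (lexicon.txt, lexicon.part.*, rejoined.txt) and the 10-line part size are made up for illustration. Unlike the Java example, it splits on line boundaries rather than byte counts, so no line is cut in half.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the real lexicon: 25 numbered lines.
i=1
while [ "$i" -le 25 ]; do
    echo "entry $i"
    i=$((i + 1))
done > lexicon.txt

# Divide: parts of at most 10 lines each, named lexicon.part.aa, .ab, .ac
split -l 10 lexicon.txt lexicon.part.

# Unite: the glob expands in sorted order, which matches split's suffix order.
cat lexicon.part.* > rejoined.txt

# Count the parts that were produced.
parts=$(ls lexicon.part.* | wc -l | tr -d ' ')
```

Because every part holds a fixed number of lines, inserting a line near the top of the big file shifts content across every later part, so the per-part diffs can end up larger than the diff of the original file would have been.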