Original question: http://stackoverflow.com/questions/2712421/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
R and version control for the solo data analyst
Asked by Jeromy Anglim
Many data analysts that I respect use version control. For example:
- http://github.com/hadley/
- See comments on http://permut.wordpress.com/2010/04/21/revision-control-statistics-bleg/
However, I'm evaluating whether adopting a version control system such as git would be worthwhile.
A brief overview: I'm a social scientist who uses R to analyse data for research publications. I don't currently produce R packages. My R code for a project typically includes a few thousand lines of code for data input, cleaning, manipulation, analyses, and output generation. Publications are typically written using LaTeX.
With regard to version control, I have read about many benefits, yet they seem to be less relevant to the solo data analyst.
- Backup: I have a backup system already in place.
- Forking and rewinding: I've never felt the need to do this, but I can see how it could be useful (e.g., you are preparing multiple journal articles based on the same dataset; you are preparing a report that is updated monthly, etc.)
- Collaboration: Most of the time I am analysing data myself, thus I wouldn't get the collaboration benefits of version control.
There are also several potential costs involved with adopting version control:
- Time to evaluate and learn a version control system
- A possible increase in complexity over my current file management system
However, I still have the feeling that I'm missing something. General guides on version control seem to be addressed more towards computer scientists than data analysts.
Thus, specifically in relation to data analysts in circumstances similar to those listed above:
- Is version control worth the effort?
- What are the main pros and cons of adopting version control?
- What is a good strategy for getting started with version control for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?
Accepted answer by Sharpie
I feel the answer to your question is a resounding yes: the benefits of managing your files with a version control system far outweigh the costs of implementing such a system.
I will try to respond in detail to some of the points you raised:
- Backup: I have a backup system already in place.
Yes, and so do I. However, there are some questions to consider regarding the appropriateness of relying on a general purpose backup system to adequately track important and active files relating to your work. On the performance side:
- At what interval does your backup system take snapshots?
- How long does it take to build a snapshot?
- Does it have to image your entire hard drive when taking a snapshot, or could it be easily told to just back up two files that just received critical updates?
- Can your backup system show you, with pinpoint accuracy, what changed in your text files from one backup to the next?
And most importantly:
- How many locations are the backups saved in? Are they in the same physical location as your computer?
- How easy is it to restore a given version of a single file from your backup system?
For example, I have a Mac and use Time Machine to back up to another hard drive in my computer. Time Machine is great for recovering the odd file or restoring my system if things get messed up. However, it simply doesn't have what it takes to be trusted with my important work:
- When backing up, Time Machine has to image the whole hard drive, which takes a considerable amount of time. If I continue working, there is no guarantee that my file will be captured in the state it was in when I initiated the backup. I may also reach another point I would like to save before the first backup finishes.
- The hard drive to which my Time Machine backups are saved is located in my machine, which makes my data vulnerable to theft, fire and other disasters.
With a version control system like Git, I can initiate a backup of specific files with no more effort than requesting a save in a text editor, and the file is imaged and stored instantaneously. Furthermore, Git is distributed, so each computer that I work at has a full copy of the repository.
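To make the "backup of specific files" idea concrete, here is a minimal sketch of what that looks like on the command line. The file names, the commit message and the placeholder identity are all made up for illustration; only the git commands themselves are standard.

```shell
set -eu
# Throwaway repository so the sketch does not touch real work.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Solo Analyst"          # placeholder identity
git config user.email "analyst@example.com"

# Hypothetical project files.
printf 'x <- read.csv("raw.csv")\n' > clean_data.R
printf 'summary(x)\n' > analysis.R
printf 'scratch notes\n' > notes.txt         # deliberately left out of the snapshot

# Snapshot only the two files that just received important changes.
git add clean_data.R analysis.R
git commit -q -m "Clean data and add first summary"

git log --oneline     # shows the single commit just recorded
git status --short    # notes.txt is still listed as untracked
```

Nothing else on the disk is touched, and the commit completes as soon as the command returns. Copying the repository to another machine (for example with git clone or git push) is what gives the mirroring described here.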
This amounts to having my work mirrored across four different computers. Nothing short of an act of God could destroy my files and data, at which point I probably wouldn't care too much anyway.
- Forking and rewinding: I've never felt the need to do this, but I can see how it could be useful (e.g., you are preparing multiple journal articles based on the same dataset; you are preparing a report that is updated monthly, etc.)
As a soloist, I don't fork that much either. However, the time I have saved by having the option to rewind has single-handedly paid back my investment in learning a version control system many, many times over. You say you have never felt the need to do this, but has rewinding any file under your current backup system really been a painless, feasible option?
Sometimes the report just looked better 45 minutes, an hour or two days ago.
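For what it's worth, rewinding a single file in Git can look roughly like the sketch below. The file name and its contents are invented for the example, and HEAD~1 simply means "one commit before the latest"; in practice you would pick the commit from the log.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Solo Analyst"          # placeholder identity
git config user.email "analyst@example.com"

# Two versions of a hypothetical report source.
printf 'Draft 1: concise summary\n' > report.Rnw
git add report.Rnw
git commit -q -m "First draft of report"

printf 'Draft 2: rambling rewrite\n' > report.Rnw
git commit -q -am "Rewrite summary"

# Look at the history of this one file and at what changed between versions.
git log --oneline -- report.Rnw
git diff HEAD~1 HEAD -- report.Rnw

# Bring back the earlier version of just this file and record the rollback.
git checkout HEAD~1 -- report.Rnw
git commit -q -m "Restore first draft of report"

cat report.Rnw
```

The rewritten draft is not lost either: it stays in the history, so you can change your mind again later.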
- Collaboration: Most of the time I am analysing data myself, thus I wouldn't get the collaboration benefits of version control.
Yes, but you would learn a tool that may prove to be indispensable if you do end up collaborating with others on a project.
- Time to evaluate and learn a version control system
Don't worry too much about this. Version control systems are like programming languages: they have a few key concepts that need to be learned and the rest is just syntactic sugar. Basically, the first version control system you learn will require investing the most time; switching to another one just requires learning how the new system expresses those key concepts.
Pick a popular system and go for it!
- A possible increase in complexity over my current file management system
Do you have one folder, say Projects, that contains all the folders and files related to your data analysis activities? If so, then slapping version control on it is going to increase the complexity of your file system by exactly 0. If your projects are strewn about your computer, then you should centralize them before applying version control, and this will end up decreasing the complexity of managing your files. That's why we have a Documents folder after all.
- Is version control worth the effort?
Yes! It gives you a huge undo button and allows you to easily transfer work from machine to machine without worrying about things like losing your USB drive.
- What are the main pros and cons of adopting version control?
The only con I can think of is a slight increase in file size, but modern version control systems can do absolutely amazing things with compression and selective saving, so this is pretty much a moot point.
- What is a good strategy for getting started with version control for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?
Keep the files that generate data or reports under version control, but be selective. If you are using something like Sweave, store your .Rnw files and not the .tex files that get produced from them. Store raw data if it would be a pain to re-acquire. If possible, write and store a script that acquires your data and another that cleans or modifies it, rather than storing changes to raw data.
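One way to be selective is a .gitignore file, sketched below. The ignore patterns and file names are assumptions about a typical Sweave project rather than a recommendation: if you also keep hand-written .tex files or figures saved as PDF in the same folder, these patterns are too broad and would need narrowing.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Solo Analyst"          # placeholder identity
git config user.email "analyst@example.com"

# Hypothetical project contents.
printf '%s\n' '\documentclass{article}' > paper.Rnw      # source: keep
printf 'generated output\n' > paper.tex                  # produced from paper.Rnw: skip
printf 'id,score\n1,4.2\n' > raw_survey.csv              # painful to re-acquire: keep
printf 'raw <- read.csv("raw_survey.csv")\n' > clean.R   # cleaning script: keep

# Tell Git to leave derived files alone (patterns are illustrative).
cat > .gitignore <<'EOF'
*.tex
*.aux
*.log
*.pdf
EOF

git add .
git commit -q -m "Track sources and raw data, ignore generated files"

git ls-files    # lists what is tracked; paper.tex is not among the files
```

With the ignore file in place, a plain git add . picks up the sources and the raw data and skips anything that can be regenerated.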
As for learning a version control system, I highly recommend Git and this guide to it.
These websites also have some nice tips and tricks related to performing specific actions with Git:
Answer by Dan Menes
I worked for nine years in an analytics shop, and introduced the idea of version control for our analysis projects to that shop. I'm a big believer in version control, obviously. I would make the following points, however.
- Version control may not be appropriate if you are doing analysis for possible use in court. It doesn't sound like this applies to you, but it would have made our clients very nervous to know that every version of every script that we had ever produced was potentially discoverable. We used version control for code modules that were reused in multiple engagements, but did not use version control for engagement-specific code, for that reason.
- We found the biggest benefit to version control came from storing canned modules of code that were re-used across multiple projects. For example, you might have a particular favorite way of processing certain Census PUMS extracts. Organize this code into a directory and put it into your VCS. You can then check it out into each new project every time you need it. It may even be useful to create specific branches of certain code for certain projects, if you are doing special processing of a particular common dataset for that project. Then, when you are done with that project, decide how much of your special code to merge back to the main branch.
- Don't put processed data into version control. Only code. Our goal was always to have a complete set of scripts so that we could delete all of our internally processed data, push a button, and have every number for the report regenerated from scratch. That's the only way to be sure that you don't have old bugs living on mysteriously in your data.
- To make sure that your results are really completely reproducible, it isn't sufficient just to keep your code in a VCS. It is critical to keep careful track of which version of which modules were used to create any particular deliverable.
- As for software, I had good luck with Subversion. It is easy to set up and administer. I recognize the appeal of the new-fangled distributed VCSs, like git and mercurial, but I'm not sure there are any strong advantages if you are working by yourself. On the other hand, I don't know of any negatives to using them, either--I just haven't worked with them in an analysis environment.
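The branch-per-project idea from the list above can be sketched as follows. Note the swap: the answer describes a Subversion setup, while this sketch uses Git, and the file, function and branch names are hypothetical.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Solo Analyst"          # placeholder identity
git config user.email "analyst@example.com"
# The default branch is "master" or "main" depending on the Git setup.
main_branch=$(git symbolic-ref --short HEAD)

# Shared, reusable module on the main branch.
printf 'load_pums <- function(path) read.csv(path)\n' > pums_utils.R
git add pums_utils.R
git commit -q -m "Shared PUMS loading helper"

# Project-specific processing lives on its own branch.
git checkout -q -b project-a
printf 'recode_income <- function(x) x / 1000\n' >> pums_utils.R
git commit -q -am "Add income recoding needed for project A"

# When the project is done, bring the useful changes back.
git checkout -q "$main_branch"
git merge -q --no-edit project-a

git log --oneline
```

This merges the whole branch. If only part of the project-specific code deserves to go back, git cherry-pick on selected commits is one alternative to a full merge.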
Answer by Jeromy Anglim
For the sake of completeness, I thought I'd provide an update on my adoption of version control.
I have found version control for solo data analysis projects to be very useful.
I've adopted git as my main version control tool. I first started using EGit within Eclipse with StatET. Now I generally just use the command-line interface, although integration with RStudio is quite good.
I've blogged about my experience getting set up with version control from the perspective of data analysis projects.
As stated in the post, I've found adopting version control has had many secondary benefits in how I think about data analysis projects including clarifying:
- the distinction between source and derived files
- the nature of dependencies:
  - dependencies between elements of code
  - dependencies between files within a project
  - and dependencies with files and programs external to the repository
- the nature of a repository and how repositories should be divided
- the nature of committing and documenting changes and project milestones
Answer by Ana Nelson
I do economics research using R and LaTeX, and I always put my work under version control. It's like having unlimited undo. Try Bazaar: it's one of the simplest to learn and use, and if you're on Windows it has a graphical user interface (TortoiseBZR).
Yes, there are additional benefits to version control when working with others, but even on solo projects it makes a lot of sense.
Answer by Ken Williams
Right now, you probably think of your work as developing code that will do what you want it to do. After you adopt a revision control system, you'll think of your work as writing down your legacy in the repository, and making brilliant incremental changes to it. It feels way better.
Answer by duffymo
I would still recommend version control for a solo act like you because having a safety net to catch mistakes can be a great thing to have.
I've worked as a solo Java developer, and I still use source control. If I'm checking things in continuously I can't lose more than an hour's work if something goes wrong. I can experiment and refactor without worrying, because if it goes awry I can always roll back to my last working version.
If that's the case for you, I'd recommend using source control. It's not hard to learn.
Answer by dalloliogm
You have to use version control software, otherwise your analysis won't be perfectly reproducible.
If you want to publish your results somewhere, you should always be able to reconstruct the state of your scripts at the moment you produced those results. Let's say that one of the reviewers discovers an error in one of your scripts: how would you know which results are affected and which are not?
In this sense, a backup system is not sufficient: it probably runs only once per day, and it doesn't apply labels to the different backups, so you don't know which versions correspond to which results. Learning a VCS is also simpler than you think; knowing how to add a file and how to commit changes is already enough.
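As an illustration of the "labels" point, here is a rough sketch using Git tags. The answer does not name a particular tool, so Git is an assumption here, as are the tag name, the script and its contents.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Solo Analyst"          # placeholder identity
git config user.email "analyst@example.com"

printf 'fit <- lm(y ~ x, data = d)\n' > model.R
git add model.R
git commit -q -m "Model used for submitted manuscript"

# Label the exact state that produced the submitted results.
git tag -a submission-v1 -m "Scripts as submitted to the journal"

# Work continues after submission.
printf 'fit <- lm(y ~ x + age, data = d)\n' > model.R
git commit -q -am "Add age covariate after review"

# Which scripts changed since the labelled version?
git diff --name-only submission-v1 HEAD

# Print the script exactly as it was at submission time.
git show submission-v1:model.R
```

If a reviewer finds an error, the tag tells you which version of each script stood behind the submitted numbers, and the diff tells you what has moved since.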
Answer by Spacedman
Step back a bit first, and learn the advantages of writing R packages! You say you have projects with several thousand lines of code, yet these aren't structured or documented like package code is? You get big wins from conforming to the package ideals, including documentation for every function, tests for many of the usual hard-to-catch errors, the facility to write test suites of your own, and so on.
If you haven't got the discipline to produce a package, then I'm not sure you've got the discipline to do proper revision control.
Answer by Yin Zhu
Is version control worth the effort?
A big YES.
What are the main pros and cons of adopting version control?
Pros: you can track what you have done before. This is especially useful for LaTeX, as you may need an old paragraph that you deleted! When your computer crashes or you start working on a new one, you can get your data back right away.
Cons: you need to do some setup.
What is a good strategy for getting started with version control for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?
Just start using it. I use TortoiseSVN on Windows as a client tool, and my department has an SVN server. I put all my code and data there (yes, you put your data there too!).
Answer by PaulHurleyuk
I'd agree with the sentiments above and say that yes, version control is useful.
Advantages:
- It keeps your research recorded as well as backed up (tagging).
- It lets you try different ideas out and go back if they don't work (branching).
- You can share your work with other people, and they can share their changes to it with you (I know you didn't specify this, but it's great).
- Most version control systems make it easy to create a compressed bundle of all the files under control at a certain point, for instance at the point you submit an article for publication. This can help when others review your articles. (You can do this manually, but why make up these processes when version control just does it?)
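The compressed-bundle point from the list above might look like this with Git, which is the tool I use. The tag name, file names and output location are made up for the example.

```shell
set -eu
repo=$(mktemp -d)
out=$(mktemp -d)                              # where the bundle will be written
cd "$repo"
git init -q
git config user.name "Solo Analyst"          # placeholder identity
git config user.email "analyst@example.com"

printf 'analysis code\n' > analysis.R
printf 'paper source\n' > paper.Rnw
git add .
git commit -q -m "State at submission"
git tag submitted                             # hypothetical tag marking the submission

# Bundle every tracked file as of the tag into one compressed archive.
git archive --format=tar.gz --prefix=paper-submitted/ \
    -o "$out/paper-submitted.tar.gz" submitted

# List the archive contents.
tar -tzf "$out/paper-submitted.tar.gz"
```

Because the archive is built from the tag rather than from the working folder, later edits and untracked scratch files do not leak into what you send to reviewers.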
In terms of toolsets, I use Git, along with StatET and Eclipse, which works well, although you certainly don't have to use Eclipse. There are a few Git plugins for Eclipse, but I generally use the command-line options.