Java vs Python on Hadoop

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1482282/

Date: 2020-08-12 13:18:43  Source: igfitidea

Tags: java, python, hadoop

Asked by jnoss

I am working on a project using Hadoop, which seems to support Java natively and to provide streaming support for Python. Is there a significant performance impact to choosing one over the other? I am early enough in the process that I can go either way if there is a significant performance difference.

Accepted answer by David Crawshaw

Java is less dynamic than Python and more effort has been put into its VM, making it a faster language. Python is also held back by its Global Interpreter Lock, meaning it cannot run the threads of a single process on different cores at the same time.

Whether this makes any significant difference depends on what you intend to do. I suspect both languages will work for you.

Answer by Bill K

With Python you'll probably develop faster, and with Java your code will definitely run faster.

Google "benchmarksgame" if you want to see some very accurate speed comparisons between all popular languages, but if I recall correctly Java comes out roughly 3-5x faster.

That said, few things are processor bound these days, so if you feel like you'd develop better with Python, have at it!

In reply to a comment (how can Java be faster than Python):

All languages are processed differently. Java is about the fastest after C and C++ (which can be anywhere from as fast as Java to 5x faster, but seem to average around 2x faster). The rest are 2-5+ times slower than Java. Python is one of the faster ones after Java. I'm guessing that C# is about as fast as Java or maybe faster, but the benchmarks game only had Mono (which was a tad slower) because they don't run it on Windows.

Most of these claims are based on the Computer Language Benchmarks Game, which tends to be pretty fair because advocates of and experts in each language tweak the test written in their specific language to ensure the code is well-targeted.

For example, the benchmarks game's Java vs C++ page shows all the tests, and you can see the speed ranges from about equal to Java being 3x slower (the first column is between 1 and 3), and Java uses much more memory!

Now the Java vs Python page (from the point of view of Python) shows speeds ranging from Python being 2x slower than Java to 174x slower. Python generally beats Java in code size and memory usage, though.

Another interesting point here: in tests that allocated a lot of memory, Java actually performed significantly better than Python in memory size as well. I'm pretty sure Java usually loses on memory because of the overhead of the VM, but once that is factored out, Java is probably more efficient than most (again, except the C's).

This is Python 3, by the way; the other Python platform tested (just called Python) fared much worse.

If you really want to know how it is faster: the VM is amazingly intelligent. It compiles to machine language AFTER running the code, so it knows what the most likely code paths are and optimizes for them. Memory allocation is an art, and really useful in an OO language. The VM can perform some amazing run-time optimizations which no non-VM language can do. It can run in a pretty small memory footprint when forced to, and Java is a language of choice for embedded devices along with C/C++.

I worked on a Signal Analyzer for Agilent (think expensive o-scope) where nearly the entire thing (aside from the sampling) was done in Java. This includes drawing the screen including the trace (AWT) and interacting with the controls.

Currently I'm working on a project for all future cable boxes. The Guide along with most other apps will be written in Java.

Why wouldn't it be faster than Python?

Answer by John Prior

You can write Hadoop MapReduce transformations either as "streaming" jobs or as a "custom jar". If you use streaming, you can write your code in any language you like, including Python or C++. Your code will just read from STDIN and write to STDOUT. However, on Hadoop versions before 0.21, Hadoop streaming only passed text, not binary, to your processes. Your files therefore needed to be text files, unless you did some funky encoding transformations yourself. It now appears that a patch has been added that allows the use of binary formats with Hadoop streaming.
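
As a rough illustration of the streaming model described above, here is a minimal word-count sketch in Python. The function names `map_lines` and `reduce_pairs` and the `map`/`reduce` command-line argument are hypothetical choices for this example. The only things it relies on from Hadoop streaming are that records arrive as lines on STDIN, that output goes to STDOUT as tab-separated key/value lines, and that the reducer sees its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_pairs(lines):
    """Reducer: sum the counts for each word.

    Assumes the lines are already sorted by key, which is what the
    framework's shuffle/sort phase provides between map and reduce.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: the same script acts as mapper or reducer,
    # depending on its first argument ("map" or "reduce").
    step = map_lines if sys.argv[1] == "map" else reduce_pairs
    for out in step(sys.stdin):
        print(out)
```

Assuming the script is saved as `wordcount.py` (a made-up name), it can be tried locally without a cluster by letting `sort` stand in for the shuffle phase: `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`.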

If you use a "custom jar" (i.e. you write your MapReduce code in Java or Scala using the Hadoop libraries), then you have access to functions that let your job read and write binary data (serialize in binary) and save the results to disk that way. Future runs that read this data will be much faster (depending on how much smaller your binary format is than your text format).
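
How much smaller a binary format is depends heavily on the data. The sketch below is only a back-of-the-envelope comparison: it measures the same made-up list of floating-point values stored as one decimal number per line and as packed 8-byte doubles. The helper names `text_size` and `binary_size` and the sample data are assumptions for illustration, not anything Hadoop itself provides.

```python
import struct


def text_size(values):
    """Bytes needed to store the values as one decimal number per line."""
    return sum(len(repr(v)) + 1 for v in values)  # +1 for the newline


def binary_size(values):
    """Bytes needed to store the values as packed little-endian doubles."""
    return len(struct.pack(f"<{len(values)}d", *values))


values = [i / 7 for i in range(1000)]  # made-up sample data
print("text bytes:  ", text_size(values))
print("binary bytes:", binary_size(values))
```

The saving is not guaranteed: a short value such as `1.0` takes 4 bytes as a text line but 8 bytes as a packed double, so it is worth measuring on your own data before assuming binary will be smaller.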

So if your Hadoop job is going to be I/O bound, then the "custom jar" approach will be faster (since Java is faster, as previous posters have shown, and reading from disk will also be faster).

But you have to ask yourself how valuable your time is. I find myself far more productive with Python, and writing map-reduce code that reads STDIN and writes to STDOUT is really straightforward. So I personally would recommend going the Python route, even if you have to figure the binary encoding stuff out yourself. Hadoop 0.21 handles non-UTF-8 byte arrays, and there is a binary (byte array) alternative for Python, typed bytes (http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/); that post shows the Python code being only about 25% slower than the "custom jar" Java code. Given that, I would definitely go the Python route.
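
If you are stuck with text-only streaming and do have to work out the binary encoding yourself, one simple approach is to base64-encode each packed record so that it survives a line-oriented, tab-separated stream. This is a swapped-in technique for illustration, not the typed bytes format from the linked post. The record layout (a 32-bit id plus a 64-bit score) and the function names are hypothetical.

```python
import base64
import struct

# Hypothetical record layout: little-endian int32 id followed by a float64 score.
RECORD = struct.Struct("<id")


def encode_record(key, rec_id, score):
    """Return one text line: key, a tab, then the base64 of the packed record."""
    payload = base64.b64encode(RECORD.pack(rec_id, score)).decode("ascii")
    return f"{key}\t{payload}"


def decode_record(line):
    """Inverse of encode_record: return (key, rec_id, score)."""
    key, payload = line.rstrip("\n").split("\t", 1)
    rec_id, score = RECORD.unpack(base64.b64decode(payload))
    return key, rec_id, score
```

The base64 alphabet contains no tabs or newlines, so the encoded payload cannot be mistaken for a field or record separator. The cost is that base64 inflates the payload by about a third, which gives back part of the size advantage of a binary format.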
