MapReduce implementation in Scala
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/962075/
Asked by Roman Kagan
I'd like to find a good, robust MapReduce framework that can be used from Scala.
Answered by Jorge Ortiz
To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
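What these wrappers let you write is essentially the classic mapper/shuffle/reducer shape in plain Scala. A minimal word-count sketch of that shape, using ordinary Scala collections in place of Hadoop's distributed machinery (the names here are illustrative, not the SMR or SHadoop API):

```scala
// Word count in the MapReduce shape: a mapper emits (key, value) pairs,
// the framework groups them by key, and a reducer folds each group.
object WordCount {
  // Mapper: one input line -> a (word, 1) pair per word.
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  // Reducer: fold all the counts emitted for one word.
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // The "framework": shuffle (groupBy key) between the map and reduce phases.
  def run(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(mapper)
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => reducer(word, pairs.map(_._2)) }
}
```

On a real cluster the wrapper's job is exactly to let you supply `mapper` and `reducer` like this while Hadoop handles splitting, shuffling, and fault tolerance.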
UPD 5 Oct 2011
There is also the Scoobi framework, which has awesome expressiveness.
Answered by bayer
http://hadoop.apache.org/ is language agnostic.
Answered by MattM
Personally, I've become a big fan of Spark.
You have the ability to do in-memory cluster computing, significantly reducing the overhead you would experience from disk-intensive mapreduce operations.
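The win shows up with multi-query or iterative workloads: a classic MapReduce job re-reads its input from disk every time, while Spark lets you cache a dataset in cluster memory once and run many operations against it. A rough single-machine sketch of that access pattern, using plain Scala collections (in real Spark you would build an RDD and call `.cache()` on it; the log lines and helper names here are made up for illustration):

```scala
object InMemoryQueries {
  // Stand-in for an expensive load + parse from HDFS.
  def load(): Seq[String] =
    Seq("ERROR disk full", "INFO ok", "ERROR timeout", "WARN slow")

  def run(): (Int, Int) = {
    // Loaded and filtered once, then kept in memory. The Spark analogue:
    //   val errors = sc.textFile(path).filter(_.startsWith("ERROR")).cache()
    val errors = load().filter(_.startsWith("ERROR"))

    // Subsequent queries hit the in-memory set; nothing is re-read from disk,
    // which is where disk-intensive MapReduce pipelines lose time.
    val totalErrors = errors.size
    val timeouts    = errors.count(_.contains("timeout"))
    (totalErrors, timeouts)
  }
}
```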
Answered by Xela
For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also some effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new Scala wrapper for Cascading from Twitter, called Scalding. After looking very briefly over the documentation for Scalding, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading operates on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors, e.g. joining a key as a String with a key as a Long, lead to run-time failures.
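To make the type-safety point concrete: with statically typed key-value pairs, a join whose key types disagree is rejected by the compiler, whereas an untyped tuple model only fails once the job runs. A small sketch with a hypothetical `join` helper (not the Cascading or Scalding API); the invariant `Key` wrapper forces both sides to agree on exactly one key type:

```scala
// Invariant wrapper: Key[String] and Key[Long] never unify to a common type.
final case class Key[K](value: K)

object TypedJoin {
  // Inner join on a shared, statically typed key K.
  def join[K, A, B](left: Seq[(Key[K], A)],
                    right: Seq[(Key[K], B)]): Seq[(K, (A, B))] =
    for {
      (k, a)  <- left
      (k2, b) <- right
      if k == k2
    } yield (k.value, (a, b))
}

// Keys agree (both String): compiles and runs.
//   TypedJoin.join(Seq((Key("u1"), "Ann")), Seq((Key("u1"), 3L)))
//
// Keys disagree (String vs Long): an untyped tuple model surfaces this as an
// empty join or a run-time failure; here it does not compile, because K
// cannot be both String and Long:
//   TypedJoin.join(Seq((Key("u1"), "Ann")), Seq((Key(42L), 3L)))  // type error
```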
Answered by AWhitford
Answered by bsdfish
A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.
Answered by seanc
To further jshen's point:
Hadoop Streaming simply pipes data through Unix standard streams: your code (in any language) only has to read lines from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure that as the combiner).
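A streaming word-count pair can be sketched in a few lines of Scala. The tab-delimited `word<TAB>count` record format and the fact that the reducer sees its input sorted by key are Hadoop Streaming's conventions; the object and function names are illustrative. The logic is kept in pure functions, with `main` just wiring stdin to stdout:

```scala
import scala.io.Source

object StreamingWordCount {
  // Mapper: each input line becomes one "word<TAB>1" record per word.
  def mapLine(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).map(w => s"$w\t1").toSeq

  // Reducer: Hadoop sorts records by key between the phases, so equal
  // words arrive adjacent; sum each run of identical keys.
  def reduce(records: Seq[String]): Seq[String] =
    records.map(_.split("\t")).collect { case Array(w, n) => (w, n.toInt) }
      .foldLeft(Vector.empty[(String, Int)]) {
        case (acc :+ ((w, c)), (word, n)) if w == word => acc :+ (w, c + n)
        case (acc, (word, n))                          => acc :+ (word, n)
      }
      .map { case (w, c) => s"$w\t$c" }

  // Run the same binary as mapper or reducer, selected by an argument.
  def main(args: Array[String]): Unit = {
    val lines = Source.stdin.getLines()
    val out =
      if (args.headOption.contains("mapper")) lines.flatMap(mapLine)
      else reduce(lines.toSeq).iterator
    out.foreach(println)
  }
}
```

You would then point `-mapper` and `-reducer` in your `hadoop jar hadoop-streaming.jar ...` invocation at the two modes of this program.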

