Original source: http://stackoverflow.com/questions/14596500/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
What are the options for Hadoop on Scala?
Asked by prassee
We are starting a big-data analytics project and are considering adopting Scala (the Typesafe stack). I would like to know which Scala APIs and projects are available for writing Hadoop MapReduce programs.
Accepted answer by arkajit
Definitely check out Scalding. Speaking as a user and occasional contributor, I've found it to be a very useful tool. The Scalding API is designed to closely mirror the standard Scala collections API: just as you can call flatMap, map, or groupBy on normal collections, you can do the same on Scalding Pipes, which you can think of as a distributed List of tuples. There's also a typed version of the API that provides stronger type-safety guarantees. I haven't used Scoobi, but its API seems similar.
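To make the collections analogy concrete, here is a minimal sketch of a word count written against an ordinary in-memory List, using only the standard Scala collections API. It does not use Scalding or Hadoop at all; the object name `LocalWordCount` and the sample input are made up for illustration. The point is only that the same flatMap / map / groupBy shape is what the answer describes carrying over to distributed Pipes.

```scala
// Word count on a plain in-memory List (no Hadoop or Scalding dependency).
object LocalWordCount {
  def wordCount(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+"))   // one element per token
      .filter(_.nonEmpty)                     // drop empty tokens from leading whitespace
      .map(word => (word.toLowerCase, 1))     // pair each word with a count of 1
      .groupBy { case (word, _) => word }     // collect the pairs belonging to each word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val lines = List("hadoop on scala", "scala on hadoop yarn")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

Keeping the transformation in a pure function like `wordCount` also makes it easy to unit-test locally before moving the same logic to a cluster-backed collection.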
Additionally, there are a few other benefits:
- Scalding is heavily used in production at Twitter and has been battle-tested on Twitter-scale datasets.
- It has several active contributors both inside and outside Twitter that are committed to making it great.
- It is interoperable with your existing Cascading jobs.
- In addition to the Typed API, it has a Fields API which may be more familiar to users of R and data-frame frameworks.
- It provides a robust Matrix Library.
Answer by dhg
I've had success with Scoobi. It's straightforward to use, strongly typed, hides most of the Hadoop mess (by doing things like automatically serializing your objects for you), and is written entirely in Scala. One of the things I like about its API is that the designers wanted the Scoobi collections to feel just like the standard Scala collections, so you use them in much the same way, except that operations run on Hadoop instead of locally. This makes it pretty easy to switch between Scoobi collections and Scala collections while you're developing and testing.
I've also used Scrunch, which is built on top of the Java-based Crunch. I haven't used it in a while, but it's now part of Apache.
Answer by Dean Wampler
Twitter is investing a lot of effort into Scalding, including a nice Matrix library that could be used for various machine learning tasks. I need to give Scoobi a try, too.
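As a rough illustration of why a matrix library fits naturally on top of a collections-style API, the sketch below multiplies two sparse matrices using only a join on the shared index followed by a groupBy and a sum. This is plain Scala on in-memory collections and is not Scalding's Matrix API; `SparseMatrixSketch`, `Entry`, and `multiply` are hypothetical names chosen for this example.

```scala
// Sparse matrix multiplication expressed with collection operations only.
object SparseMatrixSketch {
  // A sparse matrix as a list of (row, column, value) entries.
  type Entry = (Int, Int, Double)

  def multiply(a: Seq[Entry], b: Seq[Entry]): Map[(Int, Int), Double] = {
    // Index the right-hand matrix by row so it can be joined on the shared index k.
    val bByRow: Map[Int, Seq[Entry]] = b.groupBy(_._1)
    // Emit one partial product per matching pair of entries.
    val products = for {
      (i, k, x) <- a
      (_, j, y) <- bByRow.getOrElse(k, Seq.empty[Entry])
    } yield ((i, j), x * y)
    // Sum the partial products that land in the same output cell.
    products.groupBy(_._1).map { case (cell, ps) => cell -> ps.map(_._2).sum }
  }
}
```

The join-then-aggregate shape is the same one a distributed job would use, which is why such operations can be run over data sets too large for one machine.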
For completeness, if you're not wedded to MapReduce, have a look at the Spark project. It performs far better in many scenarios, including in their port of Hive to Spark, appropriately called Shark. As a frequent Hive user, I'm excited about that one.
Answer by Thomas Lockney
Answer by Robert Metzger
Another option is Stratosphere. It offers a Scala API that converts the Scala types to Stratosphere's internal data types.
The API is quite similar to Scalding but Stratosphere natively supports advanced data flows (so you don't have to chain MapReduce Jobs). You will have much better performance with Stratosphere than with Scalding.
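To show what "chaining MapReduce jobs" means in practice, here is a small local simulation, assuming nothing beyond the Scala standard library. It is not the Hadoop or Stratosphere API: `ChainedJobs`, `mapReduce`, `countWords`, and `wordsByCount` are hypothetical names for this sketch. Each call to `mapReduce` stands in for one job (map, group by key, reduce), and the second stage can only start from the finished output of the first; on a real cluster that intermediate output would typically be written to and re-read from distributed storage.

```scala
object ChainedJobs {
  // One MapReduce-style stage over an in-memory Seq:
  // map every input to key/value pairs, group the pairs by key (the "shuffle"),
  // then reduce the values of each key to a single result.
  def mapReduce[A, K, V, R](input: Seq[A])(mapper: A => Seq[(K, V)])(reducer: (K, Seq[V]) => R): Seq[R] =
    input
      .flatMap(mapper)
      .groupBy(_._1)
      .toSeq
      .map { case (key, pairs) => reducer(key, pairs.map(_._2)) }

  // Stage 1: count the occurrences of each word.
  def countWords(lines: Seq[String]): Seq[(String, Int)] =
    mapReduce(lines)(line => line.split("\\s+").filter(_.nonEmpty).toSeq.map(w => (w, 1)))(
      (word, ones) => (word, ones.sum))

  // Stage 2: consumes the output of stage 1 and groups words by their count.
  def wordsByCount(counts: Seq[(String, Int)]): Seq[(Int, Seq[String])] =
    mapReduce(counts) { case (word, n) => Seq((n, word)) }(
      (n, words) => (n, words.sorted))
}
```

A usage example would be `ChainedJobs.wordsByCount(ChainedJobs.countWords(lines))`. The order of the returned groups is not specified, so sort or convert to a Map before comparing results.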
Stratosphere does not run on Hadoop MapReduce but on Hadoop YARN, so you can use your existing YARN cluster.
This is the word count example in Stratosphere (with the Scala API):
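// Read the input file as a data set of text lines.
val input = TextFile(textInput)
// Split every line into words on single spaces.
val words = input.flatMap { line => line.split(" ") }
// Group identical words and count the size of each group.
val counts = words
  .groupBy { word => word }
  .count()
// Declare a sink that writes the counts in CSV format.
val output = counts.write(wordsOutput, CsvOutputFormat())
// Bundle the sink into a plan that can be submitted for execution.
val plan = new ScalaPlan(Seq(output))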

