Java - How to run a simple Spark app from the Eclipse/IntelliJ IDE?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/22639137/
How to run a simple Spark app from the Eclipse/IntelliJ IDE?
Asked by blue-sky
To ease the development of my map reduce tasks running on Hadoop, prior to actually deploying them to Hadoop, I test them using a simple map reducer I wrote:
object mapreduce {
  import scala.collection.JavaConversions._

  val intermediate = new java.util.HashMap[String, java.util.List[Int]]
                                                  //> intermediate : java.util.HashMap[String,java.util.List[Int]] = {}
  val result = new java.util.ArrayList[Int]       //> result : java.util.ArrayList[Int] = []

  def emitIntermediate(key: String, value: Int) {
    if (!intermediate.containsKey(key)) {
      intermediate.put(key, new java.util.ArrayList)
    }
    intermediate.get(key).add(value)
  }                                               //> emitIntermediate: (key: String, value: Int)Unit

  def emit(value: Int) {
    println("value is " + value)
    result.add(value)
  }                                               //> emit: (value: Int)Unit

  def execute(data: java.util.List[String], mapper: String => Unit, reducer: (String, java.util.List[Int]) => Unit) {
    for (line <- data) {
      mapper(line)
    }
    for (keyVal <- intermediate) {
      reducer(keyVal._1, intermediate.get(keyVal._1))
    }
    for (item <- result) {
      println(item)
    }
  }                                               //> execute: (data: java.util.List[String], mapper: String => Unit, reducer: (String, java.util.List[Int]) => Unit)Unit

  def mapper(record: String) {
    var jsonAttributes = com.nebhale.jsonpath.JsonPath.read("$", record, classOf[java.util.ArrayList[String]])
    println("jsonAttributes are " + jsonAttributes)
    var key = jsonAttributes.get(0)
    var value = jsonAttributes.get(1)
    println("key is " + key)
    var delims = "[ ]+";
    var words = value.split(delims);
    for (w <- words) {
      emitIntermediate(w, 1)
    }
  }                                               //> mapper: (record: String)Unit

  def reducer(key: String, listOfValues: java.util.List[Int]) = {
    var total = 0
    for (value <- listOfValues) {
      total += value;
    }
    emit(total)
  }                                               //> reducer: (key: String, listOfValues: java.util.List[Int])Unit

  var dataToProcess = new java.util.ArrayList[String]
                                                  //> dataToProcess : java.util.ArrayList[String] = []
  dataToProcess.add("[\"test1\" , \"test1 here is another test1 test1 \"]")
                                                  //> res0: Boolean = true
  dataToProcess.add("[\"test2\" , \"test2 here is another test2 test1 \"]")
                                                  //> res1: Boolean = true

  execute(dataToProcess, mapper, reducer)         //> jsonAttributes are [test1, test1 here is another test1 test1 ]
                                                  //| key is test1
                                                  //| jsonAttributes are [test2, test2 here is another test2 test1 ]
                                                  //| key is test2
                                                  //| value is 2
                                                  //| value is 2
                                                  //| value is 4
                                                  //| value is 2
                                                  //| value is 2
                                                  //| 2
                                                  //| 2
                                                  //| 4
                                                  //| 2
                                                  //| 2

  for (keyValue <- intermediate) {
    println(keyValue._1 + "->" + keyValue._2.size) //> another->2
                                                  //| is->2
                                                  //| test1->4
                                                  //| here->2
                                                  //| test2->2
  }
}
This allows me to run my mapreduce tasks within my Eclipse IDE on Windows before deploying to the actual Hadoop cluster. I would like to do something similar for Spark, i.e. write Spark code within Eclipse and test it prior to deploying it to a Spark cluster. Is this possible with Spark? Since Spark runs on top of Hadoop, does this mean I cannot run Spark without first having Hadoop installed? In other words, can I run the following code using just the Spark libraries?
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "$YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val sc = new SparkContext("local", "Simple App", "YOUR_SPARK_HOME",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Taken from https://spark.apache.org/docs/0.9.0/quick-start.html#a-standalone-app-in-scala
If so, which Spark libraries do I need to include in my project?
Answered by Klugscheißer
Add the following to your build.sbt:

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"

and make sure your scalaVersion is set (e.g. scalaVersion := "2.10.3").
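For context, a complete minimal build.sbt might look like the sketch below. The project name and version are placeholders I've assumed; the Spark and Scala versions simply mirror the ones mentioned above:

// minimal sketch of a build.sbt; name and version are assumed placeholders
name := "simple-project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"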
Also, if you're just running the program locally, you can skip the last two arguments to SparkContext, as follows: val sc = new SparkContext("local", "Simple App")
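To make that concrete, here is a minimal sketch of the quick-start example trimmed down for a purely local run from the IDE; the object name and README path are placeholders you would replace with your own:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleAppLocal {
  def main(args: Array[String]) {
    // With the "local" master everything runs inside the IDE's JVM,
    // so no Spark home and no application jar list are needed.
    val logFile = "/path/to/README.md" // placeholder: any text file on your machine
    val sc = new SparkContext("local", "Simple App")
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    sc.stop()
  }
}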
Finally, Spark can run on Hadoop but can also run in standalone mode. See: https://spark.apache.org/docs/0.9.1/spark-standalone.html
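To make the distinction concrete, the main thing that changes between a local IDE run and a run against a standalone cluster is the master URL passed to SparkContext; in the sketch below the host, port, Spark home, and jar path are placeholders:

// Local testing inside the IDE: a single JVM, no Hadoop or cluster required.
val localSc = new SparkContext("local", "Simple App")

// Against a standalone Spark cluster: point at the master's spark:// URL and
// list the application jar(s) so they can be shipped to the worker nodes.
val clusterSc = new SparkContext("spark://master-host:7077", "Simple App",
  "/path/to/spark", List("target/scala-2.10/simple-project_2.10-1.0.jar"))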