如何在 Windows 上设置 Spark?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/25481325/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to set up Spark on Windows?
提问by Siva
I am trying to setup Apache Spark on Windows.
我正在尝试在 Windows 上设置 Apache Spark。
After searching a bit, I understand that the standalone mode is what I want. Which binaries do I download in order to run Apache Spark on Windows? I see distributions with hadoop and cdh at the spark download page.
经过一番搜索,我明白我想要的是独立模式。为了在 Windows 上运行 Apache Spark,我需要下载哪些二进制文件?我在 spark 下载页面看到了带有 hadoop 和 cdh 的发行版。
I don't have references on the web for this. A step-by-step guide would be highly appreciated.
我在网上没有找到这方面的参考资料。如果能提供分步指南,将不胜感激。
采纳答案by jkgeyti
I found the easiest solution on Windows is to build from source.
我发现 Windows 上最简单的解决方案是从源代码构建。
You can pretty much follow this guide: http://spark.apache.org/docs/latest/building-spark.html
您几乎可以按照本指南进行操作:http://spark.apache.org/docs/latest/building-spark.html
Download and install Maven, and set MAVEN_OPTS to the value specified in the guide.
下载并安装 Maven,并将 MAVEN_OPTS 设置为指南中指定的值。
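For reference, a minimal sketch of that step from a Windows command prompt is shown below; the memory values are only what build guides of that era typically suggested and may differ for your Spark version, so defer to the guide itself:

REM Values are approximate; check the building-spark guide for the current recommendation.
set MAVEN_OPTS=-Xmx2g -XX:ReservedCodeCacheSize=512m
mvn -DskipTests clean package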
But if you're just playing around with Spark, and don't actually need it to run on Windows for any reason other than that your own machine runs Windows, I'd strongly suggest you install Spark on a Linux virtual machine. The simplest way to get started is probably to download the ready-made images from Cloudera or Hortonworks, and either use the bundled version of Spark, or install your own from source or from the compiled binaries you can get from the Spark website.
但如果你只是想玩玩 Spark,而且除了你自己的机器恰好运行 Windows 之外并没有其他理由需要它在 Windows 上运行,我强烈建议你在 Linux 虚拟机上安装 Spark。最简单的入门方法可能是下载 Cloudera 或 Hortonworks 制作的现成镜像,然后使用其捆绑的 Spark 版本,或者从源代码或 Spark 网站提供的编译好的二进制文件自行安装。
回答by Ani Menon
Steps to install Spark in local mode:
在本地模式下安装 Spark 的步骤:
- Install Java 7 or later. To test that the Java installation is complete, open a command prompt, type java and hit Enter. If you receive the message 'Java' is not recognized as an internal or external command., you need to configure your environment variables JAVA_HOME and PATH to point to the path of the JDK.
- Set SCALA_HOME in Control Panel\System and Security\System, go to "Adv System settings" and add %SCALA_HOME%\bin to the PATH variable in the environment variables.
- Install Python 2.6 or later from the Python download link.
- Download SBT. Install it and set SBT_HOME as an environment variable with value <<SBT PATH>>.
- Download winutils.exe from the HortonWorks repo or git repo. Since we don't have a local Hadoop installation on Windows, we have to download winutils.exe and place it in a bin directory under a created Hadoop home directory. Set HADOOP_HOME = <<Hadoop home directory>> in the environment variables.
- We will be using a pre-built Spark package, so choose a Spark pre-built package for Hadoop from the Spark download page. Download and extract it.
- Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH variable in the environment variables.
- Run the command: spark-shell
- Open http://localhost:4040/ in a browser to see the SparkContext web UI.

(A consolidated sketch of the environment-variable commands follows the translated steps below.)
- 安装 Java 7 或更高版本。要测试 Java 是否安装完成,打开命令提示符,输入 java 并按回车。如果收到消息 'Java' is not recognized as an internal or external command.,则需要配置环境变量 JAVA_HOME 和 PATH,使其指向 JDK 的路径。
- 设置 SCALA_HOME:在 Control Panel\System and Security\System 中进入"高级系统设置",并在环境变量的 PATH 变量中添加 %SCALA_HOME%\bin。
- 从 Python 下载链接安装 Python 2.6 或更高版本。
- 下载 SBT。安装它,并设置环境变量 SBT_HOME,其值为 <<SBT PATH>>。
- 从 HortonWorks repo 或 git repo 下载 winutils.exe。由于我们在 Windows 上没有本地的 Hadoop 安装,必须下载 winutils.exe 并将其放在自行创建的 Hadoop 主目录下的 bin 目录中。在环境变量中设置 HADOOP_HOME = <<Hadoop home directory>>。
- 我们将使用预构建的 Spark 包,因此在 Spark 下载页面选择为 Hadoop 预构建的 Spark 包。下载并解压。
- 设置 SPARK_HOME,并在环境变量的 PATH 变量中添加 %SPARK_HOME%\bin。
- 运行命令:spark-shell
- 在浏览器中打开 http://localhost:4040/ 以查看 SparkContext Web UI。
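For convenience, here is a consolidated sketch of the environment-variable steps above, run once from a command prompt; the install locations are assumptions, so substitute wherever you actually installed Scala, SBT, Hadoop/winutils and Spark:

REM Hypothetical install locations -- substitute your own paths.
setx SCALA_HOME "C:\scala"
setx SBT_HOME "C:\sbt"
setx HADOOP_HOME "C:\hadoop"
setx SPARK_HOME "C:\spark"
REM Append the bin directories to PATH (literal paths, matching the values above);
REM open a new command prompt afterwards so the changes take effect.
setx PATH "%PATH%;C:\scala\bin;C:\spark\bin"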
回答by ajnavarro
You can download spark from here:
你可以从这里下载 Spark:
http://spark.apache.org/downloads.html
http://spark.apache.org/downloads.html
I recommend you this version: Hadoop 2 (HDP2, CDH5)
我推荐你这个版本:Hadoop 2 (HDP2, CDH5)
Since version 1.0.0 there are .cmd scripts to run Spark on Windows.
从 1.0.0 版开始,就有.cmd脚本可以在 Windows 中运行 spark。
Unpack it using 7zip or similar.
使用 7zip 或类似工具解压它。
To start you can execute /bin/spark-shell.cmd --master local[2]
要开始,您可以执行/bin/spark-shell.cmd --master local[2]
To configure your instance, you can follow this link: http://spark.apache.org/docs/latest/
要配置您的实例,您可以参考此链接:http://spark.apache.org/docs/latest/
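For example, assuming you extracted the archive to C:\spark (the folder name will match whichever package you downloaded), a first session could look like this:

REM Example path only; use the directory you actually extracted the download into.
cd C:\spark
bin\spark-shell.cmd --master local[2]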
回答by Nishu Tayal
You can use following ways to setup Spark:
您可以使用以下方式设置 Spark:
- Building from Source
- Using prebuilt release
- 从源代码构建
- 使用预构建版本
There are various ways to build Spark from source.
First I tried building the Spark source with SBT, but that requires Hadoop. To avoid those issues, I used a pre-built release.
从源代码构建 Spark 有多种方法。
我先尝试用 SBT 构建 Spark 源代码,但那需要 Hadoop。为了避免这些问题,我使用了预构建的版本。
Instead of building from source, I downloaded the prebuilt release for the Hadoop 2.x version and ran it. For this you need to install Scala as a prerequisite.
我没有从源代码构建,而是下载了针对 hadoop 2.x 版本的预构建版本并运行了它。为此,您需要先安装 Scala 作为先决条件。
I have collated all steps here :
How to run Apache Spark on Windows7 in standalone mode
我在这里整理了所有步骤:
如何在独立模式下在 Windows7 上运行 Apache Spark
Hope it'll help you..!!!
希望能帮到你..!!!
回答by Farah
Trying to work with spark-2.x.x, building Spark source code didn't work for me.
尝试使用 spark-2.x.x 时,从源代码构建 Spark 对我不起作用。
So, although I'm not going to use Hadoop, I downloaded the pre-built Spark with Hadoop embedded: spark-2.0.0-bin-hadoop2.7.tar.gz

- Point SPARK_HOME at the extracted directory, then add ;%SPARK_HOME%\bin; to PATH.
- Download the winutils executable from the Hortonworks repository, or from the Amazon AWS platform winutils.
- Create a directory in which to place winutils.exe, for example C:\SparkDev\x64. Add the environment variable HADOOP_HOME pointing to this directory, then add %HADOOP_HOME%\bin to PATH.
- Using the command line, create the directory: mkdir C:\tmp\hive
- Using the executable that you downloaded, add full permissions to the directory you just created, but using the unixian formalism: %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
- Type the following command line: %SPARK_HOME%\bin\spark-shell

(A consolidated sketch of these commands appears at the end of this answer.)
因此,虽然我不打算使用 Hadoop,但我下载了内置 Hadoop 的预构建 Spark:spark-2.0.0-bin-hadoop2.7.tar.gz

- 将 SPARK_HOME 指向解压出来的目录,然后把 ;%SPARK_HOME%\bin; 添加到 PATH。
- 创建一个用于存放可执行文件 winutils.exe 的目录,例如 C:\SparkDev\x64。添加指向此目录的环境变量 HADOOP_HOME,然后把 %HADOOP_HOME%\bin 添加到 PATH。
- 使用命令行创建目录:mkdir C:\tmp\hive
- 使用下载的可执行文件,按 Unix 风格为刚创建的目录授予完全权限:%HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
- 键入以下命令行:%SPARK_HOME%\bin\spark-shell
Scala command line input should be shown automatically.
Scala 命令行输入应自动显示。
Remark: You don't need to configure Scala separately. It's built in too.
备注:您不需要单独配置 Scala。它也是内置的。
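Put together, the steps above amount to something like the following command-prompt session; the directories are only examples, and winutils.exe is assumed to sit in %HADOOP_HOME%\bin:

REM Example directories -- adjust to where you extracted Spark and placed winutils.exe.
set SPARK_HOME=C:\spark-2.0.0-bin-hadoop2.7
set HADOOP_HOME=C:\SparkDev\x64
set PATH=%PATH%;%SPARK_HOME%\bin;%HADOOP_HOME%\bin
mkdir C:\tmp\hive
%HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
%SPARK_HOME%\bin\spark-shell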
回答by Emul
Here's the fixes to get it to run in Windows without rebuilding everything - such as if you do not have a recent version of MS-VS. (You will need a Win32 C++ compiler, but you can install MS VS Community Edition free.)
这是使其在 Windows 中运行而无需重建所有内容的修复程序 - 例如,如果您没有最新版本的 MS-VS。(您将需要一个 Win32 C++ 编译器,但您可以免费安装 MS VS 社区版。)
I've tried this with Spark 1.2.2 and Mahout 0.10.2, as well as with the latest versions in November 2015. There are a number of problems, including the fact that the Scala code tries to run a bash script (mahout/bin/mahout), which of course does not work; the sbin scripts have not been ported to Windows; and winutils is missing if Hadoop is not installed.
我已经用 Spark 1.2.2 和 Mahout 0.10.2 以及 2015 年 11 月的最新版本试过了。存在许多问题,包括:Scala 代码尝试运行 bash 脚本(mahout/bin/mahout),这当然不起作用;sbin 脚本尚未移植到 Windows;如果未安装 Hadoop,则缺少 winutils。
(1) Install scala, then unzip spark/hadoop/mahout into the root of C: under their respective product names.
(1)安装scala,然后将spark/hadoop/mahout解压到C:根目录下各自的产品名称下。
(2) Rename \mahout\bin\mahout to mahout.sh.was (we will not need it)
(2)将\mahout\bin\mahout 重命名为mahout.sh.was(我们不需要它)
(3) Compile the following Win32 C++ program and copy the executable to a file named C:\mahout\bin\mahout (that's right - no .exe suffix, like a Linux executable)
(3)编译下面的Win32 C++程序,将可执行文件复制到一个名为C:\mahout\bin\mahout的文件中(没错——没有.exe后缀,就像Linux的可执行文件一样)
#include "stdafx.h"
#define BUFSIZE 4096
#define VARNAME TEXT("MAHOUT_CP")
int _tmain(int argc, _TCHAR* argv[]) {
DWORD dwLength; LPTSTR pszBuffer;
pszBuffer = (LPTSTR)malloc(BUFSIZE*sizeof(TCHAR));
dwLength = GetEnvironmentVariable(VARNAME, pszBuffer, BUFSIZE);
if (dwLength > 0) { _tprintf(TEXT("%s\n"), pszBuffer); return 0; }
return 1;
}
(4) Create the script \mahout\bin\mahout.bat and paste in the content below, although the exact names of the jars in the _CP class paths will depend on the versions of spark and mahout. Update any paths per your installation. Use 8.3 path names without spaces in them. Note that you cannot use wildcards/asterisks in the classpaths here.
(4)创建脚本\mahout\bin\mahout.bat 并粘贴下面的内容,尽管_CP 类路径中jar 的确切名称将取决于spark 和mahout 的版本。根据您的安装更新任何路径。使用 8.3 路径名,其中没有空格。请注意,您不能在此处的类路径中使用通配符/星号。
set SCALA_HOME=C:\Progra~2\scala
set SPARK_HOME=C:\spark
set HADOOP_HOME=C:\hadoop
set MAHOUT_HOME=C:\mahout
set SPARK_SCALA_VERSION=2.10
set MASTER=local[2]
set MAHOUT_LOCAL=true
set path=%SCALA_HOME%\bin;%SPARK_HOME%\bin;%PATH%
cd /D %SPARK_HOME%
set SPARK_CP=%SPARK_HOME%\conf\;%SPARK_HOME%\lib\xxx.jar;...other jars...
set MAHOUT_CP=%MAHOUT_HOME%\lib\xxx.jar;...other jars...;%MAHOUT_HOME%\xxx.jar;...other jars...;%SPARK_CP%;%MAHOUT_HOME%\lib\spark\xxx.jar;%MAHOUT_HOME%\lib\hadoop\xxx.jar;%MAHOUT_HOME%\src\conf;%JAVA_HOME%\lib\tools.jar
start "master0" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip localhost --port 7077 --webui-port 8082 >>out-master0.log 2>>out-master0.err
start "worker1" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker spark://localhost:7077 --webui-port 8083 >>out-worker1.log 2>>out-worker1.err
REM ...you may add more workers here...
cd /D %MAHOUT_HOME%
"%JAVA_HOME%\bin\java" -Xmx4g -classpath "%MAHOUT_CP%" "org.apache.mahout.sparkbindings.shell.Main"
The name of the variable MAHOUT_CP should not be changed, as it is referenced in the C++ code.
变量 MAHOUT_CP 的名称不应更改,因为它在 C++ 代码中被引用。
Of course you can comment-out the code that launches the Spark master and worker because Mahout will run Spark as-needed; I just put it in the batch job to show you how to launch it if you wanted to use Spark without Mahout.
当然你可以注释掉启动 Spark master 和 worker 的代码,因为 Mahout 会根据需要运行 Spark;我只是将它放在批处理作业中,向您展示如果您想在没有 Mahout 的情况下使用 Spark,如何启动它。
(5) The following tutorial is a good place to begin:
(5)以下教程是一个很好的开始:
https://mahout.apache.org/users/sparkbindings/play-with-shell.html
You can bring up the Mahout Spark instance at:
您可以在以下位置启动 Mahout Spark 实例:
"C:\Program Files (x86)\Google\Chrome\Application\chrome" --disable-web-security http://localhost:4040
回答by Chris
The guide by Ani Menon (thx!) almost worked for me on Windows 10, I just had to get a newer winutils.exe off that git (currently hadoop-2.8.1): https://github.com/steveloughran/winutils
Ani Menon 的指南(谢谢!)几乎在 Windows 10 上对我有用,我只需要从那个 git(当前是 hadoop-2.8.1)中获取一个更新的 winutils.exe:https://github.com/steveloughran/winutils
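If you are on a recent Windows 10 build (which ships curl), one way to fetch that newer winutils.exe is sketched below; the hadoop-2.8.1 folder and the HADOOP_HOME location are assumptions, so pick whatever matches your setup:

REM Assumes HADOOP_HOME already points at your Hadoop home directory (e.g. C:\hadoop).
mkdir "%HADOOP_HOME%\bin"
curl -L -o "%HADOOP_HOME%\bin\winutils.exe" https://github.com/steveloughran/winutils/raw/master/hadoop-2.8.1/bin/winutils.exe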
回答by Aakash Saxena
Here are seven steps to install spark on windows 10 and run it from python:
以下是在 Windows 10 上安装 spark 并从 python 运行它的七个步骤:
Step 1: download the spark 2.2.0 tar (tape Archive) gz file to any folder F from this link - https://spark.apache.org/downloads.html. Unzip it and copy the unzipped folder to the desired folder A. Rename the spark-2.2.0-bin-hadoop2.7 folder to spark.
第 1 步:通过此链接将 spark 2.2.0 tar(磁带存档)gz 文件下载到任何文件夹 F - https://spark.apache.org/downloads.html。解压,将解压后的文件夹复制到需要的文件夹A,将spark-2.2.0-bin-hadoop2.7文件夹重命名为spark。
Let path to the spark folder be C:\Users\Desktop\A\spark
让 spark 文件夹的路径为 C:\Users\Desktop\A\spark
Step 2: download the hadoop 2.7.3 tar gz file to the same folder F from this link - https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz. Unzip it and copy the unzipped folder to the same folder A. Rename the folder name from Hadoop-2.7.3.tar to hadoop. Let path to the hadoop folder be C:\Users\Desktop\A\hadoop
第 2 步:从此链接下载 hadoop 2.7.3 tar gz 文件到同一文件夹 F - https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz。解压并将解压后的文件夹复制到同一个文件夹 A 中。将文件夹名称从 Hadoop-2.7.3.tar 重命名为 hadoop。让 hadoop 文件夹的路径为 C:\Users\Desktop\A\hadoop
Step 3: Create a new notepad text file. Save this empty notepad file as winutils.exe (with Save as type: All files). Copy this 0 KB winutils.exe file to your bin folder in spark - C:\Users\Desktop\A\spark\bin
第 3 步:创建一个新的记事本文本文件。将此空记事本文件另存为 winutils.exe(保存类型为:所有文件)。将此 0 KB 的 winutils.exe 文件复制到 spark 中的 bin 文件夹 - C:\Users\Desktop\A\spark\bin
Step 4: Now, we have to add these folders to the System environment.
第 4 步:现在,我们必须将这些文件夹添加到系统环境中。
4a: Create a system variable (not user variable as user variable will inherit all the properties of the system variable) Variable name: SPARK_HOME Variable value: C:\Users\Desktop\A\spark
4a:创建系统变量(不是用户变量,因为用户变量会继承系统变量的所有属性) 变量名:SPARK_HOME 变量值:C:\Users\Desktop\A\spark
Find Path system variable and click edit. You will see multiple paths. Do not delete any of the paths. Add this variable value - ;C:\Users\Desktop\A\spark\bin
找到路径系统变量并单击编辑。您将看到多个路径。不要删除任何路径。添加此变量值 - ;C:\Users\Desktop\A\spark\bin
4b: Create a system variable
4b:创建系统变量
Variable name: HADOOP_HOME Variable value: C:\Users\Desktop\A\hadoop
变量名:HADOOP_HOME 变量值:C:\Users\Desktop\A\hadoop
Find Path system variable and click edit. Add this variable value - ;C:\Users\Desktop\A\hadoop\bin
找到路径系统变量并单击编辑。添加这个变量值 - ;C:\Users\Desktop\A\hadoop\bin
4c: Create a system variable Variable name: JAVA_HOME Search Java in windows. Right click and click open file location. You will have to again right click on any one of the java files and click on open file location. You will be using the path of this folder. OR you can search for C:\Program Files\Java. My Java version installed on the system is jre1.8.0_131. Variable value: C:\Program Files\Java\jre1.8.0_131\bin
4c:创建系统变量变量名:JAVA_HOME 在windows中搜索Java。右键单击并单击打开文件位置。您必须再次右键单击任一 Java 文件,然后单击打开文件位置。您将使用此文件夹的路径。或者您可以搜索 C:\Program Files\Java。我在系统上安装的Java版本是jre1.8.0_131。变量值:C:\Program Files\Java\jre1.8.0_131\bin
Find Path system variable and click edit. Add this variable value - ;C:\Program Files\Java\jre1.8.0_131\bin
找到路径系统变量并单击编辑。添加这个变量值 - ;C:\Program Files\Java\jre1.8.0_131\bin
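Before moving on to step 5, it can help to confirm the variables took effect; from a new command prompt, something along these lines (the output will of course show your own paths):

echo %SPARK_HOME%
echo %HADOOP_HOME%
echo %JAVA_HOME%
REM Both of these should resolve if the PATH entries from step 4 are in place.
where spark-shell
java -version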
Step 5: Open command prompt and go to your spark bin folder (type cd C:\Users\Desktop\A\spark\bin). Type spark-shell.
步骤 5:打开命令提示符并转到您的 spark bin 文件夹(键入 cd C:\Users\Desktop\A\spark\bin)。输入 spark-shell。
C:\Users\Desktop\A\spark\bin>spark-shell
It may take time and give some warnings. Finally, it will show welcome to spark version 2.2.0
这可能需要时间并给出一些警告。最后,它会显示欢迎使用 spark 2.2.0 版
Step 6: Type exit() or restart the command prompt and go to the spark bin folder again. Type pyspark:
第 6 步:键入 exit() 或重新启动命令提示符并再次转到 spark bin 文件夹。输入pyspark:
C:\Users\Desktop\A\spark\bin>pyspark
It will show some warnings and errors, but ignore them. It works.
它会显示一些警告和错误,忽略即可。它可以正常工作。
Step 7: Your download is complete. If you want to directly run spark from python shell then: go to Scripts in your python folder and type
第 7 步:您的下载已完成。如果您想直接从 python shell 运行 spark,则:转到 python 文件夹中的脚本并键入
pip install findspark
in command prompt.
在命令提示符中。
In python shell
在 python shell 中
import findspark
findspark.init()
import the necessary modules
导入必要的模块
from pyspark import SparkContext
from pyspark import SparkConf
If you would like to skip the steps for importing findspark and initializing it, then please follow the procedure given in importing pyspark in python shell
如果您想跳过导入 findspark 并对其进行初始化的步骤,请按照在 python shell中导入 pyspark 中给出的步骤进行操作
回答by Divine
Cloudera and Hortonworks are the best tools to start up with the HDFS in Microsoft Windows. You can also use VMWare or VBox to initiate Virtual Machine to establish build to your HDFS and Spark, Hive, HBase, Pig, Hadoop with Scala, R, Java, Python.
Cloudera 和 Hortonworks 是在 Microsoft Windows 中启动 HDFS 的最佳工具。您还可以使用 VMWare 或 VBox 启动虚拟机,以使用 Scala、R、Java、Python 建立对 HDFS 和 Spark、Hive、HBase、Pig、Hadoop 的构建。
回答by HansHarhoff
Here is a simple minimum script to run from any python console. It assumes that you have extracted the Spark libraries that you have downloaded into C:\Apache\spark-1.6.1.
这是一个可以从任何 python 控制台运行的简单的最小脚本。它假定您已将下载的 Spark 库解压到 C:\Apache\spark-1.6.1。
This works in Windows without building anything and solves problems where Spark would complain about recursive pickling.
这适用于 Windows,无需构建任何东西,并解决了 Spark 抱怨递归 pickling(序列化)的问题。
import sys
import os

# Adjust to where you extracted the Spark download.
spark_home = r'C:\Apache\spark-1.6.1'
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'pyspark.zip'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.9-src.zip'))

# pyspark can only be imported after the path entries above are in place.
import pyspark

# Start a spark context:
sc = pyspark.SparkContext()

# Simple check: filter the bundled README for lines mentioning Python.
lines = sc.textFile(os.path.join(spark_home, "README.md"))
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.first()