How to load local file in sc.textFile, instead of HDFS
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/27299923/
Asked by Jas
I'm following the great Spark tutorial,
so at 46m:00s I'm trying to load README.md, but it fails. What I'm doing is this:
$ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash
bash-4.1# cd /usr/local/spark-1.1.0-bin-hadoop2.4
bash-4.1# ls README.md
README.md
bash-4.1# ./bin/spark-shell
scala> val f = sc.textFile("README.md")
14/12/04 12:11:14 INFO storage.MemoryStore: ensureFreeSpace(164073) called with curMem=0, maxMem=278302556
14/12/04 12:11:14 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 160.2 KB, free 265.3 MB)
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12
scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
How can I load that README.md?
Answered by suztomo
Try explicitly specifying sc.textFile("file:///path to the file/"). The error occurs when a Hadoop environment is set.
SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.getDefaultUri if the scheme is absent. This method reads the "fs.defaultFS" parameter of the Hadoop conf. If you set the HADOOP_CONF_DIR environment variable, the parameter is usually set to "hdfs://..."; otherwise "file://".
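So, inside the question's shell, the two forms below behave differently (a minimal sketch; the absolute path reuses the Spark home directory from the question):
// No scheme: resolved against fs.defaultFS, which in the question's environment is hdfs://sandbox:9000
val fromHdfs = sc.textFile("README.md")
// Explicit scheme: always read from the local filesystem of the node
val fromLocal = sc.textFile("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md")
fromLocal.count()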
Answered by zaxliu
gonbe's answer is excellent. But still I want to mention that file:/// = ~/../../ (the filesystem root), not $SPARK_HOME. Hope this saves some time for newbs like me.
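In other words, the part after file:// is an absolute path from the filesystem root, so a sketch (reusing the question's Spark home) would look like this:
// file:///README.md would point at /README.md, NOT at $SPARK_HOME/README.md
// Use the full absolute path instead:
val f = sc.textFile("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md")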
Answered by Aklank Jain
While Spark supports loading files from the local filesystem, it requires that the files are available at the same path on all nodes in your cluster.
Some network filesystems, like NFS, AFS, and MapR's NFS layer, are exposed to the user as a regular filesystem.
If your data is already in one of these systems, then you can use it as an input by just specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on each node. Every node needs to have the same path:
rdd = sc.textFile("file:///path/to/file")
If your file isn't already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to the workers (see the sketch at the end of this answer).
Take care to put file:// in front and to use "/" or "\" according to your OS.
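A minimal sketch of the driver-side parallelize approach mentioned above (assuming /path/to/file exists on the driver machine only):
import scala.io.Source
// Read the file on the driver only; no HDFS or shared filesystem involved
val lines = Source.fromFile("/path/to/file").getLines().toList
// Distribute the contents to the executors as an RDD
val rdd = sc.parallelize(lines)
rdd.count()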
Answered by Matiji66
Attention:
Make sure that you run Spark in local mode when you load data from local (sc.textFile("file:///path to the file/")), or you will get an error like this: Caused by: java.io.FileNotFoundException: File file:/data/sparkjob/config2.properties does not exist.
Because executors which run on different workers will not find this file in their local path.
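For example (a sketch; the properties file path is the one from the error above and is assumed to exist only on the driver machine):
$ spark-shell --master local[*]
scala> val props = sc.textFile("file:///data/sparkjob/config2.properties")
scala> props.count()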
Answered by Hamdi Charef
You just need to specify the path of the file as "file:///directory/file"
Example:
val textFile = sc.textFile("file:///usr/local/spark/README.md")
Answered by Gene
I have a file called NewsArticle.txt on my Desktop.
In Spark, I typed:
val textFile = sc.textFile("file:///C:/Users/582767/Desktop/NewsArticle.txt")
I needed to change all the \ characters to / in the file path.
To test if it worked, I typed:
textFile.foreach(println)
I'm running Windows 7 and I don't have Hadoop installed.
Answered by Joarder Kamal
If the file is located on your Spark master node (e.g., when using AWS EMR), then launch spark-shell in local mode first.
$ spark-shell --master=local
scala> val df = spark.read.json("file:///usr/lib/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
Alternatively, you can first copy the file from the local file system to HDFS and then launch Spark in its default mode (e.g., YARN when using AWS EMR) to read the file directly.
$ hdfs dfs -mkdir -p /hdfs/spark/examples
$ hadoop fs -put /usr/lib/spark/examples/src/main/resources/people.json /hdfs/spark/examples
$ hadoop fs -ls /hdfs/spark/examples
Found 1 items
-rw-r--r-- 1 hadoop hadoop 73 2017-05-01 00:49 /hdfs/spark/examples/people.json
$ spark-shell
scala> val df = spark.read.json("/hdfs/spark/examples/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
Answered by Binita Bharati
This happened to me with Spark 2.3, with Hadoop also installed under the common "hadoop" user home directory. Since both Spark and Hadoop were installed under the same common directory, Spark by default considers the scheme as hdfs and starts looking for the input files under hdfs, as specified by fs.defaultFS in Hadoop's core-site.xml. In such cases, we need to explicitly specify the scheme as file:///<absolute path to file>.
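A quick way to confirm which default filesystem Spark picked up, and to override it with an explicit scheme (a sketch; the input path is hypothetical):
// Shows the fs.defaultFS value that Spark read from Hadoop's core-site.xml
sc.hadoopConfiguration.get("fs.defaultFS")
// Bypass it with an explicit file:// scheme (hypothetical local path)
val ds = spark.read.textFile("file:///home/hadoop/data/input.txt")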
Answered by Andrushenko Alexander
You do not have to use sc.textFile(...) to convert local files into dataframes. One option is to read a local file line by line and then transform it into a Spark Dataset. Here is an example for a Windows machine in Java:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.types.DataTypes.StringType;
import static org.apache.spark.sql.types.DataTypes.createStructField;

// Schema of the resulting DataFrame
StructType schemata = DataTypes.createStructType(
    new StructField[]{
        createStructField("COL1", StringType, false),
        createStructField("COL2", StringType, false),
        ...
    }
);
String separator = ";";
String filePath = "C:\\work\\myProj\\myFile.csv"; // backslashes must be escaped in Java string literals
SparkContext sparkContext = new SparkContext(new SparkConf().setAppName("MyApp").setMaster("local"));
JavaSparkContext jsc = new JavaSparkContext(sparkContext);
SQLContext sqlContext = SQLContext.getOrCreate(sparkContext);
// Read the local file on the driver, line by line
List<String[]> result = new ArrayList<>();
try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] vals = line.split(separator);
        result.add(vals);
    }
} catch (Exception ex) {
    System.out.println(ex.getMessage());
    throw new RuntimeException(ex);
}
// Distribute the parsed lines to the cluster and turn them into a DataFrame
JavaRDD<String[]> jRdd = jsc.parallelize(result);
JavaRDD<Row> jRowRdd = jRdd.map(RowFactory::create);
Dataset<Row> data = sqlContext.createDataFrame(jRowRdd, schemata);
Now you can use the DataFrame data in your code.

