Scala: reading binary files into Spark
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/45499827/
Reading binary File into Spark
Asked by MaatDeamon
I have a set of files, each containing a single record in Marc21 binary format. I would like to ingest the set of files as an RDD, where each element would be one record as binary data. Later on I will use a Marc library to convert each record into a Java object for further processing.
As of now, I am puzzled as to how I can read a binary file.
I have seen the following function:
binaryRecords(path: String, recordLength: Int, conf: Configuration)
However, it assumes a single file containing multiple records of the same length. My records will definitely be of different sizes, and besides, each one is in a separate file.
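For reference, a minimal sketch of how binaryRecords is typically called when the fixed-length assumption does hold; the path and the 100-byte record length below are placeholders, not taken from the question:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: binaryRecords splits a single file into fixed-length chunks.
val sc = new SparkContext(new SparkConf().setAppName("fixed-length-demo").setMaster("local[*]"))

// Each RDD element is an Array[Byte] of exactly recordLength bytes,
// which is why this only works when every record has the same size.
val fixedRecords = sc.binaryRecords("hdfs:///data/fixed_records.bin", recordLength = 100)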
Is there a way to get around that? How can I provide a length for each file? Would the only way be to calculate the length of each file and then read its record?
The other solution I see, obviously, would be to read the records in Java and serialize them into whatever format is convenient to ingest.
Please advise.
Answered by user5262448
Have you tried sc.binaryFiles() from Spark?
Here is the link to the documentation: https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/SparkContext.html#binaryFiles(java.lang.String,%20int)
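A minimal sketch of that approach, reusing the SparkContext sc from above and assuming the Marc21 files sit under a single directory; the path and the marc4j mention are illustrative and not part of the original answer:

import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

// binaryFiles returns one (path, PortableDataStream) pair per file,
// so records of different sizes in separate files are not a problem.
val files: RDD[(String, PortableDataStream)] = sc.binaryFiles("hdfs:///data/marc21/")

// Materialize each stream into a byte array: one whole Marc21 record per element.
val rawRecords: RDD[(String, Array[Byte])] = files.mapValues(_.toArray())

// A Marc library (e.g. marc4j) could then parse each Array[Byte] through a
// ByteArrayInputStream into a Java record object inside a map().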

