Scala: Reading a binary file into Spark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/45499827/

Date: 2020-10-22 09:23:41  Source: igfitidea

Reading binary File into Spark

scala, apache-spark, spark-streaming, binaryfiles, binary-data

Asked by MaatDeamon

I have a set of files that each contain a specific record in Marc21 binary format. I would like to ingest the set of files as an RDD, where each element would be a record object as binary data. Later on I will use a Marc library to convert the object into a Java object for further processing.

As of now, I am puzzled as to how I can read a binary file.

I have seen the following function:

binaryRecords(path: String, recordLength: Int, conf: Configuration)

However, it assumes a single file containing multiple records of the same length. My records will definitely be of different sizes. Besides, each one is in a separate file.

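For context, a minimal sketch of how that API is typically called; the path and record length below are hypothetical, and every record in the file must share the same fixed byte length, which is exactly why it does not fit variable-length Marc21 records:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(
  new SparkConf().setAppName("fixed-length-records").setMaster("local[*]"))

// binaryRecords splits one file into equally sized chunks of recordLength
// bytes; each chunk becomes one Array[Byte] element of the resulting RDD.
val fixed: RDD[Array[Byte]] =
  sc.binaryRecords("hdfs:///data/fixed-width.bin", recordLength = 100)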

Is there a way to get around that? How can I give a length for each file? Would the only way be to calculate the length of each file and then read the records?

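For reference, one way around the fixed-length constraint that seems applicable here is Spark's binaryFiles, which reads each file as a whole, so no record length is needed. A minimal sketch, reusing the sc from the previous snippet and assuming one Marc21 record per file under a hypothetical directory:

import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

// Each element is (file path, a lazily readable stream over the whole file).
val files: RDD[(String, PortableDataStream)] =
  sc.binaryFiles("hdfs:///data/marc21/")

// Materialize the bytes of each file; one element per record, whatever its size.
val rawRecords: RDD[Array[Byte]] =
  files.map { case (_, stream) => stream.toArray() }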

The other solution I see, obviously, would be to read the records in Java and serialize them into whatever format is convenient to ingest.

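Following the original plan of converting each record with a Marc library on the executors, a hedged sketch building on rawRecords above, and assuming the marc4j library (the question only says "a Marc library"), might look like this:

import java.io.ByteArrayInputStream
import org.marc4j.MarcStreamReader
import org.marc4j.marc.Record

// Parse each file's bytes into a marc4j Record; assumes one Marc21 record per file.
val marcRecords: RDD[Record] = rawRecords.map { bytes =>
  new MarcStreamReader(new ByteArrayInputStream(bytes)).next()
}

// If Record turns out not to be serializable, map it to plain fields
// (e.g. control numbers, titles) before shuffling or collecting.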

Please advise.

Answered by user5262448