Scala: reading binary files into Spark
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/45499827/
Reading binary File into Spark
Asked by MaatDeamon
I have a set of files, each containing a single record in Marc21 binary format. I would like to ingest the set of files as an RDD, where each element would be one record as binary data. Later on I will use a Marc library to convert each record into a Java object for further processing.
As of now, I am puzzled as to how I can read a binary file.
I have seen the following function:
binaryRecords(path: String, recordLength: Int, conf: Configuration)
However, it assumes a single file containing multiple records of the same length. My records will definitely be of different sizes, and besides, each one is in a separate file.
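For reference, a minimal sketch of how binaryRecords is typically called when the fixed-length assumption does hold; the path and the 100-byte record length below are placeholders, not taken from the question:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: binaryRecords splits a single file into fixed-length chunks.
val sc = new SparkContext(new SparkConf().setAppName("fixed-length-demo").setMaster("local[*]"))

// Each RDD element is an Array[Byte] of exactly recordLength bytes,
// which is why this only works when every record has the same size.
val fixedRecords = sc.binaryRecords("hdfs:///data/fixed_records.bin", recordLength = 100)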
Is there a way to get around that? How can I provide a length for each file? Would the only way be to calculate the length of each file and then read its record?
The other solution I see, obviously, would be to read the records in Java and serialize them into whatever format is convenient to ingest.
Please advise.
Answered by user5262448
Have you tried sc.binaryFiles() from Spark?
Here is the link to the documentation: https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/SparkContext.html#binaryFiles(java.lang.String,%20int)
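A minimal sketch of that approach, reusing the SparkContext sc from above and assuming the Marc21 files sit under a single directory; the path and the marc4j mention are illustrative and not part of the original answer:

import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

// binaryFiles returns one (path, PortableDataStream) pair per file,
// so records of different sizes in separate files are not a problem.
val files: RDD[(String, PortableDataStream)] = sc.binaryFiles("hdfs:///data/marc21/")

// Materialize each stream into a byte array: one whole Marc21 record per element.
val rawRecords: RDD[(String, Array[Byte])] = files.mapValues(_.toArray())

// A Marc library (e.g. marc4j) could then parse each Array[Byte] through a
// ByteArrayInputStream into a Java record object inside a map().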

