How to connect HBase and Spark using Python?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me), citing the original URL: http://stackoverflow.com/questions/38470114/



python, apache-spark, hbase, pyspark, apache-spark-sql

Asked by Def_Os

I have an embarrassingly parallel task for which I use Spark to distribute the computations. These computations are in Python, and I use PySpark to read and preprocess the data. The input data to my task is stored in HBase. Unfortunately, I've yet to find a satisfactory (i.e., easy to use and scalable) way to read/write HBase data from/to Spark using Python.

What I've explored previously:

  • Connecting from within my Python processes using happybase. This package allows connecting to HBase from Python by using HBase's Thrift API. This way, I basically skip Spark for data reading/writing and am missing out on potential HBase-Spark optimizations. Read speeds seem reasonably fast, but write speeds are slow. This is currently my best solution. (A minimal sketch of this approach follows the list.)

  • Using SparkContext's newAPIHadoopRDD and saveAsNewAPIHadoopDataset, which make use of HBase's MapReduce interface. Examples for this were once included in the Spark code base (see here). However, these are now considered outdated in favor of HBase's Spark bindings (see here). I've also found this method to be slow and cumbersome for reading (writing worked well), for example because the strings returned from newAPIHadoopRDD had to be parsed and transformed in various ways to end up with the Python objects I wanted. It also supported only a single column at a time. (A sketch of this approach also follows the list.)

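A minimal happybase sketch of the first approach, for reference. The Thrift host and the table/column names below are placeholders; this runs as plain Python in each process, entirely outside Spark:

import happybase

# Connect to the HBase Thrift server (host name is a placeholder)
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('testtable')

# Single-row write: column keys use the 'family:qualifier' form
table.put(b'row-1', {b'cf:col1': b'1.0'})

# Batching writes amortizes Thrift round trips, which helps the slow write path somewhat
with table.batch(batch_size=1000) as batch:
    batch.put(b'row-2', {b'cf:col1': b'2.0'})

# Single-row read and a prefix scan
print(table.row(b'row-1'))
for key, data in table.scan(row_prefix=b'row'):
    print(key, data)

connection.close()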

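And a rough sketch of the second approach, adapted from the PySpark HBase example that used to ship with the Spark code base. The ZooKeeper quorum and table name are placeholders, and the converter classes are the ones from the (since removed) Spark examples JAR, so that JAR (or equivalent converters) must be on the classpath:

from pyspark import SparkContext

sc = SparkContext(appName='hbase-newAPIHadoopRDD')

conf = {
    'hbase.zookeeper.quorum': 'zookeeper-host',
    'hbase.mapreduce.inputtable': 'testtable',
}

# Converters that turn HBase's key/value classes into strings on the Python side
key_conv = 'org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
value_conv = 'org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'

hbase_rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.hbase.mapreduce.TableInputFormat',
    'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    'org.apache.hadoop.hbase.client.Result',
    keyConverter=key_conv,
    valueConverter=value_conv,
    conf=conf)

# The values come back as strings that still have to be parsed into the Python objects you want
print(hbase_rdd.take(1))
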
Alternatives that I'm aware of:

  • I'm currently using Cloudera's CDH, and version 5.7.0 offers hbase-spark (CDH release notes, and a detailed blog post). This module (formerly known as SparkOnHBase) will officially be a part of HBase 2.0. Unfortunately, this wonderful solution seems to work only with Scala/Java.

  • Huawei's Spark-SQL-on-HBase / Astro (I don't see a difference between the two...). It does not look as robust and well-supported as I'd like my solution to be.

Answered by Def_Os

I found this comment by one of the makers of hbase-spark, which seems to suggest there is a way to use PySpark to query HBase using Spark SQL.

And indeed, the pattern described here can be applied to query HBase with Spark SQL using PySpark, as the following example shows:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = 'org.apache.hadoop.hbase.spark'

df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# ''.join(...split()) strips the whitespace so the multi-line JSON below collapses to a single-line string.
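# The catalog maps DataFrame columns to HBase: col0 becomes the row key, col1 goes to column family 'cf', qualifier 'col1'.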
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())


# Writing (alternatively: .option('catalog', catalog))
df.write\
.options(catalog=catalog)\
.format(data_source_format)\
.save()

# Reading
df = sqlc.read\
.options(catalog=catalog)\
.format(data_source_format)\
.load()

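Once the DataFrame is loaded, it can be queried with Spark SQL like any other source. A small usage sketch, sticking with the Spark 1.6-era SQLContext API used above (the temp table name is arbitrary):

# Register the HBase-backed DataFrame and query it with SQL
df.registerTempTable('testtable_df')
sqlc.sql("SELECT col0, col1 FROM testtable_df WHERE col1 = '2.0'").show()
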
I've tried hbase-spark-1.2.0-cdh5.7.0.jar (as distributed by Cloudera) for this, but ran into trouble (org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select when writing, java.util.NoSuchElementException: None.get when reading). As it turns out, the present version of CDH does not include the changes to hbase-spark that allow Spark SQL-HBase integration.

What does work for me is the shc Spark package, found here. The only change I had to make to the above script is to change:

data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

Here's how I submit the above script on my CDH cluster, following the example from the shc README:

spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /opt/cloudera/parcels/CDH/lib/hbase/conf/hbase-site.xml example.py

Most of the work on shc seems to already be merged into the hbase-spark module of HBase, for release in version 2.0. With that, Spark SQL querying of HBase is possible using the above-mentioned pattern (see https://hbase.apache.org/book.html#_sparksql_dataframes for details). My example above shows what it looks like for PySpark users.

Finally, a caveat: my example data above has only strings. Python data conversion is not supported by shc, so I had problems with integers and floats either not showing up in HBase at all or ending up as weird values. A possible workaround is sketched below.

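One workaround (just a sketch, not something shc documents) is to keep the catalog all-strings and do the casting yourself in the DataFrame, continuing the script above; df_float and its contents are made up for illustration:

from pyspark.sql.functions import col

# Hypothetical DataFrame where col1 is a float rather than a string
df_float = sc.parallelize([('a', 1.0), ('b', 2.0)]).toDF(schema=['col0', 'col1'])

# Cast numeric columns to strings before writing through shc...
df_float.withColumn('col1', col('col1').cast('string'))\
.write\
.options(catalog=catalog)\
.format(data_source_format)\
.save()

# ...and cast them back to the desired type after reading
df_back = sqlc.read\
.options(catalog=catalog)\
.format(data_source_format)\
.load()\
.withColumn('col1', col('col1').cast('double'))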