How to connect HBase and Spark using Python?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me), citing the original URL: http://stackoverflow.com/questions/38470114/



python, apache-spark, hbase, pyspark, apache-spark-sql

Asked by Def_Os

I have an embarrassingly parallel task for which I use Spark to distribute the computations. These computations are in Python, and I use PySpark to read and preprocess the data. The input data to my task is stored in HBase. Unfortunately, I've yet to find a satisfactory (i.e., easy to use and scalable) way to read/write HBase data from/to Spark using Python.

What I've explored previously:

  • Connecting from within my Python processes using happybase. This package allows connecting to HBase from Python by using HBase's Thrift API. This way, I basically skip Spark for data reading/writing and am missing out on potential HBase-Spark optimizations. Read speeds seem reasonably fast, but write speeds are slow. This is currently my best solution. (A minimal sketch of this approach follows the list.)

  • Using SparkContext's newAPIHadoopRDD and saveAsNewAPIHadoopDataset, which make use of HBase's MapReduce interface. Examples for this were once included in the Spark code base (see here). However, these are now considered outdated in favor of HBase's Spark bindings (see here). I've also found this method to be slow and cumbersome for reading (writing worked well), for example because the strings returned from newAPIHadoopRDD had to be parsed and transformed in various ways to end up with the Python objects I wanted. It also supported only a single column at a time. (A sketch of this approach also follows the list.)

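A minimal happybase sketch of the first approach, for reference. The Thrift host and the table/column names below are placeholders; this runs as plain Python in each process, entirely outside Spark:

import happybase

# Connect to the HBase Thrift server (host name is a placeholder)
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('testtable')

# Single-row write: column keys use the 'family:qualifier' form
table.put(b'row-1', {b'cf:col1': b'1.0'})

# Batching writes amortizes Thrift round trips, which helps the slow write path somewhat
with table.batch(batch_size=1000) as batch:
    batch.put(b'row-2', {b'cf:col1': b'2.0'})

# Single-row read and a prefix scan
print(table.row(b'row-1'))
for key, data in table.scan(row_prefix=b'row'):
    print(key, data)

connection.close()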

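And a rough sketch of the second approach, adapted from the PySpark HBase example that used to ship with the Spark code base. The ZooKeeper quorum and table name are placeholders, and the converter classes are the ones from the (since removed) Spark examples JAR, so that JAR (or equivalent converters) must be on the classpath:

from pyspark import SparkContext

sc = SparkContext(appName='hbase-newAPIHadoopRDD')

conf = {
    'hbase.zookeeper.quorum': 'zookeeper-host',
    'hbase.mapreduce.inputtable': 'testtable',
}

# Converters that turn HBase's key/value classes into strings on the Python side
key_conv = 'org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
value_conv = 'org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'

hbase_rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.hbase.mapreduce.TableInputFormat',
    'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    'org.apache.hadoop.hbase.client.Result',
    keyConverter=key_conv,
    valueConverter=value_conv,
    conf=conf)

# The values come back as strings that still have to be parsed into the Python objects you want
print(hbase_rdd.take(1))
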
Alternatives that I'm aware of:

  • I'm currently using Cloudera's CDH, and version 5.7.0 offers hbase-spark (CDH release notes, and a detailed blog post). This module (formerly known as SparkOnHBase) will officially be a part of HBase 2.0. Unfortunately, this wonderful solution seems to work only with Scala/Java.

  • Huawei's Spark-SQL-on-HBase / Astro (I don't see a difference between the two...). It does not look as robust and well-supported as I'd like my solution to be.

Answered by Def_Os

I found this comment by one of the makers of hbase-spark, which seems to suggest there is a way to use PySpark to query HBase using Spark SQL.

And indeed, the pattern described here can be applied to query HBase with Spark SQL using PySpark, as the following example shows:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = 'org.apache.hadoop.hbase.spark'

df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# ''.join(...split()) strips the whitespace so the multi-line JSON below collapses to a single-line string.
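# The catalog maps DataFrame columns to HBase: col0 becomes the row key, col1 goes to column family 'cf', qualifier 'col1'.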
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())


# Writing (alternatively: .option('catalog', catalog))
df.write\
.options(catalog=catalog)\
.format(data_source_format)\
.save()

# Reading
df = sqlc.read\
.options(catalog=catalog)\
.format(data_source_format)\
.load()

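Once the DataFrame is loaded, it can be queried with Spark SQL like any other source. A small usage sketch, sticking with the Spark 1.6-era SQLContext API used above (the temp table name is arbitrary):

# Register the HBase-backed DataFrame and query it with SQL
df.registerTempTable('testtable_df')
sqlc.sql("SELECT col0, col1 FROM testtable_df WHERE col1 = '2.0'").show()
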
I've tried hbase-spark-1.2.0-cdh5.7.0.jar (as distributed by Cloudera) for this, but ran into trouble (org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select when writing, java.util.NoSuchElementException: None.get when reading). As it turns out, the present version of CDH does not include the changes to hbase-spark that allow Spark SQL-HBase integration.

What does work for me is the shc Spark package, found here. The only change I had to make to the above script is to change:

data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

Here's how I submit the above script on my CDH cluster, following the example from the shc README:

spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /opt/cloudera/parcels/CDH/lib/hbase/conf/hbase-site.xml example.py

Most of the work on shc seems to already be merged into the hbase-spark module of HBase, for release in version 2.0. With that, Spark SQL querying of HBase is possible using the above-mentioned pattern (see https://hbase.apache.org/book.html#_sparksql_dataframes for details). My example above shows what it looks like for PySpark users.

Finally, a caveat: my example data above has only strings. Python data conversion is not supported by shc, so I had problems with integers and floats either not showing up in HBase at all or ending up as weird values. A possible workaround is sketched below.

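One workaround (just a sketch, not something shc documents) is to keep the catalog all-strings and do the casting yourself in the DataFrame, continuing the script above; df_float and its contents are made up for illustration:

from pyspark.sql.functions import col

# Hypothetical DataFrame where col1 is a float rather than a string
df_float = sc.parallelize([('a', 1.0), ('b', 2.0)]).toDF(schema=['col0', 'col1'])

# Cast numeric columns to strings before writing through shc...
df_float.withColumn('col1', col('col1').cast('string'))\
.write\
.options(catalog=catalog)\
.format(data_source_format)\
.save()

# ...and cast them back to the desired type after reading
df_back = sqlc.read\
.options(catalog=catalog)\
.format(data_source_format)\
.load()\
.withColumn('col1', col('col1').cast('double'))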