How to Access Hive via Python?

Disclaimer: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/21370431/

Tags: python, hadoop, hive

Asked by Matthew Moisen

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

When I add this to /etc/profile:

export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py

I can then do the imports as listed in the link, with the exception of from hive import ThriftHive, which actually needs to be:

from hive_service import ThriftHive

Next, the port in the example was 10000, which caused the program to hang when I tried it. The default Hive Thrift port is 9083, which stopped the hanging.

So I set it up like so:

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from hive_service import ThriftHive  # the corrected import mentioned above
try:
    transport = TSocket.TSocket('<node-with-metastore>', 9083)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE test(c1 int)")

    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)

I received the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
    self.recv_execute()
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'

But inspecting the ThriftHive.py file reveals the execute method within the Client class.

How may I use Python to access Hive?

Answered by Naveen Subramani

You can use the hive library; for that, you import the Hive class with from hive import ThriftHive.

Try this example:

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
  transport = TSocket.TSocket('localhost', 10000)
  transport = TTransport.TBufferedTransport(transport)
  protocol = TBinaryProtocol.TBinaryProtocol(transport)
  client = ThriftHive.Client(protocol)
  transport.open()
  client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
  client.execute("LOAD TABLE LOCAL INPATH '/path' INTO TABLE r")
  client.execute("SELECT * FROM r")
  while True:
    row = client.fetchOne()
    if row is None:
      break
    print row

  client.execute("SELECT * FROM r")
  print client.fetchAll()
  transport.close()
except Thrift.TException, tx:
  print '%s' % (tx.message)

Answered by hustdelta

I suspect you are using HiveServer2, which is why the code doesn't work.

You can use pyhs2 to access your Hive correctly, with example code like this:

import pyhs2

with pyhs2.connect(host='localhost',
               port=10000,
               authMechanism="PLAIN",
               user='root',
               password='test',
               database='default') as conn:
    with conn.cursor() as cur:
        #Show databases
        print cur.getDatabases()

        #Execute query
        cur.execute("select * from table")

        #Return column info from query
        print cur.getSchema()

        #Fetch table results
        for i in cur.fetch():
            print i

Note that you may need to install python-devel.x86_64 and cyrus-sasl-devel.x86_64 before installing pyhs2 with pip.

Hope this helps.

Reference: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

Answered by Ting Yu

The examples above are a bit out of date. One new example is here:

import pyhs2 as hive
import getpass
DEFAULT_DB = 'default'
DEFAULT_SERVER = '10.37.40.1'
DEFAULT_PORT = 10000
DEFAULT_DOMAIN = 'PAM01-PRD01.IBM.COM'

u = raw_input('Enter PAM username: ')
s = getpass.getpass()
connection = hive.connect(host=DEFAULT_SERVER, port= DEFAULT_PORT, authMechanism='LDAP', user=u + '@' + DEFAULT_DOMAIN, password=s)
statement = "select * from user_yuti.Temp_CredCard where pir_post_dt = '2014-05-01' limit 100"
cur = connection.cursor()

cur.execute(statement)
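# note: despite the name df, fetchall() returns a plain list of rows here, not a pandas dataframe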
df = cur.fetchall() 

In addition to the standard Python program, a few libraries need to be installed to allow Python to build the connection to the Hadoop database.

1. Pyhs2, Python Hive Server 2 Client Driver
2. Sasl, Cyrus-SASL bindings for Python
3. Thrift, Python bindings for the Apache Thrift RPC system
4. PyHive, Python interface to Hive

Remember to change the permissions of the executable:

chmod +x test_hive2.py
./test_hive2.py

Hope it helps. Reference: https://sites.google.com/site/tingyusz/home/blogs/hiveinpython

Answered by python-starter

The Python program below should work to access Hive tables:

import commands

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "

status, output = commands.getstatusoutput(cmd)

if status == 0:
   print output
else:
   print "error"

Answered by Tristan Reid

I believe the easiest way is to use PyHive.

To install you'll need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case.

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager your distribution uses. For Windows there are some options on GNU.org, and you can download a binary installer. On a Mac, SASL should be available if you've installed the Xcode developer tools (xcode-select --install in Terminal).
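
For reference, the install commands would look roughly like this, using the package names given in this answer and in hustdelta's answer above (adjust for your distribution):

# Debian/Ubuntu
sudo apt-get install libsasl2-dev
# RHEL/CentOS
sudo yum install cyrus-sasl-devel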

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have the Hive connection, you have options for how to use it. You can just straight-up query:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
  use_result(result)

...or use the connection to make a pandas dataframe:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)

Answered by Tagar

pyhs2 is no longer maintained. A better alternative is impyla.

Don't be confused that some of the examples below are about Impala; just change the port to 10000 (the HiveServer2 default) and it'll work the same way as the Impala examples. The same protocol (Thrift) is used for both Impala and Hive.

https://github.com/cloudera/impyla

It has many more features than pyhs2; for example, it has Kerberos authentication, which is a must for us.

from impala.dbapi import connect
conn = connect(host='my.host.com', port=10000)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print cursor.description  # prints the result set's schema
results = cursor.fetchall()

##
cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    process(row)
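
If you want the result set as a pandas dataframe, impyla also ships a small helper; a sketch, assuming the cursor from above:

from impala.util import as_pandas

# re-run the query and materialize the rows as a pandas DataFrame
cursor.execute('SELECT * FROM mytable LIMIT 100')
df = as_pandas(cursor)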

Cloudera is now putting more effort into hs2client https://github.com/cloudera/hs2client, which is a C/C++ HiveServer2/Impala client. It might be a better option if you push a lot of data to/from Python. (It has Python bindings too - https://github.com/cloudera/hs2client/tree/master/python)

Some more information on impyla:

Answered by Aravind Krishnakumar

Here's a quick hack to connect Hive and Python:

from pyhive import hive
cursor = hive.connect('YOUR_HOST_NAME').cursor()
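# note: 'async' became a reserved word in Python 3.7; newer PyHive versions spell this keyword async_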
cursor.execute('SELECT * from table_name LIMIT 5',async=True)
print cursor.fetchall()

Output: List of Tuples

Answered by eycheu

You can use the Python JayDeBeApi package to create a DB-API connection from the Hive or Impala JDBC driver, then pass the connection to the pandas.read_sql function to return the data in a pandas dataframe.

import jaydebeapi
# Apparently need to load the jar files for the first time for impala jdbc driver to work 
conn = jaydebeapi.connect('com.cloudera.hive.jdbc41.HS2Driver',
['jdbc:hive2://host:10000/db;AuthMech=1;KrbHostFQDN=xxx.com;KrbServiceName=hive;KrbRealm=xxx.COM', "",""],
jars=['/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/HiveJDBC41.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/TCLIServiceClient.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-codec-1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_metastore.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_service.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/log4j-1.2.14.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/ql.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/zookeeper-3.4.6.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ImpalaJDBC41.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/TCLIServiceClient.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-codec-1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_metastore.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_service.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/log4j-1.2.14.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ql.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/zookeeper-3.4.6.jar'
])

# the previous call have initialized the jar files, technically this call needs not include the required jar files
impala_conn = jaydebeapi.connect('com.cloudera.impala.jdbc41.Driver',
['jdbc:impala://host:21050/db;AuthMech=1;KrbHostFQDN=xxx.com;KrbServiceName=impala;KrbRealm=xxx.COM',"",""],
jars=['/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/HiveJDBC41.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/TCLIServiceClient.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-codec-1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_metastore.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_service.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/log4j-1.2.14.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/ql.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/zookeeper-3.4.6.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ImpalaJDBC41.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/TCLIServiceClient.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-codec-1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_metastore.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_service.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/log4j-1.2.14.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ql.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/zookeeper-3.4.6.jar'
])

import pandas as pd
df1 = pd.read_sql("SELECT * FROM tablename", conn)
df2 = pd.read_sql("SELECT * FROM tablename", impala_conn)

conn.close()
impala_conn.close()

Answered by Kurt Fehlhauer

To connect using a username/password and specifying ports, the code looks like this:

from pyhive import presto

cursor = presto.connect(host='host.example.com',
                    port=8081,
                    username='USERNAME:PASSWORD').cursor()

sql = 'select * from table limit 10'

cursor.execute(sql)

print(cursor.fetchone())
print(cursor.fetchall())
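
The example above goes through pyhive's Presto interface. For Hive itself the call is analogous; a sketch, assuming your own host and credentials, and noting that password authentication in pyhive typically requires auth='LDAP':

from pyhive import hive

conn = hive.Connection(host='host.example.com',
                       port=10000,
                       username='USERNAME',
                       password='PASSWORD',
                       auth='LDAP')
cursor = conn.cursor()
cursor.execute('select * from table limit 10')
print(cursor.fetchall())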

Answered by joceratops

Similar to eycheu's solution, but a little more detailed.

Here is an alternative solution, specifically for hive2, that does not require PyHive or installing system-wide packages. I am working on a Linux environment where I do not have root access, so installing the SASL dependencies as mentioned in Tristan's post was not an option for me:

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager for your distribution.

Specifically, this solution focuses on leveraging the Python package JayDeBeApi. In my experience, installing this one extra package on top of a Python Anaconda 2.7 install was all I needed. This package leverages Java (JDK), which I am assuming is already set up.

Step 1: Install JayDeBeApi

pip install jaydebeapi

Step 2: Download appropriate drivers for your environment:

Store all .jar files in a directory. I will refer to this directory as /path/to/jar/files/.

Step 3: Identify your system's authentication mechanism:

In the pyhive solutions listed above, I've seen both PLAIN and Kerberos listed as the authentication mechanism. Note that your JDBC connection URL will depend on the authentication mechanism you are using. I will explain the Kerberos solution without passing a username/password. Here is more information on Kerberos authentication and options.

Create a Kerberos ticket if one is not already created

$ kinit

Tickets can be viewed via klist.

You are now ready to make the connection via python:

import jaydebeapi
import glob
# Creates a list of jar files in the /path/to/jar/files/ directory
jar_files = glob.glob('/path/to/jar/files/*.jar')

host='localhost'
port='10000'
database='default'

# note: your driver will depend on your environment and drivers you've
# downloaded in step 2
# this is the driver for my environment (jdbc3, hive2, cloudera enterprise)
driver='com.cloudera.hive.jdbc3.HS2Driver'

conn_hive = jaydebeapi.connect(
    driver,
    'jdbc:hive2://' + host + ':' + port + '/' + database + ';AuthMech=1;KrbHostFQDN=' + host + ';KrbServiceName=hive',
    jars=jar_files)

If you only care about reading, then you can read it directly into a pandas dataframe with ease via eycheu's solution:

import pandas as pd
df = pd.read_sql("select * from table", conn_hive)

Otherwise, here is a more versatile communication option:

cursor = conn_hive.cursor()
sql_expression = "select * from table"
cursor.execute(sql_expression)
results = cursor.fetchall()

You can imagine that if you wanted to create a table, you would not need to "fetch" the results, but could submit a CREATE TABLE query instead.
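
For example, a minimal sketch of submitting DDL over the same connection (hypothetical table name):

cursor = conn_hive.cursor()
# no fetch needed for DDL; the statement either succeeds or raises
cursor.execute("CREATE TABLE IF NOT EXISTS my_new_table (id INT, name STRING)")
cursor.close()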