Hive Data to Pandas DataFrame

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/38218200/

Time: 2020-09-14 01:31:34  Source: igfitidea

python pandas hadoop hive

Asked by ankita gupta

Newbie to Python.

How can I save data from Hive to a Pandas DataFrame?

import pyhs2
import pandas as pd

with pyhs2.connect(host=host, port=20000, authMechanism="PLAIN", user=user,
                   password=password, database=database) as conn:
    with conn.cursor() as cur:
        # Show databases
        print(cur.getDatabases())

        # Execute query
        cur.execute(query)

        # Return column info from query
        print(cur.getSchema())

        # Fetch table results
        for i in cur.fetch():
            print(i)

        # Attempt to build the DataFrame using the column names
        columnNames = [a['columnName'] for a in cur.getSchema()]
        print(columnNames)
        df1 = pd.DataFrame(cur.fetch(), columnNames)

I tried using the column names, but it didn't work.

Please suggest something.

Answered by Saftophobia

pd.read_sql() (pandas 0.24.0) takes a DB connection. Use a PyHive connection directly with pandas.read_sql() as follows:

from pyhive import hive
import pandas as pd

# open connection
conn = hive.Connection(host=host, port=20000, ...)

# query the table to a new dataframe
dataframe = pd.read_sql("SELECT id, name FROM test.example_table", conn)

The DataFrame's columns will be named after the Hive table's columns. You can rename them during or after DataFrame creation if needed:

  • via HiveQL: SELECT id AS new_column_name ...
  • via the columns attribute of the DataFrame returned by pd.read_sql()
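
Both renaming options can be tried without a Hive cluster. The sketch below is only an illustration: it assumes an in-memory sqlite3 database as a stand-in for the PyHive connection, and the table contents are made up. With a real Hive connection the pd.read_sql() calls would look the same.

```python
import sqlite3

import pandas as pd

# Stand-in for the Hive connection (assumption: sqlite3 replaces PyHive here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO example_table VALUES (?, ?)", [(1, "a"), (2, "b")])

# Option 1: rename in the query itself with an alias.
aliased = pd.read_sql(
    "SELECT id AS new_column_name, name FROM example_table", conn
)

# Option 2: rename after creation by assigning to the columns attribute.
renamed = pd.read_sql("SELECT id, name FROM example_table", conn)
renamed.columns = ["new_column_name", "name"]

print(list(aliased.columns))
print(list(renamed.columns))
```

The alias keeps the renaming next to the query, while assigning to columns is handy when the query text cannot be changed.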

Answered by ML_Passion

You can try this (I'm pretty sure it will work):

res = cur.getSchema()
description = [col['columnName'] for col in res]  # column names of the table

# keep only the part after the last period, in case a name looks like "table.column"
headers = [x.split(".")[-1] for x in description]

df = pd.DataFrame(cur.fetchall(), columns=headers)

df.head(n=20)
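
Whether the returned column names carry a "table." prefix may depend on the setup, so it can help to strip the prefix in a way that tolerates both forms. A minimal sketch in plain Python; the helper name strip_table_prefix and the sample names are hypothetical, not part of any Hive library:

```python
def strip_table_prefix(column_name):
    """Return the part after the last period, or the whole name if there is none."""
    return column_name.rsplit(".", 1)[-1]


# Hypothetical column names: two with a table prefix, one without.
raw_names = ["example_table.id", "example_table.name", "total"]
headers = [strip_table_prefix(name) for name in raw_names]
print(headers)

# Indexing with [1] instead fails for names that contain no period.
try:
    "total".split(".")[1]
    index_error_raised = False
except IndexError:
    index_error_raised = True
```

Taking the last element avoids the IndexError that a fixed index of 1 raises for unprefixed names.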

Answered by ankita gupta

I had already fetched the data once and was then trying to fetch it again, so I was getting an empty DataFrame.

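import pandas as pd

def query_to_df(cur, query):  # wrapper name is illustrative
    cur.execute(query)
    # fetch the rows exactly once; a second fetch would return nothing
    val = cur.fetchall()
    columnNames = [a['columnName'] for a in cur.getSchema()]
    df = pd.DataFrame(data=val, columns=columnNames)
    # print(df)
    return df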
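
The cause described in this answer, a cursor that is already drained by an earlier fetch, can be reproduced without Hive. The sketch below assumes a sqlite3 cursor as a stand-in; the pyhs2 cursor has a different API (getSchema() rather than description), but the answer above reports the same symptom of an empty second fetch.

```python
import sqlite3

# Stand-in database (assumption: sqlite3 replaces the Hive cursor here).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE example_table (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO example_table VALUES (?, ?)", [(1, "a"), (2, "b")])

cur.execute("SELECT id, name FROM example_table")
column_names = [d[0] for d in cur.description]  # read names before fetching

first = cur.fetchall()   # consumes every row of the result set
second = cur.fetchall()  # nothing is left, so this is an empty list

print(first)
print(second)
```

Passing the second result to pd.DataFrame would give an empty frame, which is why the rows should be stored in a variable on the first fetch and reused.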