
Warning: this content is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/36919825/


Pandas dataframe in pyspark to hive

python-2.7, pandas, hive, pyspark

Asked by thenakulchawla

How to send a pandas dataframe to a hive table?

I know that if I have a Spark dataframe, I can register it as a temporary table using

df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")

But when I try to call registerTempTable on the pandas dataFrame, I get the error below:

AttributeError: 'DataFrame' object has no attribute 'registerTempTable'

Is there a way for me to use a pandas dataFrame to register a temp table, or to convert it to a Spark dataFrame and then register a temp table, so that I can send it back to Hive?

Accepted answer by MaxU

I guess you are trying to use a pandas df instead of Spark's DF.

Pandas DataFrame has no such method as registerTempTable.

You may try to create a Spark DF from the pandas DF.

UPDATE:

I've tested it under Cloudera (with the Anaconda parcel installed, which includes the pandas module).

Make sure that you have set PYSPARK_PYTHON to your Anaconda Python installation (or another one containing the pandas module) on all your Spark workers (usually in spark-conf/spark-env.sh).
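For example, the worker-side setting might look like this (the Anaconda path below is illustrative; substitute your own installation's path):

```shell
# spark-conf/spark-env.sh on each worker (path is an example)
export PYSPARK_PYTHON=/opt/anaconda/bin/python
```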

Here is the result of my test:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+

>>> sdf.printSchema()
root
 |-- A: long (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)

Answered by Ming.Xu

First you need to convert the pandas dataframe to a Spark dataframe:

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.createDataFrame(pd_df)

Then you can create a temp table, which is in memory:

df.registerTempTable('tmp')

Now you can use Hive QL to save the data into Hive:

hive_context.sql("""insert overwrite table target partition(p='p') select a, b from tmp""")

Note: the hive_context must be kept the same throughout (a temp table is only visible to the context that registered it)!

Answered by Abhi

I converted my pandas df to a temp table by:

1) Converting the pandas dataframe to a Spark dataframe:

spark_df=sqlContext.createDataFrame(Pandas_df)

2) Making sure that the data is migrated properly:

spark_df.select("*").show()

3) Converting the Spark dataframe to a temp table for querying:

spark_df.registerTempTable("table_name")

Cheers..
