Pandas dataframe in pyspark to hive
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36919825/
Asked by thenakulchawla
How to send a pandas dataframe to a hive table?
I know if I have a spark dataframe, I can register it to a temporary table using
df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")
but when I try to use the pandas dataFrame to registerTempTable, I get the below error:
AttributeError: 'DataFrame' object has no attribute 'registerTempTable'
Is there a way for me to use a pandas dataFrame to register a temp table, or to convert it to a Spark dataFrame and then use that to register a temp table, so that I can send it back to Hive?
Accepted answer by MaxU
I guess you are trying to use a pandas df instead of Spark's DF.
Pandas DataFrame has no such method as registerTempTable.
You may try to create a Spark DF from your pandas DF.
UPDATE:
I've tested it under Cloudera (with the Anaconda parcel installed, which includes the Pandas module).
Make sure that you have set PYSPARK_PYTHON to your Anaconda Python installation (or another one containing the Pandas module) on all your Spark workers (usually in spark-conf/spark-env.sh).
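For example, a minimal spark-env.sh entry might look like this (the Anaconda parcel path below is just an illustration; point it at whatever Python installation actually has Pandas on your workers):

# in spark-conf/spark-env.sh on every worker (path shown is hypothetical)
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python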
Here is the result of my test:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+
>>> sdf.printSchema()
root
|-- A: long (nullable = true)
|-- B: long (nullable = true)
|-- C: long (nullable = true)
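On Spark 1.4 or later (an assumption about your version), you could presumably also skip the temp-table step and write the converted DF straight to a Hive table; a minimal sketch, with the table name purely illustrative:

>>> sdf.write.saveAsTable("pandas_to_hive_test")  # requires sqlContext to be a HiveContext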
Answer by Ming.Xu
First you need to convert the pandas dataframe to a Spark dataframe:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.createDataFrame(pd_df)
Then you can create a temp table, which is held in memory:
df.registerTempTable('tmp')
Now you can use Hive QL to save the data into Hive:
hive_context.sql("""insert overwrite table target partition(p='p') select a,b from tmp""")
Note: hive_context must remain the same one throughout!
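To make that concrete, here is a minimal end-to-end sketch using a single hive_context (pd_df, target, and the partition value are the illustrative names from above, and the target Hive table is assumed to already exist):

from pyspark.sql import HiveContext

hive_context = HiveContext(sc)                    # create once and reuse
spark_df = hive_context.createDataFrame(pd_df)    # pd_df is your pandas DataFrame
spark_df.registerTempTable('tmp')                 # visible only within this hive_context
# a fresh HiveContext would not see the 'tmp' temp table, so reuse the same one:
hive_context.sql("""insert overwrite table target partition(p='p') select a,b from tmp""")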
Answer by Abhi
I converted my pandas df to a temp table by
1) Converting the pandas dataframe to a Spark dataframe:
spark_df = sqlContext.createDataFrame(Pandas_df)
2) Make sure that the data is migrated properly:
spark_df.select("*").show()
3) Convert the Spark dataframe to a temp table for querying:
spark_df.registerTempTable("table_name")
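From there, persisting it to Hive is the same create-table-as-select shown in the question (table_name2 is illustrative):

sqlContext.sql("create table table_name2 as select * from table_name")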
Cheers..