Pandas dataframe in pyspark to hive
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36919825/
Asked by thenakulchawla
How to send a pandas dataframe to a hive table?
I know if I have a spark dataframe, I can register it to a temporary table using
df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")
but when I try to use the pandas dataFrame to registerTempTable, I get the below error:
AttributeError: 'DataFrame' object has no attribute 'registerTempTable'
Is there a way for me to use a pandas dataFrame to register a temp table, or to convert it to a Spark dataFrame and then use that to register a temp table, so that I can send it back to Hive?
Accepted answer by MaxU
I guess you are trying to use a pandas df instead of Spark's DF.
Pandas DataFrame has no such method as registerTempTable.
You may try to create a Spark DF from your pandas DF.
UPDATE:
I've tested it under Cloudera (with the Anaconda parcel installed, which includes the Pandas module).
Make sure that you have set PYSPARK_PYTHON to your Anaconda Python installation (or another one containing the Pandas module) on all your Spark workers (usually in spark-conf/spark-env.sh).
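For example, a minimal spark-env.sh entry might look like this (the Anaconda parcel path below is just an illustration; point it at whatever Python installation actually has Pandas on your workers):

# in spark-conf/spark-env.sh on every worker (path shown is hypothetical)
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python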
Here is the result of my test:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+
>>> sdf.printSchema()
root
|-- A: long (nullable = true)
|-- B: long (nullable = true)
|-- C: long (nullable = true)
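On Spark 1.4 or later (an assumption about your version), you could presumably also skip the temp-table step and write the converted DF straight to a Hive table; a minimal sketch, with the table name purely illustrative:

>>> sdf.write.saveAsTable("pandas_to_hive_test")  # requires sqlContext to be a HiveContext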
Answer by Ming.Xu
First you need to convert the pandas dataframe to a Spark dataframe:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.createDataFrame(pd_df)
Then you can create a temp table, which is held in memory:
df.registerTempTable('tmp')
Now you can use Hive QL to save the data into Hive:
hive_context.sql("""insert overwrite table target partition(p='p') select a,b from tmp""")
Note: hive_context must remain the same one throughout!
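To make that concrete, here is a minimal end-to-end sketch using a single hive_context (pd_df, target, and the partition value are the illustrative names from above, and the target Hive table is assumed to already exist):

from pyspark.sql import HiveContext

hive_context = HiveContext(sc)                    # create once and reuse
spark_df = hive_context.createDataFrame(pd_df)    # pd_df is your pandas DataFrame
spark_df.registerTempTable('tmp')                 # visible only within this hive_context
# a fresh HiveContext would not see the 'tmp' temp table, so reuse the same one:
hive_context.sql("""insert overwrite table target partition(p='p') select a,b from tmp""")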
Answer by Abhi
I converted my pandas df to a temp table by
1) Converting the pandas dataframe to a Spark dataframe:
spark_df = sqlContext.createDataFrame(Pandas_df)
2) Make sure that the data is migrated properly:
spark_df.select("*").show()
3) Convert the Spark dataframe to a temp table for querying:
spark_df.registerTempTable("table_name")
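From there, persisting it to Hive is the same create-table-as-select shown in the question (table_name2 is illustrative):

sqlContext.sql("create table table_name2 as select * from table_name")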
Cheers..