Using Pandas in Spark

Question

Asked by Zop

I have a beginner question about Spark and pandas. I would like to use pandas, numpy, etc. with Spark, but when I import a library I get an error. Can you help me, please? This is my code:

from pyspark import SparkContext, SQLContext
from pyspark import SparkConf
import pandas

# Config
conf = SparkConf().setAppName("Script")
sc = SparkContext(conf=conf)
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)
sqlCtx = SQLContext(sc)

# Importation of csv out of HDFS
data_name = "file_on_hdfs.csv"
data_textfile = sc.textFile(data_name)

This is the error:

ImportError: No module named pandas

How can I use pandas? I am not running in local mode.

Answer 1

Answered by AndreyF

Spark has its own DataFrame object, which can be created from RDDs.

You can still use libraries such as numpy, but you must install them first.
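As a rough illustration of building a Spark DataFrame from an RDD without pandas, here is a minimal sketch. It assumes the CSV has no header row and that sc and sqlCtx are the SparkContext and SQLContext from the question; the helper names parse_line and load_as_spark_dataframe and the column names are made up for this example.

```python
import csv


def parse_line(line):
    """Split one CSV text line into a tuple of string fields."""
    # csv.reader handles quoted fields that contain commas.
    return tuple(next(csv.reader([line])))


def load_as_spark_dataframe(sc, sqlCtx, path, column_names):
    """Read a headerless CSV file as an RDD and turn it into a Spark DataFrame.

    sc and sqlCtx are passed in by the caller; column_names is a list of
    strings, one per CSV column (hypothetical names chosen by the caller).
    """
    rows = sc.textFile(path).map(parse_line)
    # All columns are inferred as strings because parse_line does no casting.
    return sqlCtx.createDataFrame(rows, column_names)
```

With the question's variables, a call might look like df = load_as_spark_dataframe(sc, sqlCtx, data_name, ["col_a", "col_b"]), where the column names are placeholders. If you later want a pandas object, df.toPandas() collects the whole result onto the driver, so pandas would still need to be installed there.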

Answer 2

Answered by Beyhan Gül

You can use Apache Arrow for this problem.

It is still an initial version, but it may become more powerful in the future (we will see).

For installation: click

Answer 3

Answered by Abu Tahir

Check whether pandas is installed on your machine by running pip list | grep 'pandas' in a terminal. If there is a match, run apt-get update. If you are using a multi-node cluster, then yes, you need to install pandas on every client machine.

It is better to try the Spark version of DataFrame, but if you would still like to use pandas, the above method should work.
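The same check can also be done from Python itself, which may be handy when you cannot easily open a shell on a machine. This is only a sketch using the standard library; the function name missing_modules is made up for this example, and it only reports on the interpreter that runs it.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_modules(["pandas", "numpy"]) on the driver returns an empty list when both are installed. On a multi-node cluster the same function would have to run on each worker as well (for example inside a Spark map over a small RDD), and even then a single job is not guaranteed to touch every node, so treat it as a quick sanity check rather than proof.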
