Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/39067505/


Pyspark: display a spark data frame in a table format

Tags: python, pandas, pyspark, spark-dataframe

Asked by Edamame

I am using pyspark to read a parquet file like below:

my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')

Then when I do my_df.take(5), it will show [Row(...)], instead of a table format like when we use the pandas data frame.

Is it possible to display the data frame in a table format like pandas data frame? Thanks!

Answered by eddies

The show method does what you're looking for.

For example, given the following dataframe of 3 rows, I can print just the first two rows like this:

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)

which yields:

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
+---+---+
only showing top 2 rows

Answered by Louis Yang

As mentioned by @Brent in a comment on @maxymoo's answer, you can try

df.limit(10).toPandas()

to get a prettier table in Jupyter. But this can take some time to run if you are not caching the Spark dataframe. Also, .limit() will not keep the order of the original Spark dataframe.


Answered by maxymoo

Yes: call the toPandas method on your dataframe and you'll get an actual pandas dataframe!

Answered by Giorgos Myrianthous

Let's say we have the following Spark DataFrame:

df = sqlContext.createDataFrame([(1, "Mark", "Brown"), (2, "Tom", "Anderson"), (3, "Joshua", "Peterson")], ('id', 'firstName', 'lastName'))

There are typically three different ways you can use to print the content of the dataframe:

Print Spark DataFrame

The most common way is to use the show() function:

>>> df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+

Print Spark DataFrame vertically

Say that you have a fairly large number of columns and your dataframe doesn't fit on the screen. You can print the rows vertically. For example, the following command will print the top two rows, vertically, without any truncation.

>>> df.show(n=2, truncate=False, vertical=True)
-RECORD 0-------------
 id        | 1        
 firstName | Mark     
 lastName  | Brown    
-RECORD 1-------------
 id        | 2        
 firstName | Tom      
 lastName  | Anderson 
only showing top 2 rows

Convert to Pandas and print Pandas DataFrame

Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using .toPandas() and finally print() it.

>>> df_pd = df.toPandas()
>>> print(df_pd)
   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

Note that this is not recommended when you have to deal with fairly large dataframes, as Pandas needs to load all the data into memory.
