Pyspark: display a spark data frame in a table format

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original source: http://stackoverflow.com/questions/39067505/
Question by Edamame
I am using pyspark to read a parquet file like below:
my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')
Then when I do my_df.take(5), it will show [Row(...)] instead of a table format like when we use the pandas data frame.
Is it possible to display the data frame in a table format like pandas data frame? Thanks!
Answer by eddies
The show method does what you're looking for.
For example, given the following dataframe of 3 rows, I can print just the first two rows like this:
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)
which yields:
+---+---+
| k| v|
+---+---+
|foo| 1|
|bar| 2|
+---+---+
only showing top 2 rows
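For reference, the grid layout that show() prints can be imitated in a few lines of plain Python. The helper below (show_rows is a hypothetical name, not part of PySpark) is only a sketch of the layout, not Spark's actual implementation:

```python
# A tiny pure-Python sketch of the grid layout df.show() prints.
# show_rows is a hypothetical helper, not part of PySpark.
def show_rows(cols, rows, n=20):
    # Each column is as wide as its longest value (header included)
    widths = [max(len(str(v)) for v in (c, *(r[i] for r in rows)))
              for i, c in enumerate(cols)]
    sep = "+" + "+".join("-" * w for w in widths) + "+"
    def line(vals):
        return "|" + "|".join(str(v).rjust(w) for v, w in zip(vals, widths)) + "|"
    out = [sep, line(cols), sep] + [line(r) for r in rows[:n]] + [sep]
    if n < len(rows):
        out.append(f"only showing top {n} rows")
    return "\n".join(out)

print(show_rows(("k", "v"), [("foo", 1), ("bar", 2), ("baz", 3)], n=2))
```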
Answer by Louis Yang
As mentioned by @Brent in the comment of @maxymoo's answer, you can try

df.limit(10).toPandas()

to get a prettier table in Jupyter. But this can take some time to run if you are not caching the spark dataframe. Also, .limit() will not keep the order of the original spark dataframe.
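Because .limit() gives no ordering guarantee, a common workaround is to sort before taking rows. On the Spark side that would be something like df.orderBy("k").limit(2).toPandas(); below is a minimal pandas-only sketch of the same idea (the column name "k" and the sample data are hypothetical):

```python
import pandas as pd

# Stand-in data; on the Spark side this corresponds to
# df.orderBy("k").limit(2).toPandas()
pdf = pd.DataFrame({"k": ["foo", "bar", "baz"], "v": [1, 2, 3]})

# Sorting first makes the previewed rows deterministic
preview = pdf.sort_values("k").head(2).reset_index(drop=True)
print(preview)
```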
Answer by maxymoo
Yes: call the toPandas method on your dataframe and you'll get an actual pandas dataframe!
Answer by Giorgos Myrianthous
Let's say we have the following Spark DataFrame:
df = sqlContext.createDataFrame([(1, "Mark", "Brown"), (2, "Tom", "Anderson"), (3, "Joshua", "Peterson")], ('id', 'firstName', 'lastName'))
There are typically three different ways to print the content of the dataframe:
Print Spark DataFrame
The most common way is to use the show() function:
>>> df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
Print Spark DataFrame vertically
Say that you have a fairly large number of columns and your dataframe doesn't fit on the screen. You can print the rows vertically instead. For example, the following command will print the top two rows, vertically, without any truncation.
>>> df.show(n=2, truncate=False, vertical=True)
-RECORD 0-------------
id | 1
firstName | Mark
lastName | Brown
-RECORD 1-------------
id | 2
firstName | Tom
lastName | Anderson
only showing top 2 rows
Convert to Pandas and print Pandas DataFrame
Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using .toPandas() and finally print() it.
>>> df_pd = df.toPandas()
>>> print(df_pd)
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
Note that this is not recommended when you have to deal with fairly large dataframes, as Pandas needs to load all the data into memory.
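Once converted, the result is a plain pandas DataFrame, so the usual pandas display options control how it renders. A minimal sketch assuming pandas is installed (the small frame below is a hypothetical stand-in for the result of df.toPandas()):

```python
import pandas as pd

# Hypothetical small frame standing in for the result of df.toPandas()
df_pd = pd.DataFrame(
    {"id": [1, 2, 3],
     "firstName": ["Mark", "Tom", "Joshua"],
     "lastName": ["Brown", "Anderson", "Peterson"]}
)

# Cap how many rows/columns pandas renders, keeping large previews readable
pd.set_option("display.max_rows", 10)
pd.set_option("display.max_columns", 20)
print(df_pd)
```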