Pyspark: display a spark data frame in a table format

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original source: http://stackoverflow.com/questions/39067505/
Question by Edamame
I am using pyspark to read a parquet file like below:
my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')
Then when I do my_df.take(5), it will show [Row(...)] instead of a table format like when we use the pandas data frame.
Is it possible to display the data frame in a table format like pandas data frame? Thanks!
Answer by eddies
The show method does what you're looking for.
For example, given the following dataframe of 3 rows, I can print just the first two rows like this:
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)
which yields:
+---+---+
| k| v|
+---+---+
|foo| 1|
|bar| 2|
+---+---+
only showing top 2 rows
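For reference, the grid layout that show() prints can be imitated in a few lines of plain Python. The helper below (show_rows is a hypothetical name, not part of PySpark) is only a sketch of the layout, not Spark's actual implementation:

```python
# A tiny pure-Python sketch of the grid layout df.show() prints.
# show_rows is a hypothetical helper, not part of PySpark.
def show_rows(cols, rows, n=20):
    # Each column is as wide as its longest value (header included)
    widths = [max(len(str(v)) for v in (c, *(r[i] for r in rows)))
              for i, c in enumerate(cols)]
    sep = "+" + "+".join("-" * w for w in widths) + "+"
    def line(vals):
        return "|" + "|".join(str(v).rjust(w) for v, w in zip(vals, widths)) + "|"
    out = [sep, line(cols), sep] + [line(r) for r in rows[:n]] + [sep]
    if n < len(rows):
        out.append(f"only showing top {n} rows")
    return "\n".join(out)

print(show_rows(("k", "v"), [("foo", 1), ("bar", 2), ("baz", 3)], n=2))
```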
Answer by Louis Yang
As mentioned by @Brent in the comment of @maxymoo's answer, you can try

df.limit(10).toPandas()

to get a prettier table in Jupyter. But this can take some time to run if you are not caching the spark dataframe. Also, .limit() will not keep the order of the original spark dataframe.
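Because .limit() gives no ordering guarantee, a common workaround is to sort before taking rows. On the Spark side that would be something like df.orderBy("k").limit(2).toPandas(); below is a minimal pandas-only sketch of the same idea (the column name "k" and the sample data are hypothetical):

```python
import pandas as pd

# Stand-in data; on the Spark side this corresponds to
# df.orderBy("k").limit(2).toPandas()
pdf = pd.DataFrame({"k": ["foo", "bar", "baz"], "v": [1, 2, 3]})

# Sorting first makes the previewed rows deterministic
preview = pdf.sort_values("k").head(2).reset_index(drop=True)
print(preview)
```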
Answer by maxymoo
Yes: call the toPandas method on your dataframe and you'll get an actual pandas dataframe!
Answer by Giorgos Myrianthous
Let's say we have the following Spark DataFrame:
df = sqlContext.createDataFrame([(1, "Mark", "Brown"), (2, "Tom", "Anderson"), (3, "Joshua", "Peterson")], ('id', 'firstName', 'lastName'))
There are typically three different ways to print the content of the dataframe:
Print Spark DataFrame
The most common way is to use the show() function:
>>> df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
Print Spark DataFrame vertically
Say that you have a fairly large number of columns and your dataframe doesn't fit on the screen. You can print the rows vertically instead. For example, the following command will print the top two rows, vertically, without any truncation.
>>> df.show(n=2, truncate=False, vertical=True)
-RECORD 0-------------
id | 1
firstName | Mark
lastName | Brown
-RECORD 1-------------
id | 2
firstName | Tom
lastName | Anderson
only showing top 2 rows
Convert to Pandas and print Pandas DataFrame
Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using .toPandas() and finally print() it.
>>> df_pd = df.toPandas()
>>> print(df_pd)
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
Note that this is not recommended when you have to deal with fairly large dataframes, as Pandas needs to load all the data into memory.
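Once converted, the result is a plain pandas DataFrame, so the usual pandas display options control how it renders. A minimal sketch assuming pandas is installed (the small frame below is a hypothetical stand-in for the result of df.toPandas()):

```python
import pandas as pd

# Hypothetical small frame standing in for the result of df.toPandas()
df_pd = pd.DataFrame(
    {"id": [1, 2, 3],
     "firstName": ["Mark", "Tom", "Joshua"],
     "lastName": ["Brown", "Anderson", "Peterson"]}
)

# Cap how many rows/columns pandas renders, keeping large previews readable
pd.set_option("display.max_rows", 10)
pd.set_option("display.max_columns", 20)
print(df_pd)
```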