Take n rows from a Spark dataframe and pass to toPandas()
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40537782/
Asked by jamiet
I have this code:
l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()
Works fine, does what it needs to. Suppose, though, that I only want to display the first n rows, and then call toPandas() to return a pandas dataframe. How do I do it? I can't call take(n) because that doesn't return a dataframe, and thus I can't pass it to toPandas().
So to put it another way, how can I take the top n rows from a dataframe and call toPandas() on the resulting dataframe? I can't imagine this is difficult, but I can't figure it out.
I'm using Spark 1.6.0.
Answered by Neo
You can use the limit(n) function:
l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.limit(2).withColumn('age2', df.age + 2).toPandas()
Or:
l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).limit(2).toPandas()
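For a row-wise transformation like the age2 column above, it should not matter whether limit(2) comes before or after withColumn — the same two rows come back either way. A minimal sketch of that equivalence, using plain pandas as a local stand-in for the Spark API (this is an illustration, not the Spark calls themselves: assign() plays the role of withColumn() and head() the role of limit()):

```python
import pandas as pd

# Local pandas stand-in for the Spark snippet above: assign() mimics
# withColumn() and head() mimics limit() for a row-wise transformation.
l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = pd.DataFrame(l, columns=['name', 'age'])

# Restrict rows first, then add the derived column ...
limit_first = df.head(2).assign(age2=lambda d: d.age + 2)
# ... or add the column first, then restrict rows.
limit_last = df.assign(age2=lambda d: d.age + 2).head(2)

assert limit_first.equals(limit_last)
```

In Spark the first form has the practical advantage that only the limited rows are collected to the driver by toPandas().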
Answered by Anton Protopopov
You could get the first rows of the Spark DataFrame with head and then create the Pandas DataFrame:
import pandas as pd

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df_pandas = pd.DataFrame(df.head(3), columns=df.columns)
In [4]: df_pandas
Out[4]:
     name  age
0   Alice    1
1     Jim    2
2  Sandra    3
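This works because df.head(n) on a Spark DataFrame returns a list of Row objects, and the pandas DataFrame constructor consumes those like tuples. A small sketch of the same construction, with plain tuples standing in for the Spark Row objects:

```python
import pandas as pd

# Spark's df.head(3) returns a list of Row objects; pandas consumes
# them like tuples. Plain tuples stand in for Rows in this sketch.
rows = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df_pandas = pd.DataFrame(rows, columns=['name', 'age'])
print(df_pandas.shape)  # (3, 2)
```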
Answered by prossblad
Try it:
def showDf(df, count=None, percent=None, maxColumns=0):
    if df is None:  # '== None' would build a Spark Column, not a boolean
        return
    import pandas
    from IPython.display import display
    pandas.set_option('display.encoding', 'UTF-8')
    # Pandas dataframe
    dfp = None
    # maxColumns param
    if maxColumns >= 0:
        if maxColumns == 0:
            maxColumns = len(df.columns)
        pandas.set_option('display.max_columns', maxColumns)
    # count param
    if count is None and percent is None:
        count = 10  # default count
    if count is not None:
        count = int(count)
        if count == 0:
            count = df.count()
        pandas.set_option('display.max_rows', count)
        dfp = pandas.DataFrame(df.head(count), columns=df.columns)
        display(dfp)
    # percent param
    elif percent is not None:
        percent = float(percent)
        if 0.0 <= percent <= 1.0:
            import datetime
            now = datetime.datetime.now()
            seed = int(now.strftime("%H%M%S"))  # int() replaces Python 2's long()
            dfs = df.sample(False, percent, seed)
            count = df.count()
            pandas.set_option('display.max_rows', count)
            dfp = dfs.toPandas()
            display(dfp)
Examples of usages are:
# Shows the ten first rows of the Spark dataframe
showDf(df)
showDf(df, 10)
showDf(df, count=10)
# Shows a random sample which represents 15% of the Spark dataframe
showDf(df, percent=0.15)
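The percent branch of showDf seeds df.sample() from the current time of day. That seed derivation, shown in isolation (with int() in place of the Python 2 long() call, so it runs on Python 3):

```python
import datetime

# Derive a sampling seed from the current HHMMSS time, as showDf does.
now = datetime.datetime.now()
seed = int(now.strftime("%H%M%S"))
```

Any integer would do as a seed; using the time just makes each call sample a different subset.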