Python: take n rows from a Spark DataFrame and pass them to toPandas()

Question

Asked by jamiet

I have this code:

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()

Works fine, does what it needs to. Suppose though that I only want to display the first n rows and then call toPandas() to return a pandas DataFrame. How do I do it? I can't call take(n) because that doesn't return a DataFrame, so I can't pass it to toPandas().

To put it another way, how can I take the top n rows from a DataFrame and call toPandas() on the result? I can't imagine this is difficult, but I can't figure it out.

I'm using Spark 1.6.0.

Answer 1

Answered by Neo

You can use the limit(n) function, which returns a new Spark DataFrame containing at most n rows:

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.limit(2).withColumn('age2', df.age + 2).toPandas()

Or:

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).limit(2).toPandas()
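
Both variants rely on the same idea: limit(n) is a transformation that returns another Spark DataFrame, so toPandas() is still available afterwards. As a minimal sketch, that pattern could be wrapped in a small helper. The name top_n_to_pandas and the check on n are assumptions added for illustration, not part of PySpark:

```python
def top_n_to_pandas(df, n):
    """Return the first n rows of a Spark DataFrame as a pandas DataFrame.

    Hypothetical helper, not part of PySpark. `df` is expected to be a
    pyspark.sql.DataFrame; only its limit() and toPandas() methods are used.
    """
    # Reject anything that is not a plain non-negative integer.
    if isinstance(n, bool) or not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer")
    # limit() returns a new Spark DataFrame (unlike take(n), which returns
    # a list of Row objects), so toPandas() can be chained onto it.
    return df.limit(n).toPandas()
```

With the df from the question, it would be called as top_n_to_pandas(df.withColumn('age2', df.age + 2), 2).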

Answer 2

Answered by Anton Protopopov

You could get the first rows of the Spark DataFrame with head and then create a pandas DataFrame from them:

import pandas as pd

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])

# head(n) returns a list of Row objects, not a Spark DataFrame
df_pandas = pd.DataFrame(df.head(3), columns=df.columns)

In [4]: df_pandas
Out[4]: 
     name  age
0   Alice    1
1     Jim    2
2  Sandra    3
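
This works because head(n) returns a list of Row objects, and a PySpark Row is a tuple subclass, so pandas can read each one positionally. The sketch below shows the same construction without a Spark session. It uses a namedtuple as a stand-in for Row, which is an assumption made only to keep the example self-contained:

```python
from collections import namedtuple

import pandas as pd

# Stand-in for pyspark.sql.Row (assumption for illustration); both are
# tuple subclasses whose fields can be read by position.
Row = namedtuple("Row", ["name", "age"])

# Roughly what df.head(3) would hand back for the example data.
rows = [Row("Alice", 1), Row("Jim", 2), Row("Sandra", 3)]
columns = ["name", "age"]  # plays the role of df.columns

# Keep the first two rows and name the columns explicitly.
df_pandas = pd.DataFrame(rows[:2], columns=columns)
print(df_pandas)
```

Note that head(n) brings the rows to the driver before pandas sees them, so this approach suits a small n.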

Answer 3

Answered by prossblad

Try it:

def showDf(df, count=None, percent=None, maxColumns=0):
    """Display the first `count` rows, or a random `percent` sample, of a Spark DataFrame."""
    if df is None: return
    import pandas
    from IPython.display import display
    pandas.set_option('display.encoding', 'UTF-8')
    # Pandas dataframe
    dfp = None
    # maxColumns param (0 means show all columns)
    if maxColumns >= 0:
        if maxColumns == 0: maxColumns = len(df.columns)
        pandas.set_option('display.max_columns', maxColumns)
    # count param
    if count is None and percent is None: count = 10 # Default count
    if count is not None:
        count = int(count)
        if count == 0: count = df.count() # 0 means show every row
        pandas.set_option('display.max_rows', count)
        dfp = pandas.DataFrame(df.head(count), columns=df.columns)
        display(dfp)
    # percent param (a fraction between 0.0 and 1.0)
    elif percent is not None:
        percent = float(percent)
        if 0.0 <= percent <= 1.0:
            import datetime
            now = datetime.datetime.now()
            # Seed the sample from the current time (int replaces Python 2's long)
            seed = int(now.strftime("%H%M%S"))
            dfs = df.sample(False, percent, seed)
            count = df.count()
            pandas.set_option('display.max_rows', count)
            dfp = dfs.toPandas()
            display(dfp)

Usage examples:

# Shows the ten first rows of the Spark dataframe
showDf(df)
showDf(df, 10)
showDf(df, count=10)

# Shows a random sample which represents 15% of the Spark dataframe
showDf(df, percent=0.15)
