Python: How to select and order multiple columns in a Pyspark Dataframe after a join

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/40467449/


How to select and order multiple columns in a Pyspark Dataframe after a join

python, apache-spark, pyspark, apache-spark-sql

Asked by user3858193

I want to select multiple columns from an existing dataframe (created after joins) and order the fields to match my target table structure. How can this be done? The approach I have used is below. I am able to select the necessary columns, but I am not able to put them in the required sequence.


Required (target table structure):
hist_columns = ("acct_nbr","account_sk_id", "zip_code","primary_state", "eff_start_date" ,"eff_end_date","eff_flag")

from pyspark.sql.functions import broadcast  # needed for broadcast()

account_sk_df = hist_process_df.join(broadcast(df_sk_lkp), 'acct_nbr', 'inner')
account_sk_df_ld = account_sk_df.select([c for c in account_sk_df.columns if c in hist_columns])

>>> account_sk_df
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, hash_sk_id: string, account_sk_id: int]


>>> account_sk_df_ld
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, account_sk_id: int]

The account_sk_id needs to be in 2nd place. What's the best way to do this?

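To see why the selection above loses the target ordering, here is a minimal sketch with hypothetical toy data (the values and the smaller column set are illustrative, not from the original post): select() returns columns in the order they are listed, so a comprehension that iterates over the dataframe's own columns inherits that dataframe's order rather than the order of hist_columns.

# Minimal sketch, hypothetical toy data: iterating df.columns keeps the
# dataframe's column order, so account_sk_id stays last.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A100", "TX", "75001", 1)],
    ["acct_nbr", "primary_state", "zip_code", "account_sk_id"],
)
hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state")

ordered_like_df = df.select([c for c in df.columns if c in hist_columns])
print(ordered_like_df.columns)
# ['acct_nbr', 'primary_state', 'zip_code', 'account_sk_id']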

Answered by Mariusz

Try selecting the columns by just passing the list directly, instead of iterating over the existing columns, and the ordering should be OK:


account_sk_df_ld = account_sk_df.select(*hist_columns)
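As a quick check, here is a minimal sketch with hypothetical toy data (values are illustrative, not from the original post): passing the target column list to select() returns the columns in exactly that order, which puts account_sk_id in second place.

# Minimal sketch, hypothetical toy data: select(*hist_columns) yields the
# columns in the order given by hist_columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

account_sk_df = spark.createDataFrame(
    [("A100", "TX", "75001", "2020-01-01", "9999-12-31", "Y", "h1", 1)],
    ["acct_nbr", "primary_state", "zip_code", "eff_start_date",
     "eff_end_date", "eff_flag", "hash_sk_id", "account_sk_id"],
)
hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state",
                "eff_start_date", "eff_end_date", "eff_flag")

account_sk_df_ld = account_sk_df.select(*hist_columns)
print(account_sk_df_ld.columns)
# ['acct_nbr', 'account_sk_id', 'zip_code', 'primary_state',
#  'eff_start_date', 'eff_end_date', 'eff_flag']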