Python: How to select and order multiple columns in a PySpark DataFrame after a join

Question

Asked by user3858193

I want to select multiple columns from an existing DataFrame (created after joins) and order the fields to match my target table structure. How can this be done? The approach I have used is below. With it I am able to select the required columns, but not to put them in the target sequence.

Required (target table structure):
hist_columns = ("acct_nbr","account_sk_id", "zip_code","primary_state", "eff_start_date" ,"eff_end_date","eff_flag")

from pyspark.sql.functions import broadcast

account_sk_df = hist_process_df.join(broadcast(df_sk_lkp) ,'acct_nbr','inner' )
account_sk_df_ld = account_sk_df.select([c for c in account_sk_df.columns if c in hist_columns])

>>> account_sk_df
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, hash_sk_id: string, account_sk_id: int]


>>> account_sk_df_ld
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, account_sk_id: int]

The account_sk_id column needs to be in second position. What's the best way to do this?

Answer 1

Answered by Mariusz

Try selecting the columns by passing the target list directly instead of iterating over the existing columns. The result then follows the order of the list:

account_sk_df_ld = account_sk_df.select(*hist_columns)
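
The ordering difference comes down to which sequence drives the iteration. The sketch below illustrates this without a Spark session: the plain list joined_columns is a stand-in (an assumption for illustration) for account_sk_df.columns, copied from the output shown in the question, and df_order and target_order are hypothetical names.

```python
# Stand-in for account_sk_df.columns, copied from the question's output.
joined_columns = ["acct_nbr", "primary_state", "zip_code", "eff_start_date",
                  "eff_end_date", "eff_flag", "hash_sk_id", "account_sk_id"]

# Target table structure from the question.
hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state",
                "eff_start_date", "eff_end_date", "eff_flag")

# The question's approach: iterating the DataFrame's columns keeps the
# DataFrame's order, so account_sk_id stays at the end.
df_order = [c for c in joined_columns if c in hist_columns]

# Iterating the target tuple instead keeps the target order and silently
# skips any name that is absent from the DataFrame.
target_order = [c for c in hist_columns if c in joined_columns]

print(df_order)
print(target_order)
```

With a real DataFrame, account_sk_df.select(*target_order) would then produce the columns in target order. Passing hist_columns directly, as in the answer, is simpler, but it raises an error if a listed column is missing from the DataFrame, whereas the filtered variant drops the missing name without warning. Which behavior is preferable depends on whether a missing column should fail the load.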
