Original question: http://stackoverflow.com/questions/36942233/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Apply StringIndexer to several columns in a PySpark Dataframe
Asked by Ivan
I have a PySpark dataframe
+-------+--------------+----+----+
|address| date|name|food|
+-------+--------------+----+----+
|1111111|20151122045510| Yin|gre |
|1111111|20151122045501| Yin|gre |
|1111111|20151122045500| Yln|gra |
|1111112|20151122065832| Yun|ddd |
|1111113|20160101003221| Yan|fdf |
|1111111|20160703045231| Yin|gre |
|1111114|20150419134543| Yin|fdf |
|1111115|20151123174302| Yen|ddd |
|2111115| 20123192| Yen|gre |
+-------+--------------+----+----+
that I want to transform to use with pyspark.ml. I can use a StringIndexer to convert the name column to a numeric category:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
df_ind = indexer.transform(df)
df_ind.show()
+-------+--------------+----+----------+----+
|address| date|name|name_index|food|
+-------+--------------+----+----------+----+
|1111111|20151122045510| Yin| 0.0|gre |
|1111111|20151122045501| Yin| 0.0|gre |
|1111111|20151122045500| Yln| 2.0|gra |
|1111112|20151122065832| Yun| 4.0|ddd |
|1111113|20160101003221| Yan| 3.0|fdf |
|1111111|20160703045231| Yin| 0.0|gre |
|1111114|20150419134543| Yin| 0.0|fdf |
|1111115|20151123174302| Yen| 1.0|ddd |
|2111115| 20123192| Yen| 1.0|gre |
+-------+--------------+----+----------+----+
How can I transform several columns with StringIndexer (for example, name and food, each with its own StringIndexer) and then use VectorAssembler to generate a feature vector? Or do I have to create a StringIndexer for each column?
** EDIT **: This is not a dupe because I need to do this programmatically for several data frames with different column names. I can't use VectorIndexer or VectorAssembler because the columns are not numerical.
** EDIT 2**: A tentative solution is
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df).transform(df) for column in df.columns ]
where I now have a list of dataframes, one per column, each identical to the original plus one transformed column. Now I need to join them to form the final dataframe, but that's very inefficient.
Answered by Ivan
The best way that I've found to do it is to combine several StringIndexer instances in a list and use a Pipeline to execute them all:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df) for column in list(set(df.columns)-set(['date'])) ]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()
+-------+--------------+----+----+----------+----------+-------------+
|address| date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin| 0.0| 0.0| 0.0|
|1111111|20151122045501| gra| Yin| 2.0| 0.0| 0.0|
|1111111|20151122045500| gre| Yln| 0.0| 2.0| 0.0|
|1111112|20151122065832| gre| Yun| 0.0| 4.0| 3.0|
|1111113|20160101003221| gre| Yan| 0.0| 3.0| 1.0|
|1111111|20160703045231| gre| Yin| 0.0| 0.0| 0.0|
|1111114|20150419134543| gre| Yin| 0.0| 0.0| 5.0|
|1111115|20151123174302| ddd| Yen| 1.0| 1.0| 2.0|
|2111115| 20123192| ddd| Yen| 1.0| 1.0| 4.0|
+-------+--------------+----+----+----------+----------+-------------+
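The question also asks how to build a feature vector from the indexed columns. One way to sketch it, assuming the "_index" suffix used above and an output column named "features" (an illustrative choice), is to append a VectorAssembler as the last pipeline stage. The helper names indexed_names and build_feature_pipeline are hypothetical, not part of PySpark; pyspark is imported inside the builder so the naming helper can be checked without a Spark installation.

```python
def indexed_names(columns, exclude=(), suffix="_index"):
    """Map each column to be indexed to the name of its indexed output column."""
    return {c: c + suffix for c in columns if c not in exclude}


def build_feature_pipeline(columns, exclude=(), features_col="features"):
    """Return an unfitted Pipeline: one StringIndexer per column, then a VectorAssembler."""
    # Imported here so that indexed_names() stays usable without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    names = indexed_names(columns, exclude)
    indexers = [
        StringIndexer(inputCol=src, outputCol=dst) for src, dst in names.items()
    ]
    # The indexed columns are numeric (double), so VectorAssembler accepts them.
    assembler = VectorAssembler(inputCols=list(names.values()), outputCol=features_col)
    return Pipeline(stages=indexers + [assembler])


# Usage, assuming `df` is the dataframe from the question:
#   df_features = build_feature_pipeline(df.columns, exclude=["date"]).fit(df).transform(df)
#   df_features.select("features").show()
```

The assembled vector holds the raw category indices; whether they should be one-hot encoded first depends on the model that consumes them.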
Answered by Horbaje
I can offer you the following solution. It is better to use pipelines for this kind of transformation on larger data sets. They also make your code a lot easier to follow and understand. You can add more stages to the pipeline if you need to, for example an encoder.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Create a list of the columns that are string typed
categoricalColumns = [item[0] for item in df.dtypes if item[1].startswith('string')]
# Define a list of stages in your pipeline. The string indexer will be one stage
stages = []
# Iterate through all categorical columns
for categoricalCol in categoricalColumns:
    # Create a string indexer for each categorical column and name its output with the suffix 'Index'
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index')
    # Append the string indexer to our list of stages
    stages += [stringIndexer]
# Create the pipeline. Assign the stages list to the pipeline keyword argument stages
pipeline = Pipeline(stages=stages)
# Fit the pipeline to our dataframe
pipelineModel = pipeline.fit(df)
# Transform the dataframe
df = pipelineModel.transform(df)
Please have a look at my reference
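As a possible shortcut on newer versions: from Spark 3.0 onward, StringIndexer accepts inputCols and outputCols, so a single stage can index every string column. A minimal sketch, assuming Spark 3.0 or later; string_columns and index_string_columns are hypothetical helper names, and the 'Index' suffix follows the answer above.

```python
def string_columns(dtypes):
    """Return the names of string-typed columns from a list shaped like df.dtypes."""
    return [name for name, dtype in dtypes if dtype == "string"]


def index_string_columns(df, suffix="Index"):
    """Index all string columns of df with one multi-column StringIndexer."""
    # Imported here so that string_columns() can be used without Spark installed.
    from pyspark.ml.feature import StringIndexer

    cols = string_columns(df.dtypes)
    if not cols:
        return df  # nothing to index
    indexer = StringIndexer(inputCols=cols, outputCols=[c + suffix for c in cols])
    return indexer.fit(df).transform(df)


# Usage, assuming `df` is the dataframe from the question:
#   df_indexed = index_string_columns(df)
```

On Spark 2.x, where these parameters are not available, the per-column loop shown above remains the way to go.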