Create labeledPoints from a Spark DataFrame in Python (pandas)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow.

Original source: http://stackoverflow.com/questions/32556178/
Create labeledPoints from Spark DataFrame in Python
Asked by user1518003
What .map() function in Python do I use to create a set of labeledPoints from a Spark dataframe? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'?
I create the pandas dataframe with this .map() function:
import pandas as pd

def parsePoint(line):
    # Split the tab-separated line; skip the first field and
    # one-hot encode the remaining values into a single-row frame.
    listmp = list(line.split('\t'))
    dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose()
    # The label column 'status' is copied from the 'accepted' dummy.
    dataframe.insert(0, 'status', dataframe['accepted'])
    # Drop columns that should not become features.
    for column in ['NULL', '', 'rejected', 'accepted']:
        if column in dataframe.columns:
            dataframe = dataframe.drop(column, axis=1)
    return dataframe
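For illustration, here is what the function would return on a hypothetical tab-separated line (the input values are made up):

row = parsePoint("id123\taccepted\tfoo\tbar")
# => a one-row frame with columns ['status', 'bar', 'foo'], all equal to 1:
#    'status' is copied from the 'accepted' dummy, which is then dropped.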
I convert it to a Spark dataframe after a reduce function has recombined all the pandas dataframes.
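A minimal sketch of that recombination step, assuming the raw input is an RDD of tab-separated strings named rawLines (the name and the pd.concat-based reduce are assumptions, not part of the original question):

import pandas as pd

# Hypothetical: rawLines is an RDD of tab-separated input strings.
# Each line becomes a one-row pandas frame; reduce concatenates them all.
parsedData = rawLines.map(parsePoint).reduce(
    lambda left, right: pd.concat([left, right], ignore_index=True))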
parsedData = sqlContext.createDataFrame(parsedData)
But now how do I create labeledPoints from this in Python? I assume it may be another .map() function?
Answered by zero323
If you already have numerical features which require no additional transformations, you can use VectorAssembler to combine the columns containing the independent variables:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["your", "independent", "variables"],
    outputCol="features")

transformed = assembler.transform(parsedData)
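In the asker's setup the label lives in 'status' and every remaining column is a dummy feature, so one way to build inputCols (an assumption, not part of the original answer) is to take all columns except the label:

# Assumption: in parsedData every column except the label 'status' is a feature.
feature_cols = [c for c in parsedData.columns if c != 'status']

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
transformed = assembler.transform(parsedData)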
Next you can simply map:
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

(transformed
    .select(col("outcome_column").alias("label"), col("features"))
    .rdd
    .map(lambda row: LabeledPoint(row.label, row.features)))
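The resulting RDD of LabeledPoints can be fed straight into an mllib learner. A minimal sketch; the variable name points and the choice of LogisticRegressionWithLBFGS are illustrative, not part of the original answer:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Bind the mapped RDD to a name, then train on it.
points = (transformed
    .select(col("outcome_column").alias("label"), col("features"))
    .rdd
    .map(lambda row: LabeledPoint(row.label, row.features)))

model = LogisticRegressionWithLBFGS.train(points)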
As of Spark 2.0 the ml and mllib APIs are no longer compatible, and the latter is headed for deprecation and removal. If you still need this, you'll have to convert ml.Vectors to mllib.Vectors.
from pyspark.mllib import linalg as mllib_linalg
from pyspark.ml import linalg as ml_linalg

def as_old(v):
    # Convert a pyspark.ml vector to its pyspark.mllib equivalent.
    if isinstance(v, ml_linalg.SparseVector):
        return mllib_linalg.SparseVector(v.size, v.indices, v.values)
    if isinstance(v, ml_linalg.DenseVector):
        return mllib_linalg.DenseVector(v.values)
    raise ValueError("Unsupported type {0}".format(type(v)))
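A quick illustrative check of the converter (the sample vector is made up):

v = as_old(ml_linalg.DenseVector([1.0, 2.0]))
print(type(v))  # <class 'pyspark.mllib.linalg.DenseVector'>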
and map as before:

.map(lambda row: LabeledPoint(row.label, as_old(row.features)))

