Create labeledPoints from a Spark DataFrame in Python (pandas)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow.

Original source: http://stackoverflow.com/questions/32556178/
Create labeledPoints from Spark DataFrame in Python
Asked by user1518003
What .map() function in Python do I use to create a set of labeledPoints from a Spark dataframe? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'?
I create the pandas dataframe with this .map() function:
import pandas as pd

def parsePoint(line):
    # Split the tab-separated line; skip the first field and
    # one-hot encode the remaining values into a single-row frame.
    listmp = list(line.split('\t'))
    dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose()
    # The label column 'status' is copied from the 'accepted' dummy.
    dataframe.insert(0, 'status', dataframe['accepted'])
    # Drop columns that should not become features.
    for column in ['NULL', '', 'rejected', 'accepted']:
        if column in dataframe.columns:
            dataframe = dataframe.drop(column, axis=1)
    return dataframe
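For illustration, here is what the function would return on a hypothetical tab-separated line (the input values are made up):

row = parsePoint("id123\taccepted\tfoo\tbar")
# => a one-row frame with columns ['status', 'bar', 'foo'], all equal to 1:
#    'status' is copied from the 'accepted' dummy, which is then dropped.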
I convert it to a Spark dataframe after a reduce function has recombined all the pandas dataframes.
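A minimal sketch of that recombination step, assuming the raw input is an RDD of tab-separated strings named rawLines (the name and the pd.concat-based reduce are assumptions, not part of the original question):

import pandas as pd

# Hypothetical: rawLines is an RDD of tab-separated input strings.
# Each line becomes a one-row pandas frame; reduce concatenates them all.
parsedData = rawLines.map(parsePoint).reduce(
    lambda left, right: pd.concat([left, right], ignore_index=True))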
parsedData = sqlContext.createDataFrame(parsedData)
But now how do I create labeledPoints from this in Python? I assume it may be another .map() function?
Answered by zero323
If you already have numerical features which require no additional transformations, you can use VectorAssembler to combine the columns containing the independent variables:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["your", "independent", "variables"],
    outputCol="features")

transformed = assembler.transform(parsedData)
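In the asker's setup the label lives in 'status' and every remaining column is a dummy feature, so one way to build inputCols (an assumption, not part of the original answer) is to take all columns except the label:

# Assumption: in parsedData every column except the label 'status' is a feature.
feature_cols = [c for c in parsedData.columns if c != 'status']

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
transformed = assembler.transform(parsedData)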
Next you can simply map:
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

(transformed
    .select(col("outcome_column").alias("label"), col("features"))
    .rdd
    .map(lambda row: LabeledPoint(row.label, row.features)))
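The resulting RDD of LabeledPoints can be fed straight into an mllib learner. A minimal sketch; the variable name points and the choice of LogisticRegressionWithLBFGS are illustrative, not part of the original answer:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Bind the mapped RDD to a name, then train on it.
points = (transformed
    .select(col("outcome_column").alias("label"), col("features"))
    .rdd
    .map(lambda row: LabeledPoint(row.label, row.features)))

model = LogisticRegressionWithLBFGS.train(points)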
As of Spark 2.0 the ml and mllib APIs are no longer compatible, and the latter is headed for deprecation and removal. If you still need this, you'll have to convert ml.Vectors to mllib.Vectors.
from pyspark.mllib import linalg as mllib_linalg
from pyspark.ml import linalg as ml_linalg

def as_old(v):
    # Convert a pyspark.ml vector to its pyspark.mllib equivalent.
    if isinstance(v, ml_linalg.SparseVector):
        return mllib_linalg.SparseVector(v.size, v.indices, v.values)
    if isinstance(v, ml_linalg.DenseVector):
        return mllib_linalg.DenseVector(v.values)
    raise ValueError("Unsupported type {0}".format(type(v)))
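A quick illustrative check of the converter (the sample vector is made up):

v = as_old(ml_linalg.DenseVector([1.0, 2.0]))
print(type(v))  # <class 'pyspark.mllib.linalg.DenseVector'>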
and map as before:

.map(lambda row: LabeledPoint(row.label, as_old(row.features)))

