Pyspark .toPandas() results in object column where expected numeric one

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/33481572/

Tags: python, pandas, apache-spark, parquet

Asked by Geoffrey Stoel

I extract data from our data warehouse, store it in a parquet file, and load all the parquet files into a Spark dataframe. So far so good. However, when I try to plot this using the pandas .plot() function, it throws a "TypeError: Empty 'DataFrame': no numeric data to plot".

So I started investigating backwards from my source, and I think the cast to decimal in my initial SQL statement is one of the issues. But I have no clue how to fix it. I thought a fillna(0) would do the trick, but it doesn't.

STEP 1: Define the SQL statement to extract the data

mpr_sql = """
select 
CAST(DATE_KEY  AS INTEGER) AS DATE_KEY ,
CAST(AMD  AS INTEGER) AS AMD ,
CAST(AMD_2  AS DECIMAL(12,2)) AS AMD_2 ,
CAST(AMD_3  AS DECIMAL(12,2)) AS AMD_3 ,
CAST(AMD_4  AS DECIMAL(12,2)) AS AMD_4 ,
CAST(AMD_0  AS DECIMAL(12,2)) AS AMD_0 
"""

STEP 2: Create a spark dataframe from the extracted data

df1 = sqlContext.load(source="jdbc",
                      driver="com.teradata.jdbc.TeraDriver",
                      url=db_url,
                      user=db_user,
                      TMODE="TERA",
                      password=db_pwd,
                      dbtable="( "+sql+") a")
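
For readers on a newer Spark version: sqlContext.load() was deprecated long ago, and the equivalent read goes through the DataFrameReader API. A minimal sketch, assuming the same db_url, db_user, db_pwd and sql variables and a SparkSession named spark (Teradata session options such as TMODE are typically appended to the JDBC URL rather than passed as a keyword):

# Sketch of the same read on Spark 2.x+ (assumes a SparkSession named `spark`)
df1 = (spark.read.format("jdbc")
       .option("driver", "com.teradata.jdbc.TeraDriver")
       .option("url", db_url)  # e.g. "jdbc:teradata://host/TMODE=TERA"
       .option("user", db_user)
       .option("password", db_pwd)
       .option("dbtable", "( " + sql + ") a")
       .load())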

STEP 3: Store the spark dataframe in a parquet file with 10 partitions

df1.coalesce(10).write.parquet("./mpr"+month+"sorted.parquet")
df = sqlContext.read.parquet('./mpr*sorted.parquet')

STEP 4: Look at the spark dataframe schema (it shows decimal(12,2))

df.printSchema()
root
 |-- DATE_KEY: integer (nullable = true)
 |-- AMD:   integer (nullable = true)
 |-- AMD_2: decimal(12,2) (nullable = true)
 |-- AMD_3: decimal(12,2) (nullable = true)
 |-- AMD_4: decimal(12,2) (nullable = true)
 |-- AMD_0: decimal(12,2) (nullable = true)
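
As a quick alternative to printSchema() (not in the original post), df.dtypes returns the Spark-side types as (column, type) pairs, which makes the decimal columns easy to spot:

df.dtypes
# [('DATE_KEY', 'int'), ('AMD', 'int'), ('AMD_2', 'decimal(12,2)'), ...]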

STEP 5: Convert the spark dataframe into a pandas dataframe and replace any nulls with 0 (using fillna(0))

pdf=df.fillna(0).toPandas()

STEP 6: Look at the pandas dataframe info for the relevant columns. AMD is correct (integer), but AMD_4 is of type object, where I expected a double or float or something like that (sorry, I always forget the right type). And since AMD_4 is a non-numeric type, I cannot use it for plotting.

pdf[['AMD','AMD_4']].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 20140101 to 20150801
Data columns (total 2 columns):
AMD         20 non-null int64
AMD_4       20 non-null object
dtypes: int64(1), object(1)
memory usage: 480.0+ bytes
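
A quick check (assuming the pdf dataframe above) of what the object column actually holds: Spark's decimal values come back from toPandas() as Python decimal.Decimal objects, which pandas can only store in an object column:

type(pdf['AMD_4'].iloc[0])
# <class 'decimal.Decimal'>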

So my questions are:

  1. Why is AMD_4 (and the other AMD_x columns not shown here) of type object, while AMD is of type int64?
  2. Or, in other words: how can I give the AMD_x columns a float/double/decimal kind of type?

Answered by WoodChopper

First, check pdf.isnull().sum():
1. It should be all zeros. If, for some reason, some column's count returns na or nan, you can always use pandas fillna():

pdf = df.fillna(0).toPandas()
pdf = pdf.fillna(0)

or

pdf = df.toPandas().fillna(0)

2. If all are zeros, then check where the type mismatch is with:

# True where a cell holds a plain Python int/float, False otherwise (e.g. for Decimal)
pdf.applymap(lambda x: isinstance(x, (int, float)))

And correct it accordingly (one way to do this is sketched below).

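For illustration, a minimal sketch of that check-and-fix on the pandas side (bad_cols is just an illustrative name, and pd.to_numeric is one way to coerce the offending columns):

import pandas as pd

# boolean mask: True where a cell is a plain int/float
mask = pdf.applymap(lambda x: isinstance(x, (int, float)))

# columns containing at least one non-numeric cell
bad_cols = mask.columns[~mask.all()]

# coerce those columns to numeric; unparseable values become NaN
for c in bad_cols:
    pdf[c] = pd.to_numeric(pdf[c], errors='coerce')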

Answered by Gary Liu

I had the same problem, and then I figured out the reason.

During the conversion there is a coalescing of data types, such as int/long -> int64, double -> float64, string -> object. Any unknown data type is converted to the object type.

In a pandas DataFrame there is no decimal data type, so all columns of decimal data type are converted to the object type.

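You can reproduce this in plain pandas: a Series of Python decimal.Decimal values gets the object dtype, and only becomes float64 after an explicit conversion (a small self-contained demo, not from the original answer):

import decimal
import pandas as pd

s = pd.Series([decimal.Decimal('1.25'), decimal.Decimal('2.50')])
print(s.dtype)                 # object
print(pd.to_numeric(s).dtype)  # float64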

If you convert all decimal data types to double before applying toPandas(), you will have all your numerical data ready to use.

from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# cast the decimal columns to double before converting to pandas
df = df.withColumn('AMD_4', col('AMD_4').cast(DoubleType())) \
       .withColumn('AMD_2', col('AMD_2').cast(DoubleType()))
pdf = df.toPandas()
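
If there are many decimal columns, you don't have to list them one by one. A small sketch (not from the original answer) that walks the schema and casts every DecimalType column to double before the conversion:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, DoubleType

# cast every decimal(p,s) column to double, then convert
for field in df.schema.fields:
    if isinstance(field.dataType, DecimalType):
        df = df.withColumn(field.name, col(field.name).cast(DoubleType()))
pdf = df.toPandas()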

In pdf, AMD_4 and AMD_2 will now be numerical (float64).
