
Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31719575/


inferSchema in spark-csv package

Tags: scala, apache-spark, apache-spark-sql, spark-csv

Asked by sag

When a CSV file is read as a DataFrame in Spark, all the columns are read as strings. Is there any way to get the actual type of each column?

I have the following CSV file:

Name,Department,years_of_experience,DOB
Sam,Software,5,1990-10-10
Alex,Data Analytics,3,1992-10-10

I've read the CSV using the code below:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(sampleAdDataS3Location)
df.schema

All the columns are read as strings. I expect the column years_of_experience to be read as int and DOB to be read as date.

Please note that I've set the option inferSchema to true.

I am using the latest version (1.0.3) of the spark-csv package.

Am I missing something here?

Answer by zero323

2015-07-30

The latest version is actually 1.1.0, but it doesn't really matter, since it looks like inferSchema is not included in the latest release.

2015-08-17

The latest version of the package is now 1.2.0 (published on 2015-08-06) and schema inference works as expected:

scala> df.printSchema
root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- DOB: string (nullable = true)
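
For reference, the output above can be reproduced roughly as follows (a sketch, not part of the original answer; the --packages coordinate is the package's published Maven coordinate, and the file path is a stand-in for the sample CSV above):

// Start the shell with the newer artifact on the classpath, e.g.:
//   spark-shell --packages com.databricks:spark-csv_2.10:1.2.0

// Then re-read the sample file with the same options as before
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("people.csv") // hypothetical path to the sample CSV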

Regarding automatic date parsing, I doubt it will ever happen, or at least not without additional metadata being provided.

Even if all fields follow some date-like format, it is impossible to say whether a given field should be interpreted as a date. So the choice is between no automatic date inference at all and a spreadsheet-like mess, not to mention issues with time zones, for example.

Finally, you can easily parse the date string manually:

// Register the DataFrame as a temporary table so it can be queried by name
df.registerTempTable("df")

sqlContext
  .sql("SELECT *, DATE(dob) AS dob_d FROM df")
  .drop("DOB")
  .printSchema

root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- dob_d: date (nullable = true)
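
Equivalently, the same conversion can be done with the DataFrame API (a minimal sketch, assuming Spark 1.5+, where org.apache.spark.sql.functions.to_date is available; this is not part of the original answer):

import org.apache.spark.sql.functions.to_date

// Parse the DOB string column into a proper date column, then drop the original
val withDate = df
  .withColumn("dob_d", to_date(df("DOB")))
  .drop("DOB")

withDate.printSchema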

So it is really not a serious issue.

2017-12-20

The built-in CSV parser available since Spark 2.0 supports schema inference for dates and timestamps. It uses two options, illustrated in the sketch after this list:

  • timestampFormat with default yyyy-MM-dd'T'HH:mm:ss.SSSXXX
  • dateFormat with default yyyy-MM-dd
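
A minimal sketch of these options with the built-in reader (assuming Spark 2.x, a SparkSession named spark, and a hypothetical file path, none of which appear in the original answer):

// The defaults are spelled out explicitly here for illustration
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
  .csv("people.csv") // hypothetical path

df.printSchema // per the answer above, DOB can now be inferred as a date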

See also How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?