
Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31719575/


inferSchema in spark-csv package

Tags: scala, apache-spark, apache-spark-sql, spark-csv

Asked by sag

When a CSV file is read as a DataFrame in Spark, all the columns are read as strings. Is there any way to get the actual type of each column?

I have the following CSV file:

Name,Department,years_of_experience,DOB
Sam,Software,5,1990-10-10
Alex,Data Analytics,3,1992-10-10

I've read the CSV using the code below:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(sampleAdDataS3Location)
df.schema

All the columns are read as strings. I expect the column years_of_experience to be read as int and DOB to be read as date.

Please note that I've set the option inferSchema to true.

I am using the latest version (1.0.3) of the spark-csv package.

Am I missing something here?

Answer by zero323

2015-07-30

The latest version is actually 1.1.0, but it doesn't really matter, since it looks like inferSchema is not included in the latest release.

2015-08-17

The latest version of the package is now 1.2.0 (published on 2015-08-06) and schema inference works as expected:

scala> df.printSchema
root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- DOB: string (nullable = true)
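
For reference, the output above can be reproduced roughly as follows (a sketch, not part of the original answer; the --packages coordinate is the package's published Maven coordinate, and the file path is a stand-in for the sample CSV above):

// Start the shell with the newer artifact on the classpath, e.g.:
//   spark-shell --packages com.databricks:spark-csv_2.10:1.2.0

// Then re-read the sample file with the same options as before
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("people.csv") // hypothetical path to the sample CSV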

Regarding automatic date parsing, I doubt it will ever happen, or at least not without additional metadata being provided.

Even if all fields follow some date-like format, it is impossible to say whether a given field should be interpreted as a date. So the choice is between no automatic date inference at all and a spreadsheet-like mess, not to mention issues with time zones, for example.

Finally, you can easily parse the date string manually:

// Register the DataFrame as a temporary table so it can be queried by name
df.registerTempTable("df")

sqlContext
  .sql("SELECT *, DATE(dob) AS dob_d FROM df")
  .drop("DOB")
  .printSchema

root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- dob_d: date (nullable = true)
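
Equivalently, the same conversion can be done with the DataFrame API (a minimal sketch, assuming Spark 1.5+, where org.apache.spark.sql.functions.to_date is available; this is not part of the original answer):

import org.apache.spark.sql.functions.to_date

// Parse the DOB string column into a proper date column, then drop the original
val withDate = df
  .withColumn("dob_d", to_date(df("DOB")))
  .drop("DOB")

withDate.printSchema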

So it is really not a serious issue.

2017-12-20

The built-in CSV parser available since Spark 2.0 supports schema inference for dates and timestamps. It uses two options, illustrated in the sketch after this list:

  • timestampFormat with default yyyy-MM-dd'T'HH:mm:ss.SSSXXX
  • dateFormat with default yyyy-MM-dd
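
A minimal sketch of these options with the built-in reader (assuming Spark 2.x, a SparkSession named spark, and a hypothetical file path, none of which appear in the original answer):

// The defaults are spelled out explicitly here for illustration
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
  .csv("people.csv") // hypothetical path

df.printSchema // per the answer above, DOB can now be inferred as a date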

See also How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?