Scala - How do I skip the header of CSV files in Spark?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/27854919/
How do I skip a header from CSV files in Spark?
Asked by Hafiz Mujadid
Suppose I give three file paths to a Spark context to read, and each file has a schema (header) in its first row. How can we skip the schema lines from the headers?
val rdd = sc.textFile("file1,file2,file3")
Now, how can we skip header lines from this rdd?
Accepted answer by Sean Owen
If there were just one header line in the first record, then the most efficient way to filter it out would be:
rdd.mapPartitionsWithIndex {
(idx, iter) => if (idx == 0) iter.drop(1) else iter
}
Of course, this doesn't help if there are many files, each with its own header line inside. You can indeed union three RDDs built this way.
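For instance, here is a minimal sketch of the "union three RDDs" idea, assuming an existing SparkContext sc and three hypothetical paths file1.csv, file2.csv, file3.csv:

// Drop the first line of the first partition only (i.e. the header of each file).
def dropHeader(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
  }

val files = Seq("file1.csv", "file2.csv", "file3.csv")   // hypothetical paths
val combined = files
  .map(path => dropHeader(sc.textFile(path)))            // read each file and strip its header
  .reduce(_ union _)                                     // then union the per-file RDDs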
You could also just write a filter that matches only a line that could be a header. This is quite simple, but less efficient.
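A minimal sketch of that filter approach, assuming the header is the only line that starts with the (hypothetical) first column name "id":

val rdd = sc.textFile("file1,file2,file3")
val noHeader = rdd.filter(line => !line.startsWith("id"))   // drop header-like lines everywhere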
Python equivalent:
from itertools import islice
rdd.mapPartitionsWithIndex(
lambda idx, it: islice(it, 1, None) if idx == 0 else it
)
Answered by Jimmy
data = sc.textFile('path_to_data')
header = data.first()  # extract header
data = data.filter(lambda row: row != header)  # filter out header
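A Scala equivalent of the same first()/filter idea, assuming an existing SparkContext sc and a hypothetical path:

val data = sc.textFile("path_to_data")
val header = data.first()                     // extract the header line
val rows = data.filter(row => row != header)  // drop every line equal to the header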
Answered by Sandeep Purohit
In Spark 2.0 a CSV reader is built into Spark, so you can easily load a CSV file as follows:
spark.read.option("header","true").csv("filePath")
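For example, a short sketch of using the built-in reader, assuming a SparkSession named spark and a hypothetical file people.csv:

// The built-in reader skips the header and can optionally infer column types.
val df = spark.read
  .option("header", "true")       // treat the first line of each file as a header
  .option("inferSchema", "true")  // optional: infer column types from the data
  .csv("people.csv")              // hypothetical path

df.printSchema()
df.show(5)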
Answered by Shiv4nsh
From Spark 2.0 onwards, you can use SparkSession to get this done as a one-liner:
val spark = SparkSession.builder.config(conf).getOrCreate()
and then as @SandeepPurohit said:
val dataFrame = spark.read.format("CSV").option("header","true").load(csvfilePath)
I hope this solves your question!
P.S.: SparkSession is the new entry point introduced in Spark 2.0 and can be found in the org.apache.spark.sql package (the spark-sql module).
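Putting the two snippets together, a minimal sketch (the appName, master, and csvfilePath values here are placeholders):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession, then read the CSV with its header.
val spark = SparkSession.builder
  .appName("skip-csv-header")   // placeholder application name
  .master("local[*]")           // placeholder master; drop this when submitting to a cluster
  .getOrCreate()

val csvfilePath = "file1.csv"   // placeholder path
val dataFrame = spark.read.format("csv").option("header", "true").load(csvfilePath)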
Answered by hayj
In PySpark you can use a DataFrame and set header to True:
df = spark.read.csv(dataPath, header=True)
Answered by pzecevic
You could load each file separately, filter each one with file.zipWithIndex().filter(_._2 > 0), and then union all the file RDDs.
If the number of files is too large, the union could throw a StackOverflowError.
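A minimal sketch of that per-file approach, assuming hypothetical paths and an existing SparkContext sc. Note that zipWithIndex produces (line, index) pairs, so the lines are mapped back out afterwards:

val files = Seq("file1.csv", "file2.csv", "file3.csv")   // hypothetical paths
val combined = files
  .map { path =>
    sc.textFile(path)
      .zipWithIndex()       // (line, index) pairs, indexed per file
      .filter(_._2 > 0)     // keep everything except the first line
      .map(_._1)            // back to plain lines
  }
  .reduce(_ union _)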
Answered by Antonio Cachuan
Working in 2018 (Spark 2.3)
Python
df = (spark.read
    .option("header", "true")
    .format("csv")
    .schema(myManualSchema)
    .load("mycsv.csv"))
Scala
val myDf = spark.read
.option("header", "true")
.format("csv")
.schema(myManualSchema)
.load("mycsv.csv")
PD1: myManualSchema is a predefined schema written by me; you could skip that part of the code.
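For reference, a hypothetical myManualSchema might look like this (the column names and types here are invented for illustration):

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical schema; replace the fields with your actual column names and types.
val myManualSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("lat", DoubleType, nullable = true),
  StructField("lng", DoubleType, nullable = true)
))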
Answered by kumara81205
Use the filter() method in PySpark to remove the header by filtering out lines that start with the first column name:
# Read file (change format for other file formats)
contentRDD = sc.textFile(<filepath>)
# Filter out the header line by the first column name
filterDD = contentRDD.filter(lambda l: not l.startswith(<first column name>))
# Check your result
for i in filterDD.take(5): print(i)
Answered by Sahan Jayasumana
It's an option that you pass to the read() command:
val context = new org.apache.spark.sql.SQLContext(sc)
val data = context.read.option("header","true").csv("<path>")
Answered by Adrian Bridgett
Alternatively, you can use the spark-csv package (or in Spark 2.0 this is more or less available natively as CSV). Note that this expects the header on each file (as you desire):
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([
    StructField('lat', DoubleType(), True),
    StructField('lng', DoubleType(), True)])

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true',
             delimiter="\t",
             treatEmptyValuesAsNulls=True,
             mode="DROPMALFORMED") \
    .load(input_file, schema=schema)

