Scala - How do I skip the header of CSV files in Spark?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/27854919/
How do I skip a header from CSV files in Spark?
Asked by Hafiz Mujadid
Suppose I give three file paths to a Spark context to read, and each file has a schema (header) in its first row. How can we skip the schema lines from the headers?
val rdd = sc.textFile("file1,file2,file3")
Now, how can we skip header lines from this rdd?
Accepted answer by Sean Owen
If there were just one header line in the first record, then the most efficient way to filter it out would be:
rdd.mapPartitionsWithIndex {
(idx, iter) => if (idx == 0) iter.drop(1) else iter
}
Of course, this doesn't help if there are many files, each with its own header line inside. You can indeed union three RDDs built this way.
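For instance, here is a minimal sketch of the "union three RDDs" idea, assuming an existing SparkContext sc and three hypothetical paths file1.csv, file2.csv, file3.csv:

// Drop the first line of the first partition only (i.e. the header of each file).
def dropHeader(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
  }

val files = Seq("file1.csv", "file2.csv", "file3.csv")   // hypothetical paths
val combined = files
  .map(path => dropHeader(sc.textFile(path)))            // read each file and strip its header
  .reduce(_ union _)                                     // then union the per-file RDDs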
You could also just write a filter that matches only a line that could be a header. This is quite simple, but less efficient.
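A minimal sketch of that filter approach, assuming the header is the only line that starts with the (hypothetical) first column name "id":

val rdd = sc.textFile("file1,file2,file3")
val noHeader = rdd.filter(line => !line.startsWith("id"))   // drop header-like lines everywhere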
Python equivalent:
from itertools import islice
rdd.mapPartitionsWithIndex(
lambda idx, it: islice(it, 1, None) if idx == 0 else it
)
Answered by Jimmy
data = sc.textFile('path_to_data')
header = data.first()  # extract header
data = data.filter(lambda row: row != header)  # filter out header
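A Scala equivalent of the same first()/filter idea, assuming an existing SparkContext sc and a hypothetical path:

val data = sc.textFile("path_to_data")
val header = data.first()                     // extract the header line
val rows = data.filter(row => row != header)  // drop every line equal to the header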
Answered by Sandeep Purohit
In Spark 2.0 a CSV reader is built into Spark, so you can easily load a CSV file as follows:
spark.read.option("header","true").csv("filePath")
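For example, a short sketch of using the built-in reader, assuming a SparkSession named spark and a hypothetical file people.csv:

// The built-in reader skips the header and can optionally infer column types.
val df = spark.read
  .option("header", "true")       // treat the first line of each file as a header
  .option("inferSchema", "true")  // optional: infer column types from the data
  .csv("people.csv")              // hypothetical path

df.printSchema()
df.show(5)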
Answered by Shiv4nsh
From Spark 2.0 onwards, you can use SparkSession to get this done as a one-liner:
val spark = SparkSession.builder.config(conf).getOrCreate()
and then as @SandeepPurohit said:
val dataFrame = spark.read.format("CSV").option("header","true").load(csvfilePath)
I hope this solves your question!
P.S.: SparkSession is the new entry point introduced in Spark 2.0 and can be found in the org.apache.spark.sql package (the spark-sql module).
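Putting the two snippets together, a minimal sketch (the appName, master, and csvfilePath values here are placeholders):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession, then read the CSV with its header.
val spark = SparkSession.builder
  .appName("skip-csv-header")   // placeholder application name
  .master("local[*]")           // placeholder master; drop this when submitting to a cluster
  .getOrCreate()

val csvfilePath = "file1.csv"   // placeholder path
val dataFrame = spark.read.format("csv").option("header", "true").load(csvfilePath)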
Answered by hayj
In PySpark you can use a DataFrame and set header to True:
df = spark.read.csv(dataPath, header=True)
Answered by pzecevic
You could load each file separately, filter each one with file.zipWithIndex().filter(_._2 > 0), and then union all the file RDDs.
If the number of files is too large, the union could throw a StackOverflowError.
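A minimal sketch of that per-file approach, assuming hypothetical paths and an existing SparkContext sc. Note that zipWithIndex produces (line, index) pairs, so the lines are mapped back out afterwards:

val files = Seq("file1.csv", "file2.csv", "file3.csv")   // hypothetical paths
val combined = files
  .map { path =>
    sc.textFile(path)
      .zipWithIndex()       // (line, index) pairs, indexed per file
      .filter(_._2 > 0)     // keep everything except the first line
      .map(_._1)            // back to plain lines
  }
  .reduce(_ union _)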
Answered by Antonio Cachuan
Working in 2018 (Spark 2.3)
Python
df = (spark.read
    .option("header", "true")
    .format("csv")
    .schema(myManualSchema)
    .load("mycsv.csv"))
Scala
val myDf = spark.read
.option("header", "true")
.format("csv")
.schema(myManualSchema)
.load("mycsv.csv")
PD1: myManualSchema is a predefined schema written by me; you could skip that part of the code.
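For reference, a hypothetical myManualSchema might look like this (the column names and types here are invented for illustration):

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical schema; replace the fields with your actual column names and types.
val myManualSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("lat", DoubleType, nullable = true),
  StructField("lng", DoubleType, nullable = true)
))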
Answered by kumara81205
Use the filter() method in PySpark to remove the header by filtering out lines that start with the first column name:
# Read file (change format for other file formats)
contentRDD = sc.textFile(<filepath>)
# Filter out the header line by the first column name
filterDD = contentRDD.filter(lambda l: not l.startswith(<first column name>))
# Check your result
for i in filterDD.take(5): print(i)
Answered by Sahan Jayasumana
It's an option that you pass to the read() command:
val context = new org.apache.spark.sql.SQLContext(sc)
val data = context.read.option("header","true").csv("<path>")
Answered by Adrian Bridgett
Alternatively, you can use the spark-csv package (or in Spark 2.0 this is more or less available natively as CSV). Note that this expects the header on each file (as you desire):
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([
    StructField('lat', DoubleType(), True),
    StructField('lng', DoubleType(), True)])

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true',
             delimiter="\t",
             treatEmptyValuesAsNulls=True,
             mode="DROPMALFORMED") \
    .load(input_file, schema=schema)

