Codability using RDD and DataFrame
With the Spark 2.0 release, there are three types of data abstractions:
1. RDD
2. DataFrame (Untyped API)
3. DataSet (Typed API)
RDD (Spark 1.0) → DataFrame (Spark 1.3) → Dataset (Spark 1.6)
RDD is the oldest abstraction, available since version 1.0, while Dataset is the newest, available since version 1.6. Given the same data, each of the three abstractions will compute and return the same results to the user, but they differ in performance and codability.
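As a quick orientation before the use case, here is a minimal sketch showing the same rows represented in each abstraction. It assumes spark-shell (where the SparkSession `spark` is predefined) and uses a made-up two-column Employee example:

// A minimal sketch; assumes spark-shell, where `spark` is predefined.
import spark.implicits._

// Hypothetical example rows: (designation, salary)
case class Employee(designation: String, salary: Int)
val rows = Seq(Employee("engineer", 40000), Employee("manager", 60000))

val asRDD = spark.sparkContext.parallelize(rows)  // RDD[Employee]
val asDF  = rows.toDF()                           // DataFrame (untyped, rows of Row)
val asDS  = rows.toDS()                           // Dataset[Employee] (typed)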
In this part of the tutorial we will take up a use case and solve it with both the RDD and the DataFrame API.
Use Case: Given the dataset, find the average salary for each designation.
RDD Approach
// Parse one "|"-separated record into (designation, salary).
// Assumes designation is field 3 and salary is field 4 (0-based).
def parseRecords(str: String): (String, Int) = {
  val fields = str.split("\\|")
  (fields(3), fields(4).toInt)
}

val rdd = spark.sparkContext.textFile("./dataset/u.user")
val header = rdd.first()                              // drop the header line
val userRDD = rdd.filter(record => record != header)

// Accumulate (salarySum, count) per designation.
val reduceAction = userRDD.map(record => parseRecords(record)).
  map(e => (e._1, (e._2, 1))).
  reduceByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2))

// Divide sum by count; toDouble avoids truncating integer division.
reduceAction.map(element => (element._1, element._2._1.toDouble / element._2._2)).
  collect.foreach(e => println(e))
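The (sum, count) accumulation above is a common RDD pattern. As an aside, the same computation can be sketched with aggregateByKey (this variant is not from the original tutorial), which skips building the intermediate (salary, 1) pairs:

// Same result via aggregateByKey: the zero value (0, 0) carries
// (runningSum, runningCount) for each designation.
val averages = userRDD.
  map(parseRecords).
  aggregateByKey((0, 0))(
    (acc, salary) => (acc._1 + salary, acc._2 + 1),  // fold one salary into the accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2)             // merge accumulators across partitions
  ).
  mapValues { case (sum, count) => sum.toDouble / count }

averages.collect.foreach(println)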
DataFrame Approach
import org.apache.spark.sql.functions.{avg, col}

val userDF = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  option("sep", "|").
  csv("./dataset/u.user")

userDF.groupBy(col("designation")).agg(avg(col("salary"))).show()
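One note on the options: inferSchema makes Spark read the file an extra time to guess column types. If that matters, the schema can be declared up front. In this sketch, the column names other than designation and salary are assumptions for illustration:

import org.apache.spark.sql.types._

// Hypothetical layout matching fields(3) = designation, fields(4) = salary.
val userSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("designation", StringType),
  StructField("salary", IntegerType)
))

val userDFTyped = spark.read.
  option("header", "true").
  option("sep", "|").
  schema(userSchema).
  csv("./dataset/u.user")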
As you can see, the same use case can be achieved with the DataFrame API in far fewer lines and with much better readability.
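For completeness, the third abstraction listed at the top, the typed Dataset API, can express the same query. This is a sketch rather than part of the original tutorial; the User case class is a hypothetical mapping of the two columns the query needs:

import org.apache.spark.sql.functions.avg
import spark.implicits._

// Hypothetical case class for the columns used in the query.
case class User(designation: String, salary: Int)

val userDS = userDF.select("designation", "salary").as[User]  // Dataset[User]

userDS.
  groupByKey(_.designation).        // typed grouping by a case-class field
  agg(avg($"salary").as[Double]).   // typed aggregation column
  show()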