Codability using RDD and DataFrame

--

With the Spark 2.0 release, there are 3 types of data abstractions:
1. RDD
2. DataFrame (untyped API)
3. Dataset (typed API)

RDD (Spark 1.0) → DataFrame (Spark 1.3) → Dataset (Spark 1.6)
RDD is the oldest abstraction, available since version 1.0, and Dataset is the newest, introduced in version 1.6. Given the same data, each of the 3 abstractions will compute and return the same results to the user. They differ, however, in performance and codability.

In this part of the tutorial we will take up a use case and solve it using both RDD and DataFrame.

Use Case: Given the dataset, find the average salary for each designation.

RDD Approach

// Parse a pipe-delimited record into (designation, salary)
def parseRecords(str: String): (String, Int) = {
  val fields = str.split("\\|")
  (fields(3), fields(4).toInt)
}

val rdd = spark.sparkContext.textFile("./dataset/u.user")

// Drop the header line
val header = rdd.first()
val userRDD = rdd.filter(record => record != header)

// Pair each salary with a count of 1, then sum both per designation
val reduceAction = userRDD.map(parseRecords).
  map(e => (e._1, (e._2, 1))).
  reduceByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2))

// Divide the salary total by the count to get the average per designation
reduceAction.map(element => (element._1, element._2._1 / element._2._2)).
  collect.foreach(println)
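The (sum, count) pairing above is the standard way to compute an average with reduceByKey, since averages themselves are not associative and cannot be reduced directly. The same logic can be checked on a plain Scala collection, no Spark required; the sample records below are made up for illustration:

```scala
// Made-up sample records in the same (designation, salary) shape
// that parseRecords produces.
val records = List(
  ("engineer", 100), ("engineer", 200),
  ("manager", 300)
)

// Same pattern as the RDD job: pair each salary with a count of 1,
// then sum both components per key.
val sumsAndCounts = records
  .map { case (k, v) => (k, (v, 1)) }
  .groupBy(_._1)
  .map { case (k, vs) =>
    (k, vs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2)))
  }

// Divide the total by the count to get the average per designation.
val averages = sumsAndCounts.map { case (k, (sum, count)) => (k, sum / count) }

println(averages) // engineer -> 150, manager -> 300
```

The intermediate count is what allows the merge step to stay associative, which is exactly the property reduceByKey relies on when combining partial results across partitions.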

DataFrame Approach

import org.apache.spark.sql.functions.{avg, col}

val userDF = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  option("sep", "|").
  csv("./dataset/u.user")

userDF.groupBy(col("designation")).agg(avg(col("salary"))).show()

As you can see, the same use case can be achieved with far more readability using a DataFrame.

--


Written by Naveen - (Founder & Trainer @ NPN Training)

A software training institute which believes that technology has to be learnt under experienced practitioners — www.npntraining.com
