Codability using RDD and DataFrame
With the Spark 2.0 release, there are three types of data abstractions:
1. RDD
2. DataFrame (Untyped API)
3. DataSet (Typed API)
RDD (Spark 1.0) → DataFrame (Spark 1.3) → Dataset (Spark 1.6)
RDD is the oldest abstraction, available since version 1.0, while Dataset is the newest, available since version 1.6. Given the same data, each of the three abstractions will compute and return the same results to the user, but they differ in performance and codability.
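As a quick orientation before the use case, here is a minimal sketch showing the same rows represented in each abstraction. It assumes spark-shell (where the SparkSession `spark` is predefined) and uses a made-up two-column Employee example:

// A minimal sketch; assumes spark-shell, where `spark` is predefined.
import spark.implicits._

// Hypothetical example rows: (designation, salary)
case class Employee(designation: String, salary: Int)
val rows = Seq(Employee("engineer", 40000), Employee("manager", 60000))

val asRDD = spark.sparkContext.parallelize(rows)  // RDD[Employee]
val asDF  = rows.toDF()                           // DataFrame (untyped, rows of Row)
val asDS  = rows.toDS()                           // Dataset[Employee] (typed)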
In this part of the tutorial we will take up a use case and solve it with both the RDD and the DataFrame API.
Use Case: Given the dataset, find the average salary for each designation.
RDD Approach
// Parse one "|"-separated record into (designation, salary).
// Assumes designation is field 3 and salary is field 4 (0-based).
def parseRecords(str: String): (String, Int) = {
  val fields = str.split("\\|")
  (fields(3), fields(4).toInt)
}

val rdd = spark.sparkContext.textFile("./dataset/u.user")
val header = rdd.first()                              // drop the header line
val userRDD = rdd.filter(record => record != header)

// Accumulate (salarySum, count) per designation.
val reduceAction = userRDD.map(record => parseRecords(record)).
  map(e => (e._1, (e._2, 1))).
  reduceByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2))

// Divide sum by count; toDouble avoids truncating integer division.
reduceAction.map(element => (element._1, element._2._1.toDouble / element._2._2)).
  collect.foreach(e => println(e))
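The (sum, count) accumulation above is a common RDD pattern. As an aside, the same computation can be sketched with aggregateByKey (this variant is not from the original tutorial), which skips building the intermediate (salary, 1) pairs:

// Same result via aggregateByKey: the zero value (0, 0) carries
// (runningSum, runningCount) for each designation.
val averages = userRDD.
  map(parseRecords).
  aggregateByKey((0, 0))(
    (acc, salary) => (acc._1 + salary, acc._2 + 1),  // fold one salary into the accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2)             // merge accumulators across partitions
  ).
  mapValues { case (sum, count) => sum.toDouble / count }

averages.collect.foreach(println)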
DataFrame Approach
import org.apache.spark.sql.functions.{avg, col}

val userDF = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  option("sep", "|").
  csv("./dataset/u.user")

userDF.groupBy(col("designation")).agg(avg(col("salary"))).show()
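One note on the options: inferSchema makes Spark read the file an extra time to guess column types. If that matters, the schema can be declared up front. In this sketch, the column names other than designation and salary are assumptions for illustration:

import org.apache.spark.sql.types._

// Hypothetical layout matching fields(3) = designation, fields(4) = salary.
val userSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("designation", StringType),
  StructField("salary", IntegerType)
))

val userDFTyped = spark.read.
  option("header", "true").
  option("sep", "|").
  schema(userSchema).
  csv("./dataset/u.user")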
As you can see, the same use case can be achieved with the DataFrame API in far fewer lines and with much better readability.
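For completeness, the third abstraction listed at the top, the typed Dataset API, can express the same query. This is a sketch rather than part of the original tutorial; the User case class is a hypothetical mapping of the two columns the query needs:

import org.apache.spark.sql.functions.avg
import spark.implicits._

// Hypothetical case class for the columns used in the query.
case class User(designation: String, salary: Int)

val userDS = userDF.select("designation", "salary").as[User]  // Dataset[User]

userDS.
  groupByKey(_.designation).        // typed grouping by a case-class field
  agg(avg($"salary").as[Double]).   // typed aggregation column
  show()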