SparkContext & SparkSession

Prior to Spark 2.0.0

SparkContext was used as the channel to access all Spark functionality.

  • The Spark driver program uses the SparkContext to connect to the cluster through a resource manager (such as YARN or Mesos).
  • A SparkConf object is required to create the SparkContext; it stores configuration parameters such as appName (to identify your Spark application and driver), the number of cores, and the memory size of the executors running on the worker nodes.
  • In order to use the SQL, Hive, and Streaming APIs, separate contexts had to be created, for example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf()
val sc = new SparkContext(conf)
val hc = new HiveContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
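
For readers following along in Python, a rough PySpark equivalent of the snippet above could look like this (a minimal sketch assuming a pre-2.0 PySpark environment; the app name is just a placeholder):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setMaster("local").setAppName("Pre20Demo")
sc = SparkContext(conf=conf)     # entry point for core/RDD APIs
hc = HiveContext(sc)             # separate context for SQL/Hive
ssc = StreamingContext(sc, 1)    # separate context for streaming (1-second batches)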

Spark 2.x onwards

  • SparkSession provides a single point of entry to interact with underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs.
  • All the functionality available with SparkContext is also available in SparkSession.
  • In order to use the SQL, Hive, and Streaming APIs, there is no need to create separate contexts, as SparkSession includes all of these APIs.
  • Once the SparkSession is instantiated, we can configure Spark’s run-time config properties, as sketched after the builder example below.

You can create a SparkSession instance with or without creating a SparkConf object first.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
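
Once that session exists, its run-time config properties can be read or changed through spark.conf. A minimal sketch (the property spark.sql.shuffle.partitions and its value are purely illustrative):

# Change a run-time SQL property on the session created above.
spark.conf.set("spark.sql.shuffle.partitions", "50")

# Read the property back.
print(spark.conf.get("spark.sql.shuffle.partitions"))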

You can also create a SparkSession by first building a SparkConf object:

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

c = SparkConf()
spark = SparkSession.builder.config(conf=c).getOrCreate()
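
To illustrate the single-point-of-entry idea from the bullets above, here is a small sketch showing that SQL queries and the underlying SparkContext are both reachable from the same session (the query and sample data are placeholders):

from pyspark.sql import SparkSession

# getOrCreate() returns the existing session if one is already running.
spark = SparkSession.builder.getOrCreate()

# SQL goes through the session itself; no separate SQLContext/HiveContext is needed.
spark.sql("SELECT 1 AS id").show()

# The underlying SparkContext is still reachable for RDD-based code.
print(spark.sparkContext.parallelize([1, 2, 3]).count())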

Looking to learn Big Data? Join our Best Big Data, Hadoop & Spark Developer Training Program in Bangalore to learn the concepts in depth according to industry standards.


Naveen - (Founder & Trainer @ NPN Training)

A software training institute which believes that technology has to be learnt under experienced practitioners — www.npntraining.com