In **Data Engineering**, Apache Spark is one of the most popular frameworks for processing huge volumes of data. In this blog post I am going to cover the various entry points for Spark applications and how they have evolved across releases.
Every Spark Application needs an entry point that allows it to communicate with data sources and perform operations such as reading and writing data.
In Spark 1.x, three entry points were introduced: SparkContext, SQLContext, and HiveContext.
The rest of the blog can be read using this link: Various Entry Points for Apache Spark
In this blog post I will explain the Apache Spark architecture.
Driver Program : The Spark driver program is the main entry point; it runs the main() method and is where the SparkContext is created.
SparkContext : The SparkContext is the gateway to all Spark functionality. It is similar…
Prior to Spark 2.0.0
SparkContext was used as the channel to access all Spark functionality.
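The shift can be sketched in Scala; the app name and master URL below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}

// Spark 1.x style: build a SparkContext, then wrap it in a SQLContext.
val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Spark 2.x and later: SparkSession unifies these entry points.
val spark = SparkSession.builder()
  .appName("demo")
  .master("local[*]")
  .getOrCreate()

// The underlying SparkContext is still reachable when needed.
val underlying = spark.sparkContext
```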
In this blog post we will learn the fundamentals of Containerization and Docker.
Virtualization refers to running a guest operating system on top of a host operating system, allowing developers to run multiple operating systems on different VMs that all share the same host, thereby eliminating the need for extra hardware resources.
The rest of the blog can be read using this link: Getting Started with Containerization and Docker
When joining skewed data sets, we can face stragglers during execution that slow down the whole application.
Such slow tasks are mostly the result of non-uniform data distribution across the workers of a Spark application. A possible solution is to redistribute the large data set among the available workers in a uniform manner. …
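One common way to redistribute a skewed key is salting. The sketch below assumes a large DataFrame `big` skewed on join key `id` and a smaller DataFrame `small`; the names and bucket count are made up for illustration:

```scala
import org.apache.spark.sql.functions._

val saltBuckets = 8  // assumption: tune to the observed skew

// Spread the skewed keys of the large side across random salt buckets.
val saltedBig = big.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Replicate each row of the small side once per salt value so every
// (id, salt) pair on the large side finds a match.
val saltedSmall = small.withColumn(
  "salt", explode(array((0 until saltBuckets).map(lit): _*)))

// Join on the composite key, then drop the helper column.
val joined = saltedBig.join(saltedSmall, Seq("id", "salt")).drop("salt")
```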
Most organizations store their operational data in relational databases, so there was a need for a tool that could import and export data from them. This is why Apache Sqoop was born. Sqoop integrates easily with Hadoop and can dump structured data from relational databases onto HDFS.
In this blog post I will explain multi-module projects with Apache Maven. First we will understand what a multi-module project in Maven is.
A multi-module project is built from an aggregator POM that manages a group of submodules.
In most cases, the aggregator or parent POM is located in the project’s root directory and must have a packaging of type pom.
Keep in mind that all the submodules are regular Maven projects, which can be built separately or through the aggregator POM.
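A minimal aggregator POM might look like the following sketch; the group and module names are made up for illustration:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>demo-parent</artifactId>
  <version>1.0.0</version>
  <!-- The aggregator must use pom packaging -->
  <packaging>pom</packaging>
  <modules>
    <module>demo-core</module>
    <module>demo-web</module>
  </modules>
</project>
```

Running `mvn package` from the root builds every listed module, while each submodule can still be built on its own from its directory.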
The rest of the blog can be read using this link: Multi Module Project with Maven
It is very important that we structure our code and use the language’s features to make the code more readable.
Chaining transformations helps make code more readable and easier to maintain. There are 2 ways through which we can achieve transformation chaining, i.e.
implicit classes and the DataFrame/Dataset#transform method.
The Dataset transform() method provides a concise syntax for chaining custom transformations.
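Both approaches can be sketched in Scala; the DataFrame and column names here are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, upper}

// Custom transformations written as plain functions.
def withActiveOnly(df: DataFrame): DataFrame =
  df.filter(col("active") === true)

def withUpperName(df: DataFrame): DataFrame =
  df.withColumn("name_upper", upper(col("name")))

// 1. Chaining via Dataset#transform keeps the pipeline linear:
val result = usersDf
  .transform(withActiveOnly)
  .transform(withUpperName)

// 2. The same chain via an implicit class:
implicit class UserOps(df: DataFrame) {
  def activeOnly: DataFrame = withActiveOnly(df)
  def upperName: DataFrame = withUpperName(df)
}

val result2 = usersDf.activeOnly.upperName
```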
The rest of the blog can be read using this link: Chaining Transformation in Apache Spark