Dealing with nulls in Apache Spark

As a best practice, always use null to represent missing or empty data in a DataFrame. The main reason is that Spark can optimize operations on null values better than it can on empty strings or other placeholder values.

The primary way of interacting with null values in a DataFrame is the .na subpackage, which exposes methods for dropping and filling nulls.

The Spark CSV library reads blank values and empty strings into a DataFrame as null.

The rest of this post has been migrated here.

--

Naveen - (Founder & Trainer @ NPN Training)

A software training institute which believes that technology has to be learnt under experienced practitioners — www.npntraining.com
