Dealing with nulls in Apache Spark

As a best practice, always use null to represent missing or empty data in a DataFrame. The main reason is that Spark can optimize operations on null values better than it can on empty strings or other placeholder values.

The primary way of interacting with null values in a DataFrame is the .na subpackage, which exposes methods for dropping and filling nulls.

The Spark CSV library reads blank values and empty strings into a DataFrame as null.

The rest of this post has been migrated here.

--

Naveen - (Founder & Trainer @ NPN Training)

A software training institute which believes that technology has to be learnt under experienced practitioners — www.npntraining.com
