Caching in Spark | What? How? Why?


In this blog, I will discuss one of the important concepts in Spark: caching. I will cover the concept in detail, including why caching is needed and how it helps make your program efficient.

Before delving into caching, it’s important to be familiar with how a Spark program executes, namely the concept of lazy evaluation (transformations and actions) in Spark. Let’s take a quick glance at it first…

Lazy Evaluation in Spark

Some common transformations and actions in Spark

Transformations (such as map, filter, etc.) are defined but not executed until an action triggers them. This is called lazy evaluation, meaning the transformations don’t execute immediately. Actions force the execution of the transformations and return a result.

This allows Spark to build an optimized execution plan (a Directed Acyclic Graph, or DAG) that considers all transformations, and to avoid unnecessary computation if the final result doesn’t require all of them.

Consider an example…

// Transformation: defines how to read and filter the log file, but nothing runs yet
val rdd = sc.textFile("hdfslogs.txt")
// Action: take(10) triggers the job; Spark stops once 10 matching lines are found
val top10 = rdd.filter(_.contains("WARN")).take(10)

For the above code snippet,

The execution of the filter operation is deferred until the take action is applied. Spark avoids materializing the intermediate RDD; instead, it computes only enough of the filtered RDD to produce the first 10 elements. Once these 10 elements are found, the top10 result is complete and Spark stops further computation. This saves time and space by avoiding the computation of elements that are not required for the final result.
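To observe the laziness directly, here is a minimal sketch (using a small in-memory collection instead of the HDFS file above, and assuming a local SparkContext sc): the side effect inside the transformation does not fire when filter is defined, only when the action runs.

val lines = sc.parallelize(Seq("INFO ok", "WARN disk", "WARN cpu"))

// Defining the transformation prints nothing yet -- no job is launched
val warnings = lines.filter { line =>
  println(s"checking: $line")   // side effect runs only when an action executes
  line.contains("WARN")
}

// The action triggers the actual computation; only now do the println calls appear
// (in local mode the messages show up in the driver console)
val firstWarn = warnings.take(1)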

Now let’s move on to caching.

Caching and Persistence

By default, RDDs are recomputed each time we run an action on them. Consider this example:

val rdd = sc.parallelize(1 to 10)

// First action: triggers a full computation of the RDD
val sum = rdd.reduce(_ + _)
println("First sum: " + sum)

// Some other operations or code here...

// Second action: without caching, the RDD is recomputed from scratch
val newSum = rdd.reduce(_ + _)
println("Second sum: " + newSum)

In this code, the RDD rdd is created from a parallelized collection of numbers from 1 to 10. The reduce action is then called twice on the RDD. By default, Spark recomputes the RDD each time an action is invoked on it. Therefore, even though the same operation is performed twice on the same RDD, Spark will compute it separately each time.

Now consider that if you are working with huge datasets, this characteristic of Spark will significantly slow down your program. Hence it can be beneficial to compute an RDD once and then cache it, so that recomputation is avoided.
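Here is a minimal sketch of the earlier example with caching added (assuming a local SparkContext sc and that the data fits in memory); the second reduce reads the cached partitions instead of recomputing the RDD:

val rdd = sc.parallelize(1 to 10)

// Mark the RDD for caching; the data is materialized on the first action
rdd.cache()

// First action: computes the RDD and stores its partitions in memory
val sum = rdd.reduce(_ + _)
println("First sum: " + sum)

// Second action: served from the cache, no recomputation
val newSum = rdd.reduce(_ + _)
println("Second sum: " + newSum)

// Release the cached data when it is no longer needed
rdd.unpersist()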

How to do caching in Spark?

There are many ways to persist/cache a dataset, including the following:

  • In-memory caching
  • Disk-based caching
  • In-memory caching as serialized Java Objects
  • Disk-based caching as serialized Java Objects
  • Hybrid caching (both in-memory and on-disk, with spillover to disk if dataset exceeds memory capacity to prevent recomputation)

cache() vs persist()

cache() uses the default storage level, which for RDDs is in memory as deserialized Java objects (MEMORY_ONLY). With persist(), persistence can be customized by passing the storage level you’d like as a parameter. Below is a table of the main storage levels provided by Spark. The MEMORY_AND_DISK and MEMORY_AND_DISK_SER levels are used when data doesn’t fit in memory, so the data spills over to disk. MEMORY_AND_DISK_SER stores the serialized representation in memory.

| Storage Level       | Space used | CPU time | In memory | On disk | Serialized | Recompute some partitions |
|---------------------|------------|----------|-----------|---------|------------|---------------------------|
| MEMORY_ONLY         | High       | Low      | Y         | N       | N          | Y                         |
| MEMORY_ONLY_SER     | Low        | High     | Y         | N       | Y          | Y                         |
| MEMORY_AND_DISK     | High       | Medium   | Some      | Some    | Some       | N                         |
| MEMORY_AND_DISK_SER | Low        | High     | Some      | Some    | Y          | N                         |
| DISK_ONLY           | Low        | High     | N         | Y       | Y          | N                         |
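As a sketch of how a level is selected in code (assuming the standard RDD API and a SparkContext sc), persist takes a StorageLevel constant; for RDDs, cache() is simply shorthand for persist(StorageLevel.MEMORY_ONLY):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfslogs.txt")

// Customize persistence: serialized in memory, spilling to disk when memory is full
logs.persist(StorageLevel.MEMORY_AND_DISK_SER)

// The first action materializes the persisted partitions; later actions reuse them
val warnCount = logs.filter(_.contains("WARN")).count()
val errorCount = logs.filter(_.contains("ERROR")).count()

// Drop the persisted data once it is no longer needed
logs.unpersist()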

References

https://sparkbyexamples.com/spark/spark-persistence-storage-levels/

https://www.researchgate.net/figure/Transformations-and-actions-in-Apache-Spark_tbl1_274716564

Thank you for reading! If you enjoyed this article and would like to stay updated with my future content, feel free to follow me on Medium.

Show support by buying me a coffee!
