Data Engineering On call 4

Amit Singh Rathore
Published in Dev Genius · 3 min read · Sep 9, 2023

Another day, new learnings.

Other parts of the series:

DE On call 1 | DE On call 2 | DE On call 3 | DE On call 4 | DE On call 5

Issue 1

After an update of a Spark job, the following exception occurred:

java.lang.ClassCastException: cannot assign instance of scala.None$ to field 
org.apache.spark.scheduler.Task.appAttemptId of type scala.Option in instance of
org.apache.spark.scheduler.ResultTask
.
.
.
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:466)

This is caused by a Scala version mismatch between Spark and the application. I looked at the Spark environment (the Environment tab in the UI) and found the Scala version used on the classpath.

Java Home — /usr/mware/jdk8u352/jre
Java Version — 1.8.0_352
Scala Version — version 2.12.15

Next, I checked the Scala version bundled in the application jar. A transitive dependency was pulling a different Scala version onto the classpath. After removing it (using an exclusion in the pom file), the issue was fixed.
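
To double-check which Scala runtime actually ends up on the classpath, a small probe job can print the version seen by the driver and by the executors. This is a minimal sketch; the object and app name are only illustrative:

import org.apache.spark.sql.SparkSession
import scala.util.Properties

// Minimal probe: print the Scala version loaded by the driver and by each executor.
object ScalaVersionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scala-version-check").getOrCreate()
    println(s"Driver Scala: ${Properties.versionString}")

    // Run a tiny job so the executors report the Scala version on their classpath too.
    val executorVersions = spark.sparkContext
      .parallelize(1 to 4, 4)
      .map(_ => scala.util.Properties.versionString)
      .distinct()
      .collect()
    println(s"Executor Scala: ${executorVersions.mkString(", ")}")
    spark.stop()
  }
}

If the reported versions differ from each other, or from the 2.12.15 shown in the Environment tab, the application jar is shipping its own Scala.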

Issue 2

A restart of the Spark History Server (spark2) gave the following error.

ERROR Utils: Uncaught exception in thread 
java.util.NoSuchElementException
at org.apache.spark.util.kvstore.InMemoryStore.read(InMemoryStore.java:85)
at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkAndCleanLog$3(FsHistoryProvider.scala:927)

The service check for the SHS failed because it does a regex match on the response body. This was an intermittent issue, and it resolved itself once the SHS had finished parsing all the event logs.
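
For context, the service check is essentially an HTTP probe that regex-matches the History Server's response body. A minimal sketch of that kind of check (the host, port, and pattern below are assumptions, not the actual check) might look like:

import scala.io.Source

// Hypothetical health check: fetch the SHS UI and look for an expected marker in the body.
object ShsHealthCheck {
  def main(args: Array[String]): Unit = {
    val url = "http://shs-host:18080/"    // assumed host; 18080 is the default SHS port
    val marker = "History Server".r       // assumed regex the check looks for
    val src = Source.fromURL(url)
    val body = try src.mkString finally src.close()
    if (marker.findFirstIn(body).isDefined) println("SHS check passed")
    else {
      println("SHS check failed: marker not found in response body")
      sys.exit(1)
    }
  }
}

While the SHS is still replaying event logs, the UI can return a page that does not yet match the expected pattern, which is consistent with the check passing again once parsing finished.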

Issue 3

For a job, we got the following error:

java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 95969
because the size after growing exceeds size limitation 2147483632

As we already know, BufferHolder (and hence a partition's row buffer) has a maximum size of 2147483632 bytes (just under 2 GB). If a partition grows beyond this and needs to be shuffled or buffered, we get the above error. I asked the user to repartition the data on two keys instead of one, which solved the problem.
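
For illustration, repartitioning on two columns instead of one spreads the oversized partitions out. The column names and paths below are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("repartition-example").getOrCreate()
val df = spark.read.parquet("/data/input")   // assumed input path

// Before: partitioning on a single skewed key produced partitions near the ~2 GB buffer limit.
// val out = df.repartition(col("customer_id"))

// After: adding a second key splits the hot partitions into smaller ones.
val out = df.repartition(col("customer_id"), col("event_date"))
out.write.parquet("/data/output")            // assumed output path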

Issue 4

Another job failed with the following error:

java.lang.StackOverflowError at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$5466/589672638.get$Lambda(Unknown Source)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:777)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:427)

After looking at the physical plan of the job, we found that it was very large, with many repeated sub-trees. The user code had many withColumn transformations, and the deeply nested plan was overflowing the JVM stack.

I asked the user to increase the driver's stack size. The typical default is 1024 KB; it can be increased to 4 MB by setting spark.driver.extraJavaOptions to -Xss4M.
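
To illustrate how such a plan arises (the column names and paths are hypothetical), each withColumn wraps the plan in another projection node, so a loop like this builds a very deep tree that Catalyst then traverses recursively:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("deep-plan-example").getOrCreate()
val df = spark.read.parquet("/data/input")   // assumed input path

// Hundreds of chained withColumn calls build a deeply nested plan,
// which is what exhausts the driver's JVM stack during plan traversal.
val wide = (1 to 500).foldLeft(df) { (acc, i) =>
  acc.withColumn(s"feature_$i", col("value") * i)
}

// The workaround above is applied at submit time, since driver JVM options
// cannot be changed once the driver is already running:
//   --conf "spark.driver.extraJavaOptions=-Xss4M"
wide.explain()   // printing the plan shows how large it has become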

Issue 5

A user's job failed with an OOM (Java heap space) error, visible under the Stages tab.

After enabling the additional metrics and sorting on failed tasks, I noticed that execution memory was under pressure and was causing the failure.

I asked the user to change the following parameters, and the job succeeded.

spark.memory.fraction 0.8
spark.memory.storageFraction 0.4

The above config was suggested because the executor was already sized at 28 GB and we don't allow executors larger than 32 GB, so giving execution a bigger share of the existing heap was the practical option.
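
As a sketch, the suggested settings can be applied when building the session (the 28 GB executor size and the two fractions come from the text; the defaults noted in comments are Spark's documented defaults):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-tuned-job")
  .config("spark.executor.memory", "28g")          // executor size mentioned above
  .config("spark.memory.fraction", "0.8")          // default 0.6: larger unified memory region
  .config("spark.memory.storageFraction", "0.4")   // default 0.5: smaller share reserved for storage
  .getOrCreate()

Roughly, with a 28 GB heap the unified region is (heap minus ~300 MB reserved) times spark.memory.fraction, so raising the fraction from 0.6 to 0.8 grows it from about 16.6 GB to about 22 GB, and lowering storageFraction shifts more of that region toward execution.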

Thanks !!
