Background:
In this handout, I cover some key aspects to keep in mind while tuning a Spark application, along with the tools I frequently use to gather more detailed information for better debugging.
Tools and techniques to debug performance issues:
Primary areas in Spark to tune performance:
Note:
We need to be cognizant of all the areas when tuning performance, as a change in one area has a direct impact on the others.
It has to be a balancing act: we need to find the combination of parameters across all areas that best fits the problem at hand.
Performance Improvement technique – resolve skewness in data
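One common fix is key salting. Below is a minimal sketch of the idea, assuming two hypothetical DataFrames, large and small, joined on a skewed column key; the names and the salt range are placeholders, not a prescription:

    import org.apache.spark.sql.functions._

    val saltBuckets = 8  // assumed salt range; tune to the severity of the skew

    // Spread each hot key of the skewed (large) side across saltBuckets partitions
    val saltedLarge = large.withColumn("salt", (rand() * saltBuckets).cast("int"))

    // Replicate the small side across every salt value so all salted keys still match
    val saltedSmall = small.withColumn(
      "salt", explode(array((0 until saltBuckets).map(lit): _*)))

    val joined = saltedLarge.join(saltedSmall, Seq("key", "salt")).drop("salt")

On Spark 3.x, adaptive query execution can also split skewed join partitions automatically (spark.sql.adaptive.skewJoin.enabled).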
Performance Improvement technique – use cache/persist
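A minimal sketch, assuming a DataFrame read from a placeholder path and reused by several actions; caching keeps the computed result around instead of recomputing its lineage for every action:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.storage.StorageLevel

    // For DataFrames, cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK)
    val events = spark.read.parquet("/data/events")   // placeholder path
      .persist(StorageLevel.MEMORY_AND_DISK)

    events.count()                                    // first action materializes the cache
    events.filter(col("status") === "ok").count()     // later actions reuse the cached data

    events.unpersist()                                // free the storage when done

Only cache data that is actually reused; a cache that is never read again just wastes executor memory.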
Performance Improvement technique – use seq.par.foreach
Allows items in a sequence (Seq in Scala) to be processed in parallel, improving processing time. We need to be careful that the logic does not depend on shared mutable state, or it will create a race condition and non-deterministic results.
Example: if we have a foreach loop that processes (reads and/or transforms) files whose names are stored in a Seq, replacing foreach with seq.par.foreach allows the files to be processed in parallel and gives a performance benefit, as in the sketch below.
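A minimal sketch, with placeholder paths; each file is written to an independent output, so the parallel bodies do not share mutable state:

    import org.apache.spark.sql.functions.col

    val fileNames = Seq("/in/a.csv", "/in/b.csv", "/in/c.csv")  // placeholder paths

    // .par converts the Seq into a parallel collection; the closure runs on multiple
    // threads, and SparkSession is safe to use for submitting jobs concurrently
    fileNames.par.foreach { path =>
      spark.read.option("header", "true").csv(path)
        .filter(col("status") === "ok")
        .write.mode("overwrite")
        .parquet(path.stripSuffix(".csv") + ".parquet")
    }

Note that in Scala 2.13 the .par operation moved to the separate scala-parallel-collections module.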
Performance Improvement technique – use proper joining strategies
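For example, when one side of a join fits in memory, a broadcast hash join ships the small table to every executor and avoids shuffling the large side at all. A minimal sketch, assuming hypothetical DataFrames facts (large) and dims (small):

    import org.apache.spark.sql.functions.broadcast

    // The broadcast hint tells Spark to replicate dims to all executors
    // instead of shuffling both sides for a sort-merge join
    val joined = facts.join(broadcast(dims), Seq("dim_id"))

Spark also broadcasts automatically when a side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default).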
Performance Improvement technique – use right serializer
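For RDD-heavy jobs, Kryo serialization is usually faster and more compact than the default Java serialization. A minimal configuration sketch; MyEvent and MyRecord are hypothetical classes standing in for your own types:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo write a numeric ID instead of the full class name
      .registerKryoClasses(Array(classOf[MyEvent], classOf[MyRecord]))

Note that DataFrames and Datasets use Spark's own encoders internally, so this setting matters mainly for RDD operations and shuffles of user objects.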
A few simple handy tricks
Keep the larger dataset on the left side of the join – Spark implicitly tries to shuffle the right-hand dataset first
Apply appropriate partitioning –
too few partitions = less parallelism
too many partitions = scheduling overhead, more data shuffling
Try not to use UDFs
UDF execution flow – deserialize every row to an object > apply the lambda function > re-serialize the row
This flow puts pressure on garbage collection by generating a lot of garbage
Avoid UDFs especially when not using Scala/Java, as every call must move data between the JVM and the external runtime (e.g., a Python process)
Apply all filters to the dataset as early as possible in the processing (see the sketch after this list)
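A minimal sketch illustrating the last three tricks together: replacing a UDF with an equivalent built-in function, filtering early, and right-sizing partitions. The column names, date literal, and partition count are placeholders:

    import org.apache.spark.sql.functions._

    // Avoid: the UDF forces per-row (de)serialization and is opaque to the optimizer
    val toUpperUdf = udf((s: String) => s.toUpperCase)
    val viaUdf = orders.withColumn("code", toUpperUdf(col("code")))

    // Prefer: the built-in function stays inside Tungsten's optimized row format
    val viaBuiltin = orders.withColumn("code", upper(col("code")))

    // Filter before the join so less data is shuffled
    val recent = orders.filter(col("order_date") >= "2024-01-01")

    // Right-size partitions for the downstream work (200 is a placeholder)
    val result = recent.join(customers, Seq("customer_id"))
      .repartition(200, col("customer_id"))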
A few advanced configurations
JVM parameters for memory –
Reduce long garbage collection (GC) times
Use the Spark UI to check time spent in tasks vs. time spent in garbage collection
The first step of GC tuning is to enable GC logging (see the sketch after this list)
Analyze the log to look for:
How frequently GC occurs
How much memory is cleaned up
Algorithm-specific (G1GC, ParallelGC) stage information such as minor GC, major GC, and full GC
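A minimal sketch of enabling GC logging through Spark configuration; the flags below are the standard JVM 8 ones (JVM 9+ uses the unified -Xlog:gc* syntax instead):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("gc-tuning-demo")  // placeholder name
      // Emit detailed, timestamped GC events to each executor's stdout
      .config("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
      .getOrCreate()

The output lands in each executor's stdout, which is reachable from the Executors tab of the Spark UI.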