Spark Run local design pattern
Many Spark applications have become legacy applications that are very hard to enhance, test, and run locally.
Spark has very good testing support, but many Spark applications are still not testable.
I will share one common error that appears when you try to run some old Spark applications locally.
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
    at scala.Option.getOrElse(Option.scala:121)
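For context, this error typically comes from an entry point that builds a SparkSession without setting a master URL. That works under spark-submit, which supplies the master, but fails when you launch the job directly from an IDE. A minimal sketch of such a legacy entry point (illustrative only, not the original application's code):

import org.apache.spark.sql.SparkSession

object LegacyJob {
  def main(args: Array[String]): Unit = {
    // No master is configured, so SparkSession expects spark-submit to provide one.
    // Launched directly from an IDE, this throws
    // "A master URL must be set in your configuration".
    val sparkSession = SparkSession.builder()
      .appName("Legacy Job")
      .getOrCreate()
    // ... job logic ...
  }
}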
When you see such an error, you have two options:
– Accept that it can't run locally and continue working with this frustration
– Fix it so it runs locally, and show your team an example of the Boy Scout Rule
I will show a very simple pattern that will save you from such frustration.
def main(args: Array[String]): Unit = {
  // Decide whether this is a local run (e.g. from an IDE) or a cluster run.
  val localRun = SparkContextBuilder.isLocalSpark
  // Build the SparkSession accordingly; only local runs need to set a master URL.
  val sparkSession = SparkContextBuilder.newSparkSession(localRun, "Happy Local Spark")
  val numbers = sparkSession.sparkContext.parallelize(Range.apply(1, 1000))
  val total = numbers.sum()
  println(s"Total Value ${total}")
}
This code uses the isLocalSpark function to decide whether it is running in local mode. You can use any technique to make that decision, such as an environment variable, a command-line parameter, or anything else.
Once you know it is a local run, create the Spark session based on that, as in the sketch below.
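The post does not show SparkContextBuilder itself, so here is a minimal sketch of what it might look like. It assumes local mode is toggled by a spark.local JVM system property; that property name and the local[*] master are assumptions for illustration, not something the original code prescribes:

import org.apache.spark.sql.SparkSession

object SparkContextBuilder {

  // Assumption: a -Dspark.local=true JVM property marks a local run.
  // An environment variable or command-line flag works just as well.
  def isLocalSpark: Boolean =
    sys.props.get("spark.local").contains("true")

  def newSparkSession(localRun: Boolean, appName: String): SparkSession = {
    val builder = SparkSession.builder().appName(appName)
    // Set the master only for local runs; on a cluster, spark-submit supplies
    // the master URL, which is exactly what avoids the exception above.
    val withMaster = if (localRun) builder.master("local[*]") else builder
    withMaster.getOrCreate()
  }
}

With this in place, running from the IDE with -Dspark.local=true creates a local[*] session, while a cluster run through spark-submit gets its master URL the usual way.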
Now this code can run locally as well as via spark-submit.
Happy Spark Testing.
The code used in this blog is available in the runlocal repo.