How/Why to run Spark History Server on Windows for the COMPLETE(ly) beginner/confused.

Jordan Tang
4 min readMay 5, 2021

Hi,

I’ve recently been learning Apache Spark and I am totally impressed with the number of tutorials (videos, articles, questions) and the amount of knowledge out there. It is such a great time to be alive and a student compared to just a few years ago.

One interesting part of Spark is figuring out how to optimize your workload. A key component of this process is viewing the DAG (directed acyclic graph). It is basically a graphical representation of how Spark is doing what Spark does…

DAG image… Cool right?

Now, this can be viewed by default by opening your browser and navigating to localhost:4040. This loads the Spark UI, and it has a ton of cool stuff in there. The main problem I ran into when doing an optimization exercise was that the Spark UI would disappear as soon as the job was complete! I was totally confused about how to view the DAG and optimize the code.

Luckily the internet is great and with a google search, I found an article that Eyal Dahari wrote on this very topic.

The History Server is usually needed after Spark has finished its work. Many times you need it after running spark-submit. While spark-submit is running, you can monitor Spark activity through the active monitor on port 4040. But many times you want to monitor after the fact. For this use case and many others, you’ll need the Spark History Server.

I want to expand on Eyal’s tutorial because there were some parts that confused me, and writing them out helped me think them through.

All history server configurations should be set at the spark-defaults.conf file (remove .template suffix) as described below

You should go to spark config directory and add the spark.history.* configurations to %SPARK_HOME%/conf/spark-defaults.conf. As follows:

If this is a fresh Spark installation, navigate to your Spark directory’s conf folder; there should be a spark-defaults.conf.template file. Open it with whichever text editor you use (I use Notepad++). You will want to add these lines to the bottom.

spark.eventLog.enabled true

spark.history.fs.logDirectory file:///c:/logs/dir/path

You want to place your directory after the “file:///” prefix, i.e. ‘file:///C:/Users/Jordan/Spark/Logs’

Images speak 1000 words.. or more

I actually needed to add another line here because I was getting other errors:

spark.eventLog.dir file:///c:/logs/dir/path

You have to create the path yourself, as Spark won’t create it.
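For reference, once all three lines were in, the bottom of my spark-defaults.conf looked roughly like this (the log path is just the example from above; point it at whatever directory you created):

```
spark.eventLog.enabled           true
spark.eventLog.dir               file:///C:/Users/Jordan/Spark/Logs
spark.history.fs.logDirectory    file:///C:/Users/Jordan/Spark/Logs
```

Note that spark.eventLog.dir is where jobs write their logs and spark.history.fs.logDirectory is where the History Server reads them from, so pointing both at the same folder is the simplest setup.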

I unknowingly made a mistake by installing my Spark in Program Files: when you run Spark from cmd without admin mode, a whole lot of permission errors occur. Therefore, I created a new location just for the logs.
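If you’d rather create the log folder from code than click around in Explorer, a few lines of Python work anywhere (the path below is just the example from earlier; swap in whatever you put in spark-defaults.conf):

```python
from pathlib import Path

# Example log location; match whatever you set in spark-defaults.conf
log_dir = Path.home() / "Spark" / "Logs"

# parents=True creates any missing parent folders;
# exist_ok=True means no error if it already exists
log_dir.mkdir(parents=True, exist_ok=True)
```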

When that is complete, save the file and make sure to remove the ‘.template’ part of the file name. This makes it a real .conf (config) file.

Now that the setup is complete, let’s run the server!

After configuration is finished run the following command from %SPARK_HOME%

bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer

In practice: open up cmd, change the directory to wherever your Spark is installed, and then run the command above.

cd C:\Program Files\spark-3.1.1-bin-hadoop2.7

A successful output should look something like this:

Success!

Now use your browser and enter ```http://localhost:18080/```. It should bring you to the Apache Spark History Server.
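If the page doesn’t load, a quick sanity check is to see whether anything is actually listening on port 18080. Here is a small helper I’d sketch in plain Python (no Spark needed; `port_open` is just a name I made up):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection handles DNS resolution and the connect timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the History Server running, this should report True
print(port_open("localhost", 18080))
```

If it reports False, the server process probably isn’t running (or exited with an error), so go back and check the cmd window you started it from.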

Let me know if you have any questions because I probably have the same one (:

Thanks!
