In my previous post, Spark vs Tableau Extract, I introduced you to Spark and asked whether Spark could beat Tableau Extracts on performance. We will now delve deeper into that question, but let's first set up the Spark environment correctly.
Installation & Setup (Windows 10)
Apache Spark is not too difficult to install and use locally, although it is less straightforward on Windows than on Linux or Mac:
- Install Java Development Kit from here.
You need to install the JDK into a path with no spaces; I used C:\jdk.
- Download Spark here.
There is no installation step here, as Spark comes prebuilt; just unzip the folder into a newly created Spark directory, e.g. C:\spark.
- Download winutils.exe here.
This is done instead of installing Hadoop, which is unnecessary for the purposes of our experiment. Winutils can perform the necessary functions which Spark usually requires of Hadoop.
Place winutils.exe in a new directory – C:\winutils\bin
- Add new environment variables; how to do this varies from one version of Windows to another. Add the following new USER variables:
- SPARK_HOME c:\spark
- JAVA_HOME c:\jdk
- HADOOP_HOME c:\winutils
Add the following paths to your PATH user variable: %SPARK_HOME%\bin and %JAVA_HOME%\bin.
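If you prefer the command line over the Windows GUI, the same variables can be set with setx (the paths below assume the install locations chosen above; adjust them if yours differ):

```shell
:: Set the user environment variables Spark expects
setx SPARK_HOME C:\spark
setx JAVA_HOME C:\jdk
setx HADOOP_HOME C:\winutils

:: Append the Spark and Java bin directories to the user PATH.
:: Open a NEW command prompt afterwards to pick up the changes.
setx PATH "%PATH%;%SPARK_HOME%\bin;%JAVA_HOME%\bin"
```

One caveat: setx truncates values longer than 1024 characters, so if your PATH is already long it is safer to edit it through the Environment Variables dialog.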
- Test it out: Open the command prompt and type spark-shell. You should see something like the following screen:
If you have followed the steps to this point, then you have a running version of Apache Spark on your local machine. The genius of Spark is that it handles the scheduling and distribution of work across whatever resources you assign it. More on that next.
There are different ways to run a Spark application, but for the purposes of this post we will be submitting Spark commands interactively through the spark-shell.
Basic Spark Architecture, Cluster Management
Spark is built to automatically make the best use of the resources available to it. There are many ways to improve performance depending on the type of operations you want to perform, but out of the box it should run efficiently at its primary function: allocating resources to tasks in the appropriate order and making use of all available parallel computing power. Spark works out which groups of tasks in a data-processing job can be done independently, and which need to be done sequentially. It then decides where to place different shards of data (RDDs – Resilient Distributed Datasets) to make optimal use of all the available computing resources. Spark determines how data needs to be shuffled (more on this later) to complete the overall job, and organises tasks into stages accordingly. There are multiple tasks per stage, and multiple stages per job.
When you start Spark after installing, it is running locally. You can see where it says "master = local[*]" that Spark is running on all local cores available on your machine. By default, Spark starts by using about half a GB of RAM for its driver program, which is the only Spark process at work when you are running a Spark application locally. If you want to increase the amount of memory available to the driver program, you can do that when starting the application via spark-shell. For example, if you want to start Spark with 4 GB of driver memory, you pass the setting --driver-memory 4g:
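Concretely, the launch command looks like this (the setting only takes effect when the shell starts; it cannot be changed for an already running session):

```shell
:: Launch the interactive shell with a 4 GB driver heap
spark-shell --driver-memory 4g
```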
You will get the most use out of Spark when using it on top of a cluster manager, as it's really all about parallel computing. Spark has its own standalone cluster manager, and it also works with Hadoop YARN and Apache Mesos. Unfortunately, YARN and Mesos only run on Linux. There are other guides on how to use YARN and Mesos, but Spark's standalone cluster manager serves the purpose of this post. The Spark folder ships with shell scripts for starting Spark master and worker nodes, for starting up a server which can be connected to via JDBC (Java Database Connectivity), and for other useful operations, but they are written for Linux only. On Windows, it can be a little bit trickier, or at least a little bit more of a manual process.
As the documentation on the Spark site states for Windows, you will need to start the Spark master and worker nodes by hand, as opposed to having a script do the work for you. Firstly, you need to have Spark installed, as described above, on all the machines you want to use with Spark. You will need to make sure that the machines you are looking to link can communicate with each other across a network, and that there is no firewall interference. Then open the command line on the machine you wish to use as the master node. On that machine, navigate to c:\spark\bin and run:
spark-class org.apache.spark.deploy.master.Master
This will start the master node and give you the IP and port information – spark://IP:PORT – when you check the Spark UI at localhost:8080. Using the IP and port information, you can then start as many worker nodes as you wish by entering the below command in the command lines of the target machines:
spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
You only need to start one worker node per machine.
Once your cluster is up and running, you can launch your driver application on it by entering the IP and Port details of your master node as you start your spark-shell:
spark-shell --master spark://IP:PORT
For the purposes of this post, I started both the master and the worker nodes on my machine. Looking at the Spark UI after starting the application, I can see that I have one worker available to my master node, and there is one application running:
To have a cluster you need at least one master node and one worker node. The only advantage of "running a cluster" on a local machine, rather than just running Spark in local mode as shown above, is that with a cluster I can specify the number of executors I would like to use. An executor is specific to an application, and is responsible for the execution of tasks. It is optimal to have no more than 4 or 5 cores per executor. There is no performance improvement in doing this on a single machine instead of running in local mode, but it is good to know the steps, because they are required should you wish to work with an actual cluster of machines. In spark-shell it is possible to specify the number of cores I would like per executor; because there are 8 cores and I request 4 per executor, I will get 2 executors. I could also specify --num-executors 2.
spark-shell --master spark://XXX.XXX.X.X:7077 --executor-cores 4
The above screenshot shows that I have 4 cores per executor, plus the driver executor. The driver program always runs on the master node for the Spark application, so we can be sure that the other two executors reside on the one worker node which I also started. In a real cluster situation, where you are dealing with at least two separate machines, one master and one worker, you would need to have at least one core available for the driver, which is the default configuration. As it stands, on the one machine, Spark will take resources from elsewhere as it needs them.
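Putting the steps above together, the full local-cluster sequence on one Windows machine looks roughly like this (spark://IP:PORT is a placeholder for whatever the master reports in its UI at localhost:8080; each command runs in its own prompt from c:\spark\bin):

```shell
:: 1. Start the master node
spark-class org.apache.spark.deploy.master.Master

:: 2. In a second command prompt, start a worker and point it at the master
spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

:: 3. In a third prompt, launch the driver application against the cluster,
::    requesting 4 cores per executor
spark-shell --master spark://IP:PORT --executor-cores 4
```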
In my next post we can begin to answer the question: can Spark be used to beat a Tableau Extract?