I ended my previous post by asking: having replaced the Tableau Data Extract with a live connection to Apache Spark, how does it perform? Now for the verdict.
The verdict: Tableau Data Extract vs Apache Spark
…and here is the answer…
Not that well.
One of the great things about Spark is its user interface. Navigating to localhost:4040 while we make selections around the dashboard gives us an in-depth view of what's going on. Let's look at the three stages of a job that ran as a view was being rendered in Tableau.
Spark split the work of the job into three stages. Because of the way the data needed to be processed, these stages could not have been merged into a smaller number. Spark works out which tasks can run at the same time and which must run sequentially, and draws stage boundaries around that schedule.
We can see the four cores on each of the executors completing individual tasks as part of each stage. In every stage apart from the first, each of the 8 cores is constantly working: as soon as a core finishes the task it has been processing, it picks up the next one and starts again. Only in the first stage do we see one core taken up by an unusually large amount of work, and it keeps working well after the others have finished. The colour legend also shows that this core spends all of its time in 'Executor Computing'. This might lead us to believe that there is something about the count distinct operation Tableau is asking for that cannot be spread across multiple cores.
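One plausible explanation for that lone busy core is data skew. As a sketch in plain Python (a hypothetical stand-in for Spark's partitioning, not actual Spark code, with made-up example keys): when rows are hash-partitioned by the grouping key and one key dominates the data, a single partition, and therefore a single core, receives most of the work.

```python
from collections import defaultdict

def hash_partition(rows, key_index, num_partitions):
    """Assign each row to a partition by hashing its grouping key,
    roughly how keyed data gets distributed across cores."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions

# Skewed data: one hypothetical key accounts for most of the rows.
rows = [("big_customer", i) for i in range(9000)]
rows += [(f"customer_{i}", i) for i in range(1000)]

partitions = hash_partition(rows, key_index=0, num_partitions=8)
sizes = sorted(len(p) for p in partitions.values())
# One partition ends up far larger than the rest, so the core
# processing it keeps working long after the others have finished.
```

Under this reading, the long-running task is not proof that count distinct is inherently single-threaded; it may simply be that all the rows for one heavy key land on one core.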
In the second stage there are many smaller tasks to complete. The first and most obvious thing about them is that computing the results we are after is not what takes the most time; most of it is spent in 'Shuffle Write Time'. The third stage shows a similar pattern: many small tasks, with most of the time taken up by 'Shuffle Read Time'. It's worth noting that the number one recommendation for improving the performance of Spark jobs is to reduce shuffle time. So, what is shuffle read / write time?
A shuffle happens in Spark after operations that require data to be moved from one executor or node to another, such as reduceByKey and groupByKey, which redistribute data by key. The Tableau view we are rendering undoubtedly does this, since we are counting the distinct values of one group by another group. So it's perhaps unavoidable.
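To see why a distributed count distinct forces a shuffle, here is a rough illustration in plain Python (again, not Spark's actual implementation, and the regions and names are invented): every value belonging to the same group must end up in the same place before duplicates can be eliminated, and that movement of data is what Spark reports as shuffle write and read time.

```python
from collections import defaultdict

# Map side: each "executor" holds its own slice of the rows
# and emits (group, value) pairs.
executor_1 = [("east", "alice"), ("east", "bob"), ("west", "carol")]
executor_2 = [("east", "alice"), ("west", "dave"), ("west", "carol")]

# Shuffle: redistribute the pairs so that all values for a given
# group land together. In a real cluster this crosses the network.
shuffled = defaultdict(set)
for group, value in executor_1 + executor_2:
    shuffled[group].add(value)  # the set drops duplicates per group

# Reduce side: the distinct count per group.
distinct_counts = {group: len(values) for group, values in shuffled.items()}
print(distinct_counts)  # {'east': 2, 'west': 2}
```

Note that "alice" appears on both executors; only after the shuffle brings both copies together can the duplicate be detected, which is exactly why the count cannot be finished locally on each executor.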
The conclusion: Tableau Data Extract vs Apache Spark
In total, rendering one view in Tableau using Count Distinct across millions of rows took over two minutes on my laptop. That is much worse performance than a normal Tableau extract.
Spark is considered high-latency software: it carries a high overhead regardless of the operation you are trying to perform. Spark is not built to be an interactive query engine. It is made for large-scale batch processing, where it performs very well in distributed environments, and for those jobs the fact that Spark will not respond in milliseconds does not matter.