The purpose of this series of articles on Apache Spark vs Tableau Extract is threefold:
- To provide introductory details on Spark: its architecture, its strong points, and some of its weak points.
- To explain how to get started with Spark: how to install it and how to administer it for use as a regular database server.
- To introduce a question that gives us a framework for achieving the points above.
Apache Spark is a common choice for large-scale data processing tasks. It can be deployed to production with confidence in its stability, yet once the basics of the Scala language are learned, it can feel as lightweight and familiar for everyday data-munging tasks as Python’s Pandas.
Aside from its usability, it has many other strengths. Firstly, it’s highly scalable: it can be distributed across any number of nodes and cores. Even better, you can ensure that it never holds more resources than it needs, and can acquire more when required. This can be achieved, for example, by using Amazon’s Elastic MapReduce service and enabling Spark’s dynamic allocation (the `spark.dynamicAllocation.enabled` configuration option). Secondly, Spark is fast: data can be cached in memory, and columnar formats such as ORC or Parquet allow fast reads from disk. Thirdly, it’s highly customisable: a knowledgeable administrator can take full advantage of the information in the Spark UI, tuning different parameters to exploit the full capabilities of parallel processing.
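As a sketch of what enabling dynamic allocation looks like, the settings below could go in `spark-defaults.conf` (the executor limits are illustrative values, not recommendations; the external shuffle service is required for dynamic allocation on YARN):

```
# Let Spark grow and shrink its executor pool with the workload
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true

# Illustrative bounds on the executor pool
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   20
```

The same properties can be passed on the command line via `spark-submit --conf`.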
Can Spark beat Tableau Extracts for performance?
We started out with this naïve question after finding that Tableau seemed slow on expensive Count Distinct operations across millions of rows. Making some selections on a dashboard meant waiting 10-20 seconds for a response, somewhat ruining the interactive experience. It became a requirement to improve this performance for a client.
The first idea was to reshape the data, transforming labels into additive measures. Had this been possible, it would have been the correct way to address the problem; however, the data could not be reshaped in that way. Questions needed to be asked dynamically across multiple levels of aggregation:
This is an example of the complex structure of the data. Taking one row from Main Table 1, we can see all the relationships with other tables, and the possible associated rows. Flattening all of this out into a single dataset to be read by Tableau produces many duplicate rows. We are working across two levels of aggregation in the main tables (known as Fact Tables in the Kimball methodology). If, for example, we wanted to count distinct values of some attribute from Main Table 2, categorised by another attribute in Subsidiary Table 2, that is a calculation which can only be done at runtime with a Count Distinct operation. You do not know how many of the duplicated rows will feed into the calculation before the user makes a selection, so there is no way to assign a fractional weight to the attribute from Main Table 2 that would total correctly: the number of rows to categorise varies with the selection the end user makes.
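To make the non-additivity concrete, here is a minimal sketch in Python. The table and column names are invented for illustration; the shape of the problem is the same as in the flattened dataset described above:

```python
# A hypothetical flattened join: each Main Table 2 entity appears once per
# related Subsidiary Table 2 row, so entities are duplicated.
rows = [
    {"main2_id": "A", "sub2_category": "x"},
    {"main2_id": "A", "sub2_category": "y"},
    {"main2_id": "B", "sub2_category": "x"},
    {"main2_id": "B", "sub2_category": "x"},  # duplicate via another join path
]

def count_distinct(rows, category):
    # The runtime Count Distinct: it can only be evaluated once the
    # user's category selection is known.
    return len({r["main2_id"] for r in rows if r["sub2_category"] == category})

print(count_distinct(rows, "x"))  # distinct entities in category "x" -> 2
print(count_distinct(rows, "y"))  # -> 1
```

A naive SUM over the duplicated rows would report 3 for category "x" rather than 2, and no fixed fractional weight on each row fixes this, because the set of rows in play changes with every selection. That is why the measure cannot be made additive ahead of time.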
Having the database in this format was the main hindrance to performance, but the client required that it remain in its relational structure, as the database was used for other reporting and shipped as part of a product. Short of changing from a relational database to a graph database or some other NoSQL technology, no remodelling would have improved the situation. We could, however, introduce some middleware, and this sent us looking for options. Apache Spark, at that time, seemed like a good candidate for the reasons listed above. We knew that Tableau Extracts were fast and could scale linearly up to a point, but could Spark outperform them if the data passed 100 million rows, for example? Could it outperform Tableau Extracts when the dashboard relies solely on expensive Count Distinct measures?
We will explore these questions in the next few articles.
You can read Part 2 here.