In my previous post, we set the stage by completing the Spark Installation & Setup (Windows 10). We will now dig deeper into the Tableau TDE vs Apache Spark "head-to-head".
So, who will win in the Tableau TDE vs Apache Spark stand-off?
To test this, I first created several worksheets in Tableau using count-distinct measures, built on a Tableau extract of a SQL Server sample database table with over 12 million rows. I placed those worksheets on a dashboard and began making selections. Each selection took roughly 15-20 seconds to populate the dashboard. With this benchmark, we can check whether Apache Spark can match or even beat this level of performance.
To set Spark up to take requests from Tableau, we need to use Spark SQL. Spark SQL turns RDDs (Resilient Distributed Datasets), Spark's usual way of representing data, into dataframes with row and column structure. More information on Spark dataframes can be found here. Most importantly for the purposes of this experiment, Spark SQL dataframes can be queried using regular SQL. Spark SQL also comes pre-built with support for Apache Hive, a data-warehouse-style tool which allows Spark to persist data to disk as needed.
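As a brief illustration of that last point (my own sketch, assuming a dataframe `df` already exists in the shell; the table name `sales` and the column names are hypothetical), a dataframe can be registered and queried with ordinary SQL:

```scala
// Sketch: assumes a SQLContext (sqlContext) and a DataFrame (df) already exist.
// Register the dataframe as a temporary table so it can be queried with SQL.
df.registerTempTable("sales")

// Query it with regular SQL; the result is itself a DataFrame.
val topCustomers = sqlContext.sql(
  "SELECT CustomerKey, COUNT(DISTINCT SalesOrderNumber) AS orders " +
  "FROM sales GROUP BY CustomerKey ORDER BY orders DESC LIMIT 10")
topCustomers.show()
```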
We also need to configure a Thrift Server to use the same context as Spark SQL. This is a simple server which exposes connection details that Tableau can connect to.
With Spark SQL in place and the Thrift Server running, we have everything we need to connect to Spark from Tableau.
Start spark-shell with the following command:
spark-shell --master spark://XXX.XXX.X.X:7077 --executor-cores 4 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.tungsten.enabled=true --conf spark.executor.extraClassPath=C:\MicrosoftJDBC_SQLServer\sqljdbc_6_0\enu\jre8\sqljdbc42.jar --driver-memory 1g --executor-memory 5g --driver-class-path C:\MicrosoftJDBC_SQLServer\sqljdbc_6_0\enu\jre8\sqljdbc42.jar
Setting the master URL points the application driver program at the master node. This node is also linked to the worker node, and executors will be provisioned using this IP.
Setting the number of cores per executor.
The Kryo serializer is reported to be much quicker than Spark's default Java serializer. Serializers encode data into a form that can be stored or transferred, and this process can be slow as data is transformed from one representation to another.
Spark's Tungsten project reports significant speed-ups from various low-level improvements. I haven't tested whether these gains apply to our example, but it doesn't hurt to enable it.
Setting the class paths for both the driver and the executor. This allows the application to use the Microsoft JDBC driver to read from and write to SQL Server databases.
--driver-memory 1g --executor-memory 5g
Setting the RAM usage for the driver and for each of the executors.
Below are all the commands we need to use to take data from a SQL Server, load it into Spark, and have it available to be read by Tableau.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Create a new Spark SQL context from the overall application's Spark context (sc).
val jdbcSqlConnStr = "jdbc:sqlserver://localhost:1433;instance=SQLEXPRESS;databaseName=ContosoRetailDW;user=spark;password=spark;"
Define a connection string to be used later to connect to SQL Server.
val jdbcDbTable = "dbo.FactOnlineSales"
Define a string with the name of the table which we are looking to read from.
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> jdbcSqlConnStr, "dbtable" -> jdbcDbTable)).load()
Using the connection string and table name defined above, call Spark SQL's JDBC reader to load the data into a Spark dataframe. Spark will hold this data in memory where possible and spill to disk when necessary. The great thing about Spark is that it does all of this automatically and intelligently.
sqlContext.sql("create database if not exists test")
sqlContext.sql("drop table if exists test.FactOnlineSales")
We can use simple SQL language to create a database, and other operations common to relational databases, in Spark SQL.
Here we save the in-memory dataframe to a persistent table. It's not necessary to have an existing Hive deployment; Spark will create one as part of this step. The dataframe is now saved to disk and will persist even after the application has been closed.
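The save step itself can be sketched as follows (my own sketch; the database and table names follow the SQL commands above):

```scala
// Write the JDBC-loaded dataframe out as a persistent, Hive-managed table.
// If no Hive deployment exists, Spark creates a local metastore automatically.
jdbcDF.write.mode(org.apache.spark.sql.SaveMode.Overwrite)
  .saveAsTable("test.FactOnlineSales")
```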
Setting the port number for a potential Thrift Server which will allow Tableau access in future.
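That configuration step looks roughly like this (a sketch; port 10001 is just an example value):

```scala
// Choose the port the Thrift Server will listen on when it is started later.
sqlContext.setConf("hive.server2.thrift.port", "10001")
```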
Caching the new table in memory again, for speed.
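The caching step can be expressed in plain SQL as well (a sketch, using the table created above):

```scala
// Pull the persisted table back into memory so queries avoid disk reads.
sqlContext.sql("cache table test.FactOnlineSales")
```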
Finally, starting the Thrift Server with the Spark SQL context which we have been using. This gives us an actual server which Tableau can connect to via a JDBC connector.
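The start-up call looks roughly like this (a sketch; in Spark 1.x builds with Hive support, the shell's sqlContext is a HiveContext, which is what startWithContext expects):

```scala
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Start the Thrift Server inside this application, sharing its SQL context,
// so cached tables remain visible to JDBC clients such as Tableau.
HiveThriftServer2.startWithContext(sqlContext)
```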
As we are now ready to go, what are the results? Having replaced the extract with a live connection to Apache Spark, how does it perform? I will answer these questions in my next post!