Where does testing Spark machine learning algorithms come into play? During my last classification project using R and Alteryx, I faced a real-life data problem. The work involved not just predicting one status or another for the target variable, but also prescribing clear actions to increase the odds of success. Unfortunately, when predictive meets prescriptive in classification, the only ML algorithm I could use was logistic regression, because it let me understand how each input contributes to the result of the equation.

What if we had a predictive-only classification problem? In this example I use a similar data set, i.e. one variable to be classified and ten independent variables. To save time, I pre-validated the set and the relevance of all the independent variables by testing their z-values. The project tries to see whether reaching the status of brand ambassador can be correctly anticipated from the information available on the current order plus the order history.
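The z-value check mentioned above can be sketched as follows. This is a minimal illustration in Python, not the R code used for the article: it assumes the z-values are Wald statistics (coefficient estimate divided by its standard error), as reported by a typical logistic regression summary. The variable names and numbers are hypothetical.

```python
from math import erf, sqrt


def z_value(estimate, std_error):
    """Wald z statistic: coefficient estimate divided by its standard error."""
    return estimate / std_error


def two_sided_p_value(z):
    """Two-sided p-value of a z statistic under the standard normal distribution."""
    cdf = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    return 2.0 * (1.0 - cdf)


# Hypothetical (estimate, standard error) pairs, for illustration only.
coefficients = {
    "orders_to_date": (0.42, 0.10),
    "basket_size": (0.05, 0.08),
}

for name, (estimate, std_error) in coefficients.items():
    z = z_value(estimate, std_error)
    p = two_sided_p_value(z)
    verdict = "keep" if p < 0.05 else "review"
    print(f"{name}: z = {z:.2f}, p = {p:.4f} -> {verdict}")
```

A variable whose |z| is well above roughly 1.96 is significant at the usual 5% level; the 5% cut-off is a convention assumed here, not something stated in the article.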

So, Spark Machine Learning Algorithms

As I mentioned in a previous post, I use Spark to process larger data sets on my own computer using R. Classical data frames in R are held in memory, so on a simple laptop they cannot really be used for advanced algorithms on large data. I use sparklyr so that I can write the data manipulation in tidyverse syntax, for speed and ease of use. Spark comes with a set of machine learning algorithms for classification; six of them are tested in this article.

Inspired by a Titanic example provided by the people behind RStudio (and a lot of very useful packages), and reusing most of their code, I’ve been able to do a quick comparison of the accuracy, speed and internals of the six machine learning models.

Data set (csv): final_data
R file: brand_embassador_data_ml_classification_spark_simplified.R

The first conclusion is (as expected) that logistic regression is not the most accurate ML algorithm one can use for classification, although it has the advantage of greater transparency about the factors. In my example, as with the Titanic data in the original example, Random Forest came first in terms of accuracy (a simple comparison of the predicted values, split at 50%, with the real ones in the training set). The second most accurate model, and very close, was Gradient Boosted Trees, while for the Titanic survival set it was the Decision Tree. I suspect this is because I added a categorical variable that creates more of the weak learners that Boosted Trees base their classification on. Note that in terms of AUC (area under the ROC curve) the same two models come first, about a quarter of a percentage point apart, in reverse order.

Model                    Accuracy (%)   AUC (%)
Random Forest            78.26137284    86.04748666
Gradient Boosted Trees   77.35849057    86.28707109
Decision Tree            77.04595439    84.73993033
Logistic                 75.26334066    74.80810257
Neural Net               71.13091793    70.2966332
Naive Bayes              63.28278736    62.27635421
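To make the two metrics concrete, here is a minimal sketch in plain Python (not the sparklyr code behind the table). It assumes accuracy is computed by cutting the predicted probability at 0.5, and AUC is the area under the ROC curve, computed here through its rank interpretation: the share of (positive, negative) pairs in which the positive case gets the higher score. The labels and scores are made up for illustration.

```python
def accuracy_at_threshold(labels, scores, threshold=0.5):
    """Share of cases where the thresholded score matches the true label."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = became brand ambassador, 0 = did not.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.45, 0.7, 0.6, 0.2]

print("accuracy:", accuracy_at_threshold(labels, scores))
print("AUC:", auc(labels, scores))
```

Accuracy depends on the chosen cut-off, while AUC only depends on how the scores rank the cases. That is why two models can swap places between the two columns, as Random Forest and Gradient Boosted Trees do here.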


In terms of model-crunching time, the training-time chart shows that Gradient Boosted Trees is by far the slowest to calculate in this case. I obtained a similar result for the Titanic data, with the same large difference.
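A training-time comparison of this kind can be sketched with a simple wall-clock timer. This is an illustration in plain Python under stated assumptions: `cheap_model` and `costly_model` are hypothetical stand-ins, not Spark models, and in a real run each would be replaced by the call that fits one of the six classifiers.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def compare_training_times(trainers, data):
    """Time each trainer on the same data; return (name, seconds) sorted fastest first."""
    timings = {}
    for name, train in trainers.items():
        _, timings[name] = time_call(train, data)
    return sorted(timings.items(), key=lambda item: item[1])


# Hypothetical stand-ins for model fitting, for illustration only.
def cheap_model(data):
    return sum(data) / len(data)


def costly_model(data):
    total = 0.0
    for _ in range(200):
        total += sum(x * x for x in data)
    return total


ranking = compare_training_times(
    {"cheap": cheap_model, "costly": costly_model},
    list(range(1000)),
)
for name, seconds in ranking:
    print(f"{name}: {seconds:.6f} s")
```

A single run per model is noisy; repeating each fit a few times and comparing medians would give a steadier picture.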

[Figure: Spark Machine Learning Algorithms training times]

Conclusion

Without going into more detail on the number of calculations each model requires, I think we have a clear overall winner here: Random Forest was the most accurate and close to the top in terms of speed as well. Gradient Boosted Trees came second in accuracy but lags well behind in speed of calculation. Naive Bayes, the fastest model to calculate, comes last in accuracy, while Logistic is somewhere in the middle on both criteria but has the advantage of being the one we can use for prescriptions in real life.


Note: I do not claim that these six ML algorithms gave me the most accurate or the fastest possible model. I suspect that advanced Deep Learning models outclass the ones tested here in terms of accuracy. Nevertheless, classification can also be a matter of compromise between accuracy and real-time speed. I plan to run another experiment on the same data set with Google's TensorFlow framework and publish it in a future article. Python users have even more Deep Learning libraries available, and an article comparing them, or even running the same experiment in R using TensorFlow and H2O with different settings, could get really interesting.