In this series of articles, R tidyverse vs Alteryx Designer, I aim to help proficient users of either R (mainly the tidyverse) or Alteryx Designer understand the benefits and simplicity of working between the two, and the best use of each of these software packages.

The first post offered a broad overview of the two tools, outlined their similarities and main differences, and provided a simple walk-through example in both Alteryx and R.
This second post will put the two tools in a wider context of a data science stack and will highlight opportunities for vertical integration with big data and data visualisation solutions and software.
The third post will look at real life scenarios and why you might choose one over the other.

In the last post I demonstrated how the introduction of the tidyverse packages makes R's data-wrangling process similar to a flow in Alteryx. Both products claim to be a complete solution for data science.

The questions for today are:

  • how complete are they really?
  • how easy is it for an analyst to use them or integrate them with other tools?
  • are they adapted to big data sets and modern algorithms?

No single tool will do everything in data science, so it is always about building a set of tools (a stack) for the complete cycle. Let's take a detailed look at two of the data science challenges:

Challenge 1: Big data

R – some workarounds

R was not designed with big data in mind, mostly because the concept of big data did not exist when R was conceived. Its data frames are stored in memory for fast processing, so it quickly hits the memory limit. I work on a 16 GB RAM laptop and I do not use more than 100,000 observations in classical data frames. It is true that R has evolved over time to accommodate more data: for example, the data.table package allows you to use millions of rows with native R commands, a big leap forward for programmers used to the classical R approach and commands. Nevertheless, even data tables cannot go much beyond a couple of million rows.
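To illustrate the point, here is a minimal data.table sketch on a few million synthetic rows; the size, column names, and grouping are all illustrative, not from the original post:

```r
library(data.table)

# Illustrative: a 2-million-row table, generated in memory
n  <- 2e6
dt <- data.table(group = sample(letters, n, replace = TRUE),
                 value = rnorm(n))

# data.table's dt[i, j, by] form: filter, compute and group in one call
summary_dt <- dt[value > 0, .(mean_value = mean(value), rows = .N), by = group]
head(summary_dt)
```

The `dt[i, j, by]` idiom is what makes data.table feel native to R users while operating far faster than classical data frames at this scale.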

My workaround is to pair R's data-processing commands with external software that holds the data and runs the algorithms. My first choice is Spark: the classical package, SparkR, allows native Spark SQL commands in R, but it requires the data scientist to use Spark syntax.

The tidyverse contains the sparklyr package, which lets the analyst use dplyr syntax on Spark data frames. In other words, you continue using the streamlined R syntax (including the pipes that come with dplyr) but the data resides in, and is processed by, Spark. It is easy to install Spark locally directly from R: just use spark_install(version = "2.2.0") (the current working version in October 2017). If you want to install Spark on your computer for wider use, you can follow the step-by-step instructions from this blog post written by my colleague Colin: Spark vs Tableau TDE. If the data sets are really big, sparklyr lets you leverage the full force of Spark installed on a cluster, in your organisation or in the cloud, using multiple nodes and parallelising the calculations.
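A minimal sketch of the sparklyr workflow, assuming a local Spark installation and the nycflights13 package for demo data (both assumptions on my part, not part of the original text):

```r
library(sparklyr)
library(dplyr)

# One-off local install (version as mentioned above; adjust as needed)
# spark_install(version = "2.2.0")

sc <- spark_connect(master = "local")

# Copy a small demo dataset into Spark; for genuinely big data you would
# read directly into the cluster with spark_read_csv() and friends
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# Plain dplyr verbs: translated to Spark SQL and executed by Spark
delay_summary <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay)) %>%
  collect()   # only now are the (small) results pulled into R

spark_disconnect(sc)
```

Note that until `collect()` is called, nothing leaves Spark; the pipeline is translated to SQL and run where the data lives.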

A similar approach uses the H2O package. It comes with strong ML algorithms, and it handles data in its own native frames, which do not have R's memory limitation even when H2O runs locally.
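A short sketch of the H2O pattern, using the built-in iris dataset purely for illustration:

```r
library(h2o)

# Start (or connect to) a local H2O instance; the data lives in H2O's
# own memory space, not in R's
h2o.init()

# as.h2o() ships an R data frame to H2O; for large files,
# h2o.importFile() reads straight into the cluster, bypassing R entirely
iris_hex <- as.h2o(iris)

# A quick GBM on the H2O frame: the heavy lifting happens inside H2O
model <- h2o.gbm(x = 1:4, y = "Species", training_frame = iris_hex)
h2o.performance(model)
```

As with sparklyr, R acts as the control language while the external engine holds the data and does the computation.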

Alteryx – more ‘native’ options

Alteryx has an elegant way to work outside of its own in-memory data storage: in-database commands that let the source database do the heavy lifting, as databases are usually hosted on robust systems. According to the official Alteryx help:

“In-database processing enables blending and analysis against large sets of data without moving the data out of a database, which can provide significant performance improvements over traditional analysis methods that require data to be moved to a separate environment for processing.”

Now, if the data is not already stored in a database, a practical two-step solution is to build one Alteryx workflow that extracts the data and moves it into a database (for example from large JSON APIs), and then a second Alteryx process that uses the database server's resources to do the wrangling and calculations.

Moreover, Alteryx also connects to Spark and leverages its data storage, its processing, and the parallelisation power of the multi-node Spark architecture. There is also an upcoming full integration with Spark Direct, announced in early 2017, which is meant to make the existing in-database tools work with Spark without any code needed. And if the existing options in that tool set are not enough, Alteryx will release a Spark Code Tool that does exactly what it sounds like: it will let you write native Spark code in Alteryx.

Challenge 2: Share reports internally or externally

R – a whole universe

R is very good at quickly plotting datasets, or summaries of them, on a large variety of charts. The ggplot2 package from the tidyverse streamlines the code, combining naturally with the pipes showcased in the previous post.
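A small sketch of that pipe-into-plot style, using the built-in mtcars dataset as an illustrative example:

```r
library(ggplot2)
library(dplyr)

# Summarise with dplyr pipes, then hand the result straight to ggplot2
p <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean MPG",
       title = "Fuel economy by cylinder count")

print(p)
```

The pipe carries the wrangling steps; within ggplot2 itself, layers are added with `+` rather than `%>%`.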

The easiest way to share one-time R analyses is with Notebooks, a concept supported by the rmarkdown package from the tidyverse. Using various mark-up options, one can export work and comments, including graphics, for data science peers interested in the code, or a notebook with graphics and commentary aimed at executives, using either the RStudio notebook format or HTML pages.
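A minimal .Rmd skeleton, to give a flavour of the format (the title and chunk content are placeholders of my own):

````markdown
---
title: "Sales summary"
output: html_document
---

Commentary for the reader goes here as plain Markdown. Code chunks run
when the document is knitted; set `echo = FALSE` on a chunk to hide the
code for an executive audience while keeping its output.

```{r}
summary(mtcars$mpg)
```
````

Knitting this file (for example via the Knit button in RStudio) produces a self-contained HTML page that mixes the prose, code, and rendered output.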

Using Shiny apps, one can create interactive applications that can be run locally or shared with peers, either on a Shiny server or via Shiny as a service in the cloud, both of which have free and paid versions. For a comparison, check here.
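The smallest useful Shiny app looks something like this sketch (the slider and histogram are illustrative; faithful is a built-in dataset):

```r
library(shiny)

# UI: one interactive input (a slider) and one output slot (a plot)
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: re-renders the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruption times")
  })
}

# shinyApp(ui, server)  # uncomment to launch locally; the same code
#                       # can be deployed unchanged to a Shiny server
```

The same ui/server pair runs on a laptop or a server, which is what makes sharing so straightforward.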

For more complex applications, online dashboards, user-level authentication, etc., we can use RStudio Connect, a server platform that lets one publish Shiny applications or notebooks online, run and email scheduled reports, and have a team collaborate online on the various working versions of a project.

Alteryx – self contained tools

By being a complete software solution and not just a programming language, Alteryx gives its users several advantages in terms of reporting and visualisation.

For interactive apps run locally, the approach is similar to Shiny apps. On the input side, Alteryx uses a set of interface elements that let end users dynamically enter the parameters that guide the analysis. For graphic outputs, the user gets a solid chart set via the Charting tool in the Reporting palette. I use it much like ggplot2, building one graphic at a time; the charts can then be placed side by side with the Layout tool into a more comprehensive dashboard, report, or HTML page using the Render tool.

The whole workflow can be packaged as an analytic app and run on the desktop with an Alteryx Designer licence (free trial here).

For sharing with a team, the solution is Alteryx Server, an enterprise tool that allows apps to be developed collaboratively on a remote server. Interaction with the apps and the data visualisation takes place in a web browser. Users can choose the apps they want from the app gallery, set interactive parameters (such as data source selection, periods, or other filters), and consume the content as an online service.

Note: here at Perception we prefer a tool stack in which we ETL and model using Alteryx, Python, or R, but for visualisations we work with Tableau and Logi. Alteryx exports data frames into TDE files (Tableau Data Extract), and there are workarounds to do the same from R. Nevertheless, I prefer exporting .csv files, for the simple reason that if a TDE is open in Tableau while we update the extract, the overwrite fails; this is not the case with CSV, where the user only needs to refresh the data source in Tableau after the file is rewritten.
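The R side of that handoff is a one-liner; here is a sketch with a hypothetical results frame and file name, using readr from the tidyverse:

```r
library(readr)

# Hypothetical results to hand over to Tableau; in practice this would
# be the output of the wrangling or modelling pipeline
results <- data.frame(region = c("North", "South"), sales = c(120, 95))

# write_csv() produces a plain CSV that Tableau connects to directly;
# after each rewrite, the Tableau user just refreshes the data source
write_csv(results, "sales_summary.csv")
```

Because CSV is a plain file with no lock held by Tableau, the pipeline can overwrite it on every scheduled run without failing.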