In this series of articles, R tidyverse vs Alteryx Designer, I intend to assist proficient users of either R (mainly tidyverse) or Alteryx Designer understand the benefits and simplicity of working between the two and the best usage of each of these software packages.
Although I assume that if you are proficient with your chosen tool you probably don’t need to have it defined – at least read the alternative description carefully:
R & tidyverse
R is an open source programming language and software environment for data science. R core comes with a set of functions for data analysis, prebuilt and that provide the essentials needed for a data scientist. R core can be extended with additional packages that facilitate specific tasks. tidyverse is a set of packages that allow R users to do a complete cycle of analysis. It is based on the idea of rectangular data being organized with variables as columns and observations as lines. They introduce a series of concepts (such as tibbles, pipes, verbs) that allow the user analyse the data using a grammar closer to the human language.
Alteryx Designer is a commercial software advertised as a complete tool for data scientists, that allows the users to quickly delve into data from various sources, perform a complex ETL (extract, transform, load) process, model and visualize the results – all using an intuitive, visual interface. Alteryx Design is only working on Windows platforms, and comes as part of a suite that includes Alteryx Server and Alteryx Connect – more on ecosystems in the second blog post.
By using the R predictive tools extension, Alteryx not only benefits from R based standard analysis tools, but it can also extend its tools portfolio with operations based on R custom code.
R – the programmatic way
R users know their code: they like the granular detail and control they have over their data, they usually see numbers and basic charts as the ultimate tools for their data insights.
R has been around for 40 years so its universe is vast. There are thousands of packages that will facilitate almost every step in data analysis and most of the time a user is offered multiple ways to achieve one operation or result.
tidyverse is a relatively new set of packages (since 2014) that attempts to simplify the most used operations within a data science project. Here is a quick overview of some of its packages relevant to this article:
- Readr, jsonlite, xml2, readxl, httr respectively read rectangular data sets from CSV or tab delimited, XML or JSON, Excel or web APIs. Proprietary formats are also covered with haven (SPSS, Strata, SAS)
- tidyr helps you structure your data in the tidyverse way. Each variable is a column, each observation is a line. This makes it a tibble (using the package tibble) which is a rectangular data frame.
- dplyr helps you enrich the data by adding, calculating, summarising or joining both variables and observations. One important function is pipes (written as %>%) that allows a chain of commands to be applied successfully on a data-set. Alteryx users… does this remind you of anything?
- ggplot2 helps you to quickly plot the data using frequently used charts. Easy to imagine if you think of it as Excel charts on steroids. In the next post we will explore how R handles data visualizations and interactive dashboards.
Alteryx Designer – the visual way
Alteryx Designer has a visual environment in which the user drags ‘process’ icons into the work space and connects them logically, deploying a succession of operations, with both input and output.
The tools (processes) are grouped for easy reference in tabs:
- In/out contains useful tools for reading (and writing back) from CSV, databases, JSON files etc.
- Preparation, Transform, Join groups contains separate processes for each operation: filter, order, select, joins, summarizes (as in Pivot in Excel), transpose, deal with duplicates etc.
- Reporting group contains the object charting that is quite similar to charts in ggplot2.
So, instead of writing a line of code in R you would instead drag and drop objects onto the work space from the ribbon, connect them and then fill in the parameters. When using tidyverse and pipes, the R code and the Alteryx Designer workflow look very similar. i.e. a succession of transformations are applied to a data source until it can be joined to other data-sets, analysing the data or simply “summarising” and plotting it.
Other than that, the user still needs to follow the data scientist logic of shaping the data into the desired form (taking a necessary number of steps and operations), analyse and plot it. In other words, for a similar project, an Alteryx user needs to have the same training and understanding in data manipulation and analysis as an R coder.
R tidyverse vs Alteryx Designer in practice
This is a quick example with data on the Fragile States Index – indicators of risk issued by Fund for Peace at http://fsi.fundforpeace.org . I partially pre-processed the data, merging Excel files with multiple sheets and converting them into a columnar CSV.
For this exercise we are given two CSV files: one with an annual score for all the countries in the world in the period 2005-2015, and the second one with classifications of the countries based on their index rank in 2015. Please find them attached if you want to explore the output in R or Alteryx on your own.
We are asked to split the countries into two groups: countries with an average index higher than the average in the 12 years for all the countries at the same level, and the ones with an average lower than the group average (also show the difference between the country average and the level average).
As I said before, both Alteryx users and R programmers will need to understand the logic of data manipulation required in order to answer the question. The resulted process is a succession of operations that lead to the creation of two separate CSVs.
This is the complete Alteryx workflow and its equivalent code in R using tidyverse (click to enlarge):
For this ETL work I personally move much faster using Alteryx, but I have no problem quickly coding it in R either – since tidyverse makes it so easy and straight forward. The logic of tidyverse and Alteryx follow similar steps, especially if one uses the pipes operators introduced with the package magrittr: %>%, T>% or %<>% for example wisely . They let the R coder create a flow between the data tibbles without saving new objects for each transformation and this is very similar to the flow of Alteryx processes.
For the analysts that come from a SQL background, tidyverse also looks familiar. All the selections, grouping and other manipulations are pretty similar to SQL syntax. Actually the team behind tidyverse use it frequently to work on databases: it is one of the main ways to communicate with Spark with the package Sparklyr.
What about more advanced data science with its exploration, plotting, modelling, clustering etc? These will come under scrutiny in the next posts, on integrations, reasons and how to make better use of the two.