I’m currently a third-year at the University of Virginia, and for my DS 3002 class we were tasked with creating an ETL data processing pipeline. ETL stands for Extract, Transform, and Load, which summarizes exactly what an ETL pipeline is responsible for. Specifically, an ETL pipeline is a series of processes that extract data from a given source, transform it in some manner, and then load the output into some destination.
Although I could have taken the typical route for an ETL pipeline (a script), the first thing the project instructions reminded me of was Shiny apps in R. My computational biology lab had used Shiny multiple times before to create interactive dashboards that did exactly what an ETL pipeline does: extract, transform, load. I found this to be the perfect opportunity to teach myself something new (the Shiny library) and to continue developing my programming skills in R. As an aspiring data scientist, both of these skills are critical: I must be able to learn on my own as well as program in R.
What Is Shiny and How Do You Use It?
Although I am familiar with R, I had never used it to create a fully interactive app before. Therefore, in order to get started, I first had to figure out what a Shiny app is and how it is formatted. I figured it would be useful to provide a brief explanation here.
To create a Shiny app in RStudio, navigate to File -> New File -> Shiny Web App, where it will ask you to select your desired directory and present you with two options: Single File or Multiple File (as seen in the image below). The two are equally effective because they contain the same code and differ only in how it is organized. The ‘Single File’ configuration puts the user interface (UI) and server code in one R file (app.R). The ‘Multiple File’ configuration, on the other hand, puts them in separate files (ui.R and server.R). The ‘Multiple File’ configuration may be a better fit for larger projects, where separate files are easier to navigate, but either can be used at any scale.
Difference Between UI and Server
No matter which configuration is selected, the Shiny app will have a user interface (UI) and a server component. The user interface deals with the appearance, or “look and feel” of a Shiny app. This is what is displayed to the user, and can be considered the “client-side” part of the Shiny app.
The server, on the other hand, is where the ‘actual’ work of the Shiny app is completed. This is where the R code that is provided is evaluated; the output generated is then sent to the user interface to be properly displayed to the user. The references listed at the bottom (specifically the second one) do a great job of explaining and showing what these R files/components look like.
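To make the UI/server split concrete, here is a minimal sketch of a single-file app. It is not the code from my project: the input ID `n`, the output ID `preview`, and the use of R's built-in `mtcars` data are placeholders chosen only for illustration.

```r
library(shiny)

# UI: describes what the user sees (one dropdown and one table placeholder).
ui <- fluidPage(
  titlePanel("Minimal example"),
  selectInput("n", "Rows to show:", choices = c(5, 10, 20)),
  tableOutput("preview")
)

# Server: R code that is re-evaluated whenever an input it reads changes.
server <- function(input, output, session) {
  output$preview <- renderTable({
    # Dropdown values arrive as text, so convert before using them as a count.
    head(mtcars, as.integer(input$n))
  })
}

# Combine both halves into an app object.
app <- shinyApp(ui = ui, server = server)

# Launch only when run interactively, so sourcing this script does not block.
if (interactive()) runApp(app)
```

In the ‘Multiple File’ layout, the `ui` definition would go in ui.R and the `server` function in server.R, with nothing else changing.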
About the Dataset
The dataset was obtained from Kaggle and can be found here. It is important to note the link provided has two datasets available (I used the general CSV file, not the 2021 one).
The dataset comes from the World Happiness Report, which is compiled using the Gallup World Poll. It covers various happiness-indicating metrics such as GDP, social support, life expectancy, freedom, absence of corruption, and generosity. These are all believed to have an impact on life evaluations in each country. Overall, the data aims to explain the contributing factors to personal and national variations in happiness. More information can be found on the Kaggle website provided above.
The data cleaning process for this dataset was very basic. I created a separate R file for the data cleaning to avoid confusion with my Shiny app. I used the dplyr package in R to rename the columns so that they would be more visually appealing in the Shiny app. I also added a new column called “Continent”, which used the “Country” column and the countrycode package to show which continent each observation belonged to. The code for this can be seen below:
After all of this, I removed any rows with NA values. This reduced the dataset from 1949 rows to 1708, which still left plenty of observations for the purposes of the project.
Then, the cleaned dataset was saved as a CSV file in my working directory, and this new CSV file was what was utilized for the Shiny app.
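The cleaning steps described above can be sketched roughly as follows. This is an illustration under assumptions rather than my exact script: the raw column names (`Country name`, `year`, `Life Ladder`), the new names, the made-up values, and the output file name are all placeholders, and a tiny toy data frame stands in for the Kaggle CSV.

```r
library(dplyr)
library(countrycode)

# Toy stand-in for the raw Kaggle file (made-up values, assumed column names).
raw <- data.frame(
  `Country name` = c("Albania", "Turkey", "Brazil"),
  year = c(2009, 2010, 2011),
  `Life Ladder` = c(5.0, 5.5, NA),
  check.names = FALSE
)

cleaned <- raw %>%
  # Rename columns so they display nicely in the app.
  rename(Country = `Country name`, Year = year, Happiness = `Life Ladder`) %>%
  # Look up each country's continent from its English name.
  mutate(Continent = countrycode(Country,
                                 origin = "country.name",
                                 destination = "continent")) %>%
  # Drop every row that contains at least one NA.
  na.omit()

# Save the cleaned data for the Shiny app to read (temp dir used here).
out_path <- file.path(tempdir(), "happiness_clean.csv")
write.csv(cleaned, out_path, row.names = FALSE)
```

Keeping this in its own file means the app only ever reads the already-cleaned CSV and never has to repeat the cleaning at startup.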
Finished Product and Documentation
I deployed the project and hosted it on shinyapps.io. The directions for how to do this can be found here.
To see my final project, you can click here to go to the interactive dashboard. There, you can filter the World Happiness Report by country, continent, and/or year using the dropdown tools available. In order to download the dataframe you are viewing, simply click on the download button. This will give you the filtered data (filtered based on the country, continent, and/or year values selected). In order to download the original data, you can simply keep the default values of “All” for country, continent, and year, then press download.
As the dropdowns are utilized, the row/column count and statistical summaries of the features are displayed. These values change depending on the restrictions applied to the dataset by the user. These summaries are super useful as they give a quick sense of the data and variable distribution without having to go into each variable and calculate the statistics individually. The statistical summaries for each variable include a count for country and continent (since these are qualitative variables). The rest of the variables have the minimum value, 1st quartile value, median, mean, 3rd quartile value, and maximum value (since they are quantitative variables).
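The filtering and summary behavior described above can be approximated outside of Shiny with a small helper. This is a sketch, not the app's actual code: `filter_data` is a hypothetical function name, the data frame holds made-up values, and the column names are assumed. In the app, the same logic would sit inside a reactive expression driven by the dropdowns.

```r
# Toy data with made-up values; column names are assumptions.
df <- data.frame(
  Country = c("Albania", "Albania", "Turkey"),
  Continent = c("Europe", "Europe", "Asia"),
  Year = c(2009, 2010, 2010),
  Happiness = c(5.0, 5.5, 6.0)
)

# Apply each filter only when its dropdown is not left at "All".
filter_data <- function(data, country = "All", continent = "All", year = "All") {
  if (country != "All") data <- data[data$Country == country, ]
  if (continent != "All") data <- data[data$Continent == continent, ]
  if (year != "All") data <- data[data$Year == as.numeric(year), ]
  data
}

subset_2010 <- filter_data(df, year = "2010")

# Row/column count and per-column statistical summary of the current view.
dim(subset_2010)
summary(subset_2010)
```

Because every filter is skipped when its value is “All”, leaving all three dropdowns at their defaults returns the full dataset.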
Similarly, the download button is critical as it allows the user to download the original dataset or a subsetted version of it. This can help the user narrow down what they’re interested in looking at. The download button returns a CSV file that is saved to the user’s local machine.
The name of the file depends on what the user is trying to download. It is formatted as Year-Country-Continent, but not all three parameters are necessary. For example, if the user wanted the 2009 observations for Albania, the CSV file would be called 2009-Albania.csv. If the user wanted the 2010 observations for Turkey in the Asia continent, the CSV file would be called 2010-Turkey-Asia.csv, etc. If all the dropdowns are left at the default value of “All”, meaning no restrictions on the dataset, then the CSV downloads as AllData.csv.
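One way to implement the naming rule above is a small helper that drops any parameter left at “All”. The function name `build_filename` is hypothetical and this is a sketch of the rule as described, not necessarily how the app does it.

```r
# Build "Year-Country-Continent.csv", skipping parts left at "All".
build_filename <- function(year = "All", country = "All", continent = "All") {
  parts <- c(year, country, continent)
  parts <- parts[parts != "All"]
  # No restrictions at all: fall back to the full-dataset name.
  if (length(parts) == 0) {
    return("AllData.csv")
  }
  paste0(paste(parts, collapse = "-"), ".csv")
}

build_filename(year = "2009", country = "Albania")
build_filename(year = "2010", country = "Turkey", continent = "Asia")
build_filename()
```

In a Shiny server, a function like this could be called from the `filename` argument of `downloadHandler()`, with the `content` argument writing the filtered data frame via `write.csv()`.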
As mentioned, the three variables country, continent, and year were selected for the dropdown options. I chose these since they made the most sense to filter by given the nature of the dataset. Typically, a person looking at the data will first be interested in the location of the country or the year the data was collected. The other variables within the dataset were quantitative values that had wide ranges, so using a dropdown for these variables did not seem intuitive.
As you can see, because the data processor is an interactive dashboard, it is easy to use: filtering, summarizing, and downloading the data all happen through the three dropdowns and a single button.
The GitHub Repository can be found here.
Top 3 Things I Learned
- Staying Motivated: It can be extremely difficult to stay motivated when learning anything new. When I first began my project, I was unable to get anything working properly. After finding multiple resources online and watching various videos, I was finally able to get small things working. I stayed persistent and eventually learned enough to finish my project. No matter how experienced a programmer you may be, it’s okay to struggle. Embrace it and don’t give up.
- Testing: You can use the shinytest package to write automated snapshot tests for your app and the shinyloadtest package to check how it performs under load. This is something that I wish I had discovered earlier in the development process.
- Variable Scope: When defining variables, be careful where you’re doing so. I initially had only defined df (my dataset) in the server function and not the UI, which caused an error. A solution to this is creating a global.R file where any R objects created become shared with both the server and the UI. For more information on global.R click here.
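To illustrate the variable scope point, here is a minimal sketch in which the data frame is defined once at the top level so that both the UI and the server can see it. The toy data, the input ID `country`, and the output ID `rows` are placeholders. In a multiple-file app, the `df` assignment is the part that would move into global.R.

```r
library(shiny)

# Top-level object: in a multiple-file app this line would live in global.R.
df <- data.frame(Country = c("Albania", "Turkey"), Year = c(2009, 2010))

# The UI needs df to populate the dropdown choices...
ui <- fluidPage(
  selectInput("country", "Country:", choices = c("All", unique(df$Country))),
  tableOutput("rows")
)

# ...and the server needs the same df to filter it.
server <- function(input, output, session) {
  output$rows <- renderTable({
    if (input$country == "All") df else df[df$Country == input$country, ]
  })
}

app <- shinyApp(ui = ui, server = server)
```

If `df` were created only inside `server`, the `selectInput()` call in the UI would fail because `df` would not exist when the UI is built.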
Below are some resources I found useful in my research of Shiny apps. I highly recommend checking these out and writing your own Shiny app to truly learn how it works. Good luck!