If you’re new to the world of data science and analytics, have some basic knowledge of statistics and R/Python, but aren’t sure how to get started on your first big portfolio project – then this is the guide for you!
In this guide I will cover how to dig into a dataset and uncover insights using various visualization techniques, build a machine learning model, and validate it. I will primarily use R to illustrate a couple of examples, but the approach you need to take is language-independent.
So what are these 5 steps?
STEP 1: Importing, Cleaning, Manipulating and Visualizing your Data
STEP 2: Building your machine learning model
STEP 3: Feature Selection
STEP 4: Applying Transformations to your model
STEP 5: Validating your model
Now that you have a brief idea about the steps involved in your project let’s get started!
Before we can actually dig into a dataset we need a dataset. There are a plethora of datasets available online from multiple sources – some complex, some simple, and some just beyond comprehension for a new data scientist. Below is a list of a few datasets that you can download for free and that I think are great for someone who has just entered the world of data.
- Human Resource Analytics (Kaggle) – https://www.kaggle.com/ludobenistant/hr-analytics
- Credit Card Fraud Detection (Kaggle) – https://www.kaggle.com/dalpozz/creditcardfraud
- Iris Species (UCI Machine Learning) – https://www.kaggle.com/uciml/iris
- World University Ranking (Kaggle) – https://www.kaggle.com/mylesoneill/world-university-rankings
These 4 datasets are a great starting point because most of them are quite clean and not very messy, contain few text elements, and consist mostly of numeric factors. When choosing a dataset to work on, it’s important to pick a topic that you are genuinely interested in and passionate about, because that fosters a curiosity that will lead you to discover the hidden insights in the dataset that are usually as valuable as gold!
STEP 1: Importing, Cleaning, Manipulating and Visualizing your data
Once you have downloaded these files from the respective websites, the first step is to import the data into RStudio or into your Python environment of choice.
For R you need to download R and RStudio
Download Link for R (macOS): https://cran.r-project.org/bin/macosx/
Download Link for RStudio: https://www.rstudio.com/products/rstudio/download/
For Python users, you are going to need Python itself.
Download Link for Python: https://www.python.org/downloads/
The IDE that you use for Python is up to you, but I would recommend Rodeo because it’s very similar to RStudio and well suited to data analytics and predictive modeling.
Download Link for Rodeo: https://www.yhat.com/products/rodeo/downloads
Once you’ve downloaded the required software and followed its easy step-by-step installation procedure, your workstation should look something like this:
The top left corner contains the editor where you will be writing your code while the bottom left is your console where you can execute commands line by line. The top right corner will contain information about your datasets and the bottom right corner will contain information about the packages that you have loaded and the plots that you visualize.
Next we want to import a dataset. There are multiple ways to import a dataset into RStudio and Rodeo using R and Python respectively, and you will want to use different techniques to import different types of datasets such as CSV, Excel, and the like. Below is an example of importing a CSV file into RStudio.
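For Python users, the pandas equivalent is `read_csv()`. Here is a minimal sketch; in practice you would point it at the file you downloaded (e.g. the Kaggle HR file), but an in-memory CSV stands in below so the snippet runs anywhere:

```python
import io
import pandas as pd

# In practice: hr = pd.read_csv("path/to/your_downloaded_file.csv")
# An in-memory CSV stands in for the downloaded file here.
csv_text = """satisfaction_level,salary,left
0.38,low,1
0.80,medium,0
0.11,low,1
"""
hr = pd.read_csv(io.StringIO(csv_text))

print(hr.shape)               # (3, 3): three rows, three columns
print(hr.columns.tolist())    # column names parsed from the header row
```

`read_csv()` infers column types from the data, so a quick `hr.dtypes` check right after importing is a good habit.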
Once we have the dataset the next thing we want to do is clean it for any obvious faults that it may have. Some of the most common faults in any dataset are:
- Invalid Column names
- Two types of data under one column
- Row data stored as column variables
- Column names stored as row data
- Single observational unit (Ex: data related to people only) stored in two different data frames
- Multiple observational units (Ex: data related to people and aliens) stored in the same data frame
Below I have identified that my column names are invalid, and I proceed to clean them using the code shown below:
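The exact fix depends on your data, but for Python users a common pandas recipe for invalid column names looks like this (the messy frame below is a made-up example):

```python
import pandas as pd

# Hypothetical messy frame: names contain spaces and mixed case,
# which makes them awkward to reference in code.
df = pd.DataFrame({"Satisfaction Level": [0.38, 0.80],
                   "Average Monthly Hours": [157, 262]})

# Normalize the names: strip whitespace, lower-case everything,
# and replace spaces with underscores.
df.columns = (df.columns.str.strip()
                        .str.lower()
                        .str.replace(" ", "_"))

print(df.columns.tolist())
```

After this, every column can be referenced as `df.satisfaction_level` style attributes, which keeps the rest of your code clean.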
Next you want to use the dplyr package in R to manipulate your dataset and extract valuable insights from the data under study. If you’re a Python user you would work with the pandas package. Drawing insights using the two packages mentioned above will give you an idea about the key aspects of your data, such as which factors actually influence your variable of interest. They also give you an idea about what you need to visualize. Manipulating data using these tools is a course in itself and would probably require another detailed blog post, so let’s move on to the next step.
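As a small taste of what this looks like on the pandas side, here is a sketch of a group-and-summarize insight, analogous to dplyr’s `group_by()` + `summarise()` (the tiny frame below is invented for illustration):

```python
import pandas as pd

# Toy stand-in for the HR data: did employees who left (left = 1)
# report lower satisfaction than those who stayed (left = 0)?
hr = pd.DataFrame({"left":         [1, 1, 0, 0, 0],
                   "satisfaction": [0.20, 0.15, 0.85, 0.90, 0.70]})

# Group by the outcome and summarize satisfaction per group
summary = hr.groupby("left")["satisfaction"].mean()
print(summary)
```

Even a one-liner like this points you toward what to visualize next – here, the gap in satisfaction between leavers and stayers.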
Once you know what to look for, thanks to your awesome manipulation skills, we proceed to data visualization. For some, this is the most exciting part of being a data scientist, and for good reason – this is the stage where you get to be an artist.
In R you would want to go for the ggplot2 package, while in Python you would pick matplotlib. Below is an example of a visualization I ran to figure out who would default on a loan based on his/her account balance.
The 0 indicates that the person has not defaulted, while the 1 indicates that the person has defaulted on their loan. The count indicates the number of people in each category. This kind of visualization is vital in any predictive modeling project.
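If you are on the Python side, a comparable count plot can be sketched with pandas and matplotlib (the toy loan data below is made up):

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy outcome data: 0 = did not default, 1 = defaulted
loans = pd.DataFrame({"default": [0, 0, 0, 1, 0, 1, 0]})

# Count how many people fall into each category, then bar-plot it
counts = loans["default"].value_counts().sort_index()
counts.plot(kind="bar")
plt.xlabel("default (0 = no, 1 = yes)")
plt.ylabel("count")

print(counts.to_dict())
```

A class-count plot like this also tells you early on whether your classes are imbalanced, which matters when you train and evaluate a model later.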
STEP 2: Building your machine learning model!
This step requires a clear understanding of how the various kinds of machine learning models work, so that you can make an informed choice about which model to pick for the given problem. There are a large number of machine learning algorithms out there, such as linear regression, logistic regression, random forests, and the like.
For a person just setting foot in the world of data, you would want to use the caret package in R, as it’s very simple to use. scikit-learn is what you would want if you’re a Python user. Below is a snippet of the code that I’ve used to build a logistic regression model using the caret package in R.
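For Python users, a roughly equivalent logistic regression can be built with scikit-learn; the sketch below uses scikit-learn’s built-in breast-cancer dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Built-in binary-classification data stands in for your own dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Raise max_iter so the solver converges on unscaled features
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```

Note the train/test split: scoring on held-out data gives you an honest first estimate of performance, which Step 5 will refine with proper validation.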
STEP 3: Feature Selection
Feature selection is an important element in any predictive analytics project, as you want to determine the features that affect the decision variable the most! If you have 20+ factors that contribute to or affect your decision variable, you would want to reduce them to 8 or 10 key variables and build your model around those. Below is the Recursive Feature Elimination (RFE) that I applied using the caret package in R to select the features that affect my decision variable the most:
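scikit-learn offers the same idea through its `RFE` class; here is a sketch, again using the built-in breast-cancer data as a stand-in for your own dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)   # 30 candidate features

# Recursively refit the model, dropping the weakest feature each
# round, until only the 8 strongest features remain.
selector = RFE(LogisticRegression(max_iter=5000),
               n_features_to_select=8)
selector.fit(X, y)

print("features kept:", selector.support_.sum())
```

After fitting, `selector.support_` is a boolean mask over the columns, so you can slice your data down to just the selected features with `X[:, selector.support_]`.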
STEP 4: Applying Transformations to your model
There are a multitude of transformations out there. Transformations fundamentally change certain aspects of your dataset to give you a better prediction and a better-fitting model. Sometimes transformations can lead to overfitting, which is something you must avoid. Transformations can be applied on a trial-and-error basis, and you can see the results by comparing how your model’s accuracy changes.
Below is a quick summary of all of the transform methods supported in the method argument of the preProcess() function in caret.
- "BoxCox": apply a Box-Cox transform; values must be non-zero and positive.
- "YeoJohnson": apply a Yeo-Johnson transform; like Box-Cox, but values can be negative.
- "expoTrans": apply a power transform like Box-Cox and Yeo-Johnson.
- "zv": remove attributes with zero variance (all the same value).
- "nzv": remove attributes with near-zero variance (close to the same value).
- "center": subtract the mean from values.
- "scale": divide values by the standard deviation.
- "range": normalize values.
- "pca": transform data to the principal components.
- "ica": transform data to the independent components.
- "spatialSign": project data onto a unit circle.
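For Python users, several of these transforms have direct counterparts in scikit-learn’s preprocessing tools; here is a sketch on a tiny made-up matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

# "center" + "scale": zero mean, unit variance per column
X_std = StandardScaler().fit_transform(X)

# "range": rescale each column to [0, 1]
X_rng = MinMaxScaler().fit_transform(X)

# "YeoJohnson": power transform that, unlike Box-Cox, tolerates
# zero and negative values
X_yj = PowerTransformer(method="yeo-johnson").fit_transform(X)

# "pca": project the data onto its principal components
X_pca = PCA(n_components=1).fit_transform(X)

print(X_std.mean(axis=0))   # per-column means are ~0 after centering
```

As with caret’s `preProcess()`, fit each transform on the training data only and then apply it to the test data, so no information leaks into your evaluation.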
STEP 5: Validating your model
Once you’ve built a couple of models using different machine learning algorithms, or built models with the same algorithm but different transformations, you will want to compare them all and see which model works best for your given problem. Fortunately for us, we have a metric that lets us do just this, and it’s called ROC (Receiver Operating Characteristic). Models with a higher average ROC are better. Below is a snippet of how I have used the caretEnsemble package to compare the ROC between two models:
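A comparable model comparison in Python can be sketched with cross-validated ROC AUC scores from scikit-learn (again using built-in data as a stand-in for your own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Cross-validated ROC AUC for two candidate models; the one with
# the higher mean AUC wins, much like comparing resamples in caret.
auc_lr = cross_val_score(LogisticRegression(max_iter=5000), X, y,
                         cv=5, scoring="roc_auc")
auc_rf = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=5, scoring="roc_auc")

print("logistic regression mean AUC:", auc_lr.mean())
print("random forest mean AUC:", auc_rf.mean())
```

Because each model is scored on the same cross-validation folds, you can compare both the mean AUC and the spread of the fold scores, just like the averages and variances in a caretEnsemble comparison plot.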
We can clearly see that model 2 is better than model 1, because it has a higher average ROC (indicated by the black dot) along with lower variance in its data.
If you’ve come this far into this guide, I’m sure you can get started on your very first predictive analytics project, which you can showcase in your portfolio. I hope this guide helps anyone who’s passionate about breaking into the world of data get their first project out for the world to see.
Be sure to summarize the results of your work, and always use LaTeX or R Markdown to write up your work for potential employers to see.
Publishing the code you used on GitHub is another way to gain the interest of employers.