BLOG

How do you import data into Python?

Python is growing in popularity thanks to the large number of packages that cater to the needs of data scientists. Importing data into Python is therefore the starting point for any data science project that you will undertake. This guide gives you a comprehensive introduction to the world of importing data into Python.

There are a number of file formats available that can serve as sources of structured and unstructured data.

The various sources of structured data are:

  • .CSV files
  • .TXT files
  • Excel files
  • SAS and STATA files
  • HDF5 files
  • Matlab files

The various sources of unstructured data are:

  • Data from the web in the form of HTML pages

This guide will teach you the fundamentals of importing data from all these sources straight into your Python workspace of choice with minimal effort. So let’s get coding!

1) CSV files

CSV files usually contain mixed data types, and it’s best to import them as a data frame using the pandas package in Python. You can do this with the code snippet shown below:
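
A minimal sketch of such a snippet, assuming the file is called 'my_data.csv' (a hypothetical name):

import pandas as pd

filename = 'my_data.csv'      # hypothetical file name
data = pd.read_csv(filename)  # read the CSV into a DataFrame
print(data.head())            # first 5 rows along with the column names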

We first import the pandas package and then store the name of the file of interest in a variable called ‘filename’. We then use the pd.read_csv() function to read the file into Python and save the result in a variable called ‘data’. The data.head() call displays the first 5 rows along with the column names of the dataset.

2) TXT files

The next type of file that we might encounter on our quest to becoming a master data scientist is the .TXT file. Importing these files into Python is as easy as importing a CSV file and can be done with the code snippet shown below:
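
A minimal sketch, assuming a plain text file called 'file.txt' in your working directory:

with open('file.txt', 'r') as myfile:  # 'r' opens the file as read-only
    contents = myfile.read()           # read the entire file into a string
print(contents)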

The line of code above that uses ‘with’ is called a context manager in Python. The open() function opens the file ‘file.txt’ as a read-only document using the argument ‘r’. We then read the file with myfile.read() and print the result. If you want to edit the .txt file that you just imported, use the ‘w’ argument with the open() function instead of the ‘r’ argument.

3) Excel Files

Excel files are a huge part of any business operation, so it is imperative that you learn exactly how to import them into Python for data analysis. In order to do this we can use the code snippet shown below:
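
A minimal sketch, with a hypothetical workbook name:

import pandas as pd

file = 'my_workbook.xlsx'         # hypothetical file name
xl = pd.ExcelFile(file)           # load the workbook
print(xl.sheet_names)             # the sheet names present in the workbook
df = xl.parse(xl.sheet_names[0])  # first sheet as a DataFrame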

In the above code we first imported pandas. We then stored the Excel file name in a variable called ‘file’, after which we imported the file into Python using the pd.ExcelFile() function. Using the ‘.sheet_names’ attribute we printed out the sheet names present in the Excel file. We then extracted the contents of the first sheet as a data frame using the ‘.parse()’ function.

4) SAS and STATA files

Statistical analysis software is widespread in the business analytics space and deserves due attention. Let’s take a closer look at how we can get these files into Python for further analysis.

Importing SAS files requires the sas7bdat package, while importing STATA files requires only the pandas package.
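
A minimal sketch of both imports; the file names are hypothetical:

from sas7bdat import SAS7BDAT
import pandas as pd

# SAS file via the sas7bdat package
with SAS7BDAT('my_data.sas7bdat') as f:
    df_sas = f.to_data_frame()

# STATA file via pandas alone
df_stata = pd.read_stata('my_data.dta')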

5) HDF5

The HDF5 file format stands for Hierarchical Data Format version 5. HDF5 is very popular for storing large quantities of numerical data, which can span from a few gigabytes to exabytes, and it is widely used in the scientific community for storing experimental data. Fortunately for us we can import these files quite easily into Python using the code snippet shown below:
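
A minimal sketch, assuming a file called 'experiment.hdf5' (a hypothetical name):

import h5py

filename = 'experiment.hdf5'
data = h5py.File(filename, 'r')  # 'r' = read-only, 'w' = write
for key in data.keys():
    print(key)                   # the top-level groups stored in the file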

In the above code snippet the package that we are using to import the HDF5 file is the h5py package. The h5py.File() function can be used to open the file in both read-only (‘r’) and write (‘w’) modes.

6) Matlab files

Matlab files are used quite extensively by electronic and computer engineers for designing various electrical and electronic systems. Matlab is built around linear algebra and can store a lot of numerical data that we could use for analysis.  In order to import a matlab file we can use the code snippet illustrated below:
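
A minimal sketch, with a hypothetical .mat file name:

import scipy.io

mat = scipy.io.loadmat('my_model.mat')  # returns a dictionary of key:value pairs
print(mat.keys())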

Matlab files can be imported using the scipy.io module and the scipy.io.loadmat() function that comes with it. When we import Matlab files into Python we import them as a dictionary containing key:value pairs of your data from Matlab.

7) Data from the web

Data from the web is usually unstructured, with no order or linearity to it. However, we can find structured data on some websites like Kaggle and the UCI Machine Learning Repository. Such files can be downloaded directly into Python from the web using the code snippet below:
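
A minimal sketch; the URL below is hypothetical:

from urllib.request import urlretrieve
import pandas as pd

url = 'https://example.com/some_data.csv'  # hypothetical URL
urlretrieve(url, 'some_data.csv')          # download and save the file locally
df = pd.read_csv('some_data.csv')          # read it into a DataFrame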

In the code above we have used the urlretrieve function from urllib.request in order to download a CSV file from my website. We then read it into a data frame locally using the pandas package.

In order to import HTML pages into Python we can make use of the ‘requests’ package and the couple of lines of code shown below:
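
A minimal sketch, using a hypothetical URL:

import requests

url = 'https://example.com'  # hypothetical URL
r = requests.get(url)        # ask the server for the webpage
text = r.text                # the page's HTML as a string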

The requests.get() function sends a request to the server for the webpage, while the response’s .text attribute gives you the page’s HTML as a string.

Most of the time, data from webpages doesn’t really make a lot of sense on its own. It’s usually jumbled text mixed with a lot of markup that does not resonate well with anybody. In order to make sense of the data that we import from the web we have to make use of the BeautifulSoup package available for Python.
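
A minimal sketch that combines requests with BeautifulSoup; the URL is hypothetical:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://example.com')      # hypothetical URL
soup = BeautifulSoup(r.text, 'html.parser')  # parse the raw HTML
print(soup.prettify())                       # nicely indented HTML
print(soup.title)                            # the page's <title> tag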

The .prettify() function displays your HTML in a structured, indented fashion, while the .title attribute gives you the title of your HTML page. For more information about the various functions and the in-depth documentation of the BeautifulSoup package please visit: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Now that you have a pretty good idea of how to import data into Python, you can finally start your next big Hackathon or Kaggle competition! Be sure to keep exploring all kinds of data and all the packages available, via the Python documentation pages found on the web. There’s no end to the knowledge you can acquire.

Happy coding!

 

 

 

How do you import data into RStudio?

Importing data into RStudio is the starting point for data analysis and machine learning. This guide will provide you with the tools required to import all types of files into RStudio, with code-based examples.

Data can exist in multiple formats. Some of the most common formats in which you can find your data are listed below:

  • Flat Files – .CSV
  • Tab Delimited Files – .TXT
  • Excel Files – .XLSX
  • Files from the Web
  • SAS files
  • STATA files
  • SPSS files

This article will cover how you can import data into RStudio from all these formats with simple easy to understand code snippets that I have run on RStudio.

Let’s start with flat files, commonly known as .csv files. CSV stands for Comma-Separated Values. This kind of file is commonly found in Kaggle datasets, and hence it’s very useful to know how to import these if you plan on becoming a Kaggle Grandmaster.

My files are usually stored in the “Downloads” folder on my Mac. It’s important to know where your file is stored on your computer. The code snippet below shows you how you can import your csv file once you know the location of your file:
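
A minimal sketch with base R; the file name and path are hypothetical:

# Base R: read a CSV from the Downloads folder
df <- read.csv("~/Downloads/my_data.csv", stringsAsFactors = FALSE)
head(df)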

Note how we have an argument called stringsAsFactors. This argument will convert all the strings in your dataset to factors if set to TRUE.

The next way you can import csv files into RStudio is by using the readr package. This is illustrated below:
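
A minimal sketch using readr, with the same hypothetical file:

library(readr)

df <- read_csv("~/Downloads/my_data.csv")  # returns a tibble; strings stay as characters
head(df)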

Note: readr never converts strings to factors (there is no stringsAsFactors argument in read_csv()), so if you need factor columns you will have to convert them yourself after importing.

Another way you could import csv files into RStudio is by using the data.table package as shown below:
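
A minimal sketch with data.table's fread(); the column names passed to select are hypothetical:

library(data.table)

dt <- fread("~/Downloads/my_data.csv",
            select = c("col1", "col2"))  # keep only these columns (use drop = ... for the opposite)
head(dt)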

The neat aspect about the fread() function from data.table is that you can select specific variables of interest, or drop the ones you don’t need, while importing.

The next type of file that we want to import is the tab-delimited file, or the .txt type of file. In tab-delimited files the contents of the file are separated by tabs. We can import a file of this type by following the code snippet shown below:
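
A minimal sketch, assuming a tab-delimited file called my_data.txt:

df <- read.delim("~/Downloads/my_data.txt", header = TRUE)

# or, equivalently, with read.table and an explicit separator
df <- read.table("~/Downloads/my_data.txt", sep = "\t", header = TRUE)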

One of the most common types of files that you might encounter on your data science journey is the Excel file. Excel files are particularly common in the fields of business analytics and business intelligence. Luckily there is a package in R, known as readxl, that lets you import these Excel files. In order to import an Excel file using the readxl package we can use code similar to that displayed below:
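
A minimal sketch with readxl; the workbook name is hypothetical:

library(readxl)

df <- read_excel("~/Downloads/my_workbook.xlsx",
                 sheet = 1,         # which sheet to read (by position or name)
                 col_names = TRUE,  # the first row holds the column names
                 skip = 0)          # number of rows to skip at the top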

In the above code the sheet argument is used to specify which sheet of the Excel file you would like to import. The col_names = TRUE argument imports the column names into your data frame, while the skip argument specifies how many rows of the Excel sheet you wish to skip; for example, if skip = 5, read_excel() would skip the first 5 rows of the sheet.

The next type of data that we come across is everyone’s favorite: SQL, or Structured Query Language, databases. SQL databases are plentiful and are used widely by many businesses, simply because they allow you to store large volumes of relational data in separate tables without taking up a lot of space. These tables can be linked to one another by a specific column, and queries can be made to extract the data that we want.

The code snippet below illustrates how we can use R to connect to a SQL database:
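
A sketch using the DBI interface with the RMySQL driver; the database name and credentials are hypothetical:

library(DBI)

con <- dbConnect(RMySQL::MySQL(),
                 dbname   = "company",    # hypothetical database
                 host     = "localhost",
                 user     = "student",
                 password = "secret")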

Once we connect to an SQL database, the next step is to list the tables present within the database, import a table of interest, and write queries to extract the data we want:
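
Continuing with the connection from above ("employees" is a hypothetical table):

dbListTables(con)                           # tables available in the database
employees <- dbReadTable(con, "employees")  # import a table of interest
result <- dbGetQuery(con, "SELECT name, salary FROM employees WHERE salary > 50000")
dbDisconnect(con)                           # close the connection when done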

On occasion we might come across flat files on the web that are hosted on Amazon S3 servers. Some of these files might be located on websites without a download link. In the event that we want to import data from the web we can use the code snippet shown below:
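
A minimal sketch; the URL is hypothetical:

url <- "https://s3.amazonaws.com/example-bucket/my_data.csv"
df <- read.csv(url)  # flat files can be read straight from their URL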

R also enables us to download files straight from the web using the download.file() function as shown below: 
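
A minimal sketch of downloading the file first and importing it afterwards (same hypothetical URL):

url <- "https://s3.amazonaws.com/example-bucket/my_data.csv"
download.file(url, destfile = "my_data.csv")  # save a local copy
df <- read.csv("my_data.csv")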

The last kinds of files that are very common in the world of business and analytics are SAS, STATA and SPSS files. These formats come from statistical packages that are powerful and widely used by firms for analytics. Knowing how to import these files is a critical asset in the toolbelt of any data scientist.

You can import SPSS and STATA files using the foreign package in R, while SAS files are more easily handled by the haven package, as shown below:
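
A sketch of all three imports; the file names are hypothetical:

library(foreign)

spss_df  <- read.spss("survey.sav", to.data.frame = TRUE)  # SPSS
stata_df <- read.dta("survey.dta")                         # STATA

library(haven)
sas_df <- read_sas("survey.sas7bdat")                      # SAS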

All of the importing functions that R houses have a wide array of arguments that can be used and manipulated to fit your requirements. In order to look up the various arguments that come with an importing function, you just need to type the code snippet shown below:

?read.csv

RStudio then displays the help page for read.csv() in the bottom-right pane.

In conclusion, one should never neglect how crucial it is to import your data in the right way. Sometimes you simply get stuck, unable to import datasets into your platform of choice for analysis. Hopefully this guide has covered everything you need to know about importing data into R.

Happy coding!

 

 

 

 

 

 

Getting started with text mining in R – a complete guide

Text based data is all around us – we find text on blogs, reviews, articles, social media, social networks like LinkedIn, e-mails and in surveys. Therefore it is critical that companies and firms use this data to their advantage and gain valuable insights. This article provides you with a comprehensive guide that will help you get started with text mining using R.

Before heading into the technical details that encompasses the world of text mining, let’s try and understand what your workflow should look like when it comes to text mining.

The package that makes text mining possible in R is the qdap package. qdap (Quantitative Discourse Analysis Package) is an R package designed to assist in quantitative discourse analysis. The package stands as a bridge between qualitative transcripts of dialogue and statistical analysis and visualization. Below I will showcase the techniques and tools that you can utilize for effective text mining using the qdap package.

The qdap package in R offers a wide array of text mining tools. Assume we have a paragraph of text and we want to count the most frequently used words in that text. We can use the qdap package as shown below:
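
A minimal sketch with qdap's freq_terms(); the text itself is made up for illustration:

library(qdap)

text <- "Data science is fun and data science is powerful because data drives decisions"
frequent_terms <- freq_terms(text, top = 5)  # the 5 most frequent words
plot(frequent_terms)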

 

One of the most important parts of text mining is cleaning your messy text data. The “tm” package, which works hand in hand with the “qdap” package in R, lets you do just that. The tm package essentially allows R to interpret text elements in vectors or data frames as documents. It does this by first converting a text element into a source object using the VectorSource() or DataframeSource() functions, and then converting these source objects into a corpus. Corpora can be manipulated and cleaned to our requirements. Let me illustrate how R does this with an example.

Let’s consider the following dataset from Netflix:

We are going to isolate the ratingLevel column and use it for text mining.
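
A sketch of building the corpus, assuming the Netflix data has been imported into a data frame called netflix (a hypothetical name):

library(tm)

rating_text   <- netflix$ratingLevel        # the column we want to mine
rating_source <- VectorSource(rating_text)  # interpret the vector as a source of documents
rating_corpus <- VCorpus(rating_source)     # convert the source into a corpus
rating_corpus[[1]]                          # inspect the first document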

Once you have your corpus ready you can then proceed to pre-process your text data using the tm_map() function. Let’s illustrate how you can use the tm_map() function with an example:
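
Continuing from the corpus built above, a sketch of lower-casing every document:

# content_transformer() wraps a base R function so tm_map() can apply it to each document
clean_corpus <- tm_map(rating_corpus, content_transformer(tolower))
content(clean_corpus[[1]])  # the first document, now in lower case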

As you can see in the code executed above, applying tolower through the tm_map() function has made all the words lower case.

The various pre-processing functions that you can use with tm_map() are given below:

  • removePunctuation() – Removes all punctuation like periods and exclamation marks.
  • removeNumbers() – removes all numeric values from your text
  • removeWords() – remove words like “is”, “and” that are defined by you
  • stripWhitespace() – collapses multiple spaces and tabs in your text into a single space

Word stemming is another pre-processing technique, used to reduce related words to a common stem. Assume you have three words: “complicate”, “complicated” and “complication”. If you apply the stemDocument() function to these words, they are all reduced to the common stem “complic”. This is illustrated by the code snippet shown below:
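
A minimal sketch of word stemming with the tm package:

library(tm)

stemDocument(c("complicate", "complicated", "complication"))
# all three words are reduced to the stem "complic"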

The qdap package also offers other powerful cleaning functions such as:

  • bracketX() – This will remove all text in brackets – “Data (Science)” becomes “Data”
  • replace_number() – 10 becomes “ten”
  • replace_abbreviation() – “Dr” becomes “Doctor”
  • replace_symbol() – “%” becomes “percent”
  • replace_contraction() – “don’t” becomes “do not”

Sometimes you would want to remove very common words in text such as “is”, “and”, “to”, “the” and the like. You might also want to remove words that you think might not have any significant impact on your analysis. For example, if you downloaded a dataset titled “World Bank” it might be useful to remove the words “World” and “Bank”, as they are likely to be repeated many times with no significant impact. You can implement this in R using the stopwords() function together with removeWords(), as shown below:
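
A sketch of removing stop words plus a custom word; the sentence and the added word "parental" are illustrative:

library(tm)

words_gone <- "Parental guidance is suggested for this show"
words_gone_forever <- removeWords(tolower(words_gone),
                                  c(stopwords("en"), "parental"))  # standard stop words plus our own
words_gone_forever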

Notice how the word “Parental”, which appears in the words_gone text, has disappeared from words_gone_forever because we added it to the list of words to remove. stopwords(“en”) contains a list of stop words such as “not” and “be” that were also eliminated from the words_gone_forever result.

There are two types of matrices that can tell you how many times a particular term occurs in a piece of text: the Term Document Matrix (TDM) and the Document Term Matrix (DTM). The structure and code needed to produce these two matrices are illustrated below:
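
A sketch of both matrices, built from the cleaned corpus created earlier:

library(tm)

rating_tdm <- TermDocumentMatrix(clean_corpus)  # terms in rows, documents in columns
rating_dtm <- DocumentTermMatrix(clean_corpus)  # documents in rows, terms in columns

tdm_m <- as.matrix(rating_tdm)  # convert to ordinary matrices to inspect them
dtm_m <- as.matrix(rating_dtm)
dim(tdm_m)
dim(dtm_m)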

In the TermDocumentMatrix we can see how the words are listed along the rows while each of the Ratings is in the columns.

In the DocumentTermMatrix we can see how the words are listed in the columns while the Ratings of shows are listed in the rows.

Now that you know how you can clean your text based data and pre-process it to your requirements we need tools to visualize the clean text data so that we can display our insights to the board, CEO, manager or your audience of interest. There are many tools that can be used to visualize your text based data.

The first visualization tool that you would want to use is the bar plot. We can use a bar plot to show the most frequent words that occur in our text-based data, as shown below:
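
A sketch of a frequency bar plot, using the term-document matrix from the previous step:

term_frequency <- rowSums(as.matrix(rating_tdm))          # total count of each term
term_frequency <- sort(term_frequency, decreasing = TRUE)
barplot(term_frequency[1:10], col = "tan", las = 2)       # the 10 most frequent terms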

The next visualization tool is the word cloud. Word clouds are super useful because they instantly show how frequently a word appears, or how significant it is, through its size in the cloud. We can implement word clouds in R using the code shown below.
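
A sketch using the wordcloud package and the term frequencies computed above:

library(wordcloud)

wordcloud(words = names(term_frequency),
          freq  = term_frequency,
          max.words = 50,   # show at most 50 words
          colors = "blue")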

The neat thing about word clouds is that you can use them to compare the words between two different texts, or to find the common words between them. Word clouds can be created using the wordcloud package in R.

 

Another useful tool for visualizing text is the word network. Word networks show you the relationship of a particular word with other words. Take a look at an example of a word network below:
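
A sketch of a word network with qdap's word_associate(), again using the Netflix ratingLevel text and the word "parental" purely as an illustration:

library(qdap)

word_associate(netflix$ratingLevel,
               match.string = "parental",  # the word whose neighbours we want to see
               network.plot = TRUE)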

With the tools you learnt above you are now ready to tackle your first text mining dataset. The world of text mining is huge, and there is a vast number of concepts and tools still left for you to explore.

Never stop learning and happy text mining!

Building an end to end predictive analytics project in 5 easy steps!

If you’re new to the world of data science and analytics, you have some basic knowledge of statistics and R/Python, but you’re not sure how to get started on your first big project that you can add to your portfolio, then this is the guide for you!

In this guide I will cover how you can dig into a dataset and uncover insights using various visualization techniques, along with building your machine learning model and validating it. I will primarily be using R to illustrate a couple of examples, but the approach that you need to take is language independent.

So what are these 5 steps?

STEP 1: Importing, Cleaning, Manipulating and Visualizing your Data

STEP 2: Building your machine learning model

STEP 3: Feature Selection

STEP 4: Applying Transformations to your model

STEP 5: Validating your model

Now that you have a brief idea about the steps involved in your project let’s get started!

Before we can actually dig into a dataset we need a dataset. There are a plethora of datasets available online from multiple sources – some complex, some simple, and some just beyond comprehension for the new data scientist. Below is a list of a few datasets that you can download for free that I think are great for someone who has just entered the world of data.

  1. Human Resource Analytics (Kaggle) – https://www.kaggle.com/ludobenistant/hr-analytics 
  2. Credit Card Fraud Detection (Kaggle) – https://www.kaggle.com/dalpozz/creditcardfraud
  3. Iris Species (UCI Machine Learning) – https://www.kaggle.com/uciml/iris
  4. World University Ranking (Kaggle) – https://www.kaggle.com/mylesoneill/world-university-rankings

These 4 datasets are a great starting point because most of them are quite clean and not very messy, contain few text elements and are mostly numeric. When choosing a dataset to work on, it’s important that you pick a topic that you are genuinely interested in and passionate about, because that fosters a curiosity that will lead you to discover the hidden insights in the dataset, which are usually as valuable as gold!

STEP 1: Importing, Cleaning, Manipulating and Visualizing your data

Once you have downloaded these files from the respective websites, the first step is to import the data into RStudio or into your Python Workstation of choice.

For R you need to download R and RStudio

Download Link for R:  MAC: https://cran.r-project.org/bin/macosx/

Windows: https://cran.r-project.org/bin/windows/base/

Download Link for RStudio: https://www.rstudio.com/products/rstudio/download/

For Python users you are going to need

Download Link for Python: https://www.python.org/downloads/

The IDE that you are going to be using for Python depends on you, but I would recommend Rodeo because it’s very similar to RStudio and it’s well suited to data analytics and predictive modeling.

Download Link for Rodeo: https://www.yhat.com/products/rodeo/downloads

Once you’ve downloaded the required software and followed its easy step-by-step installation procedure, your workstation should look something like this:

The top left corner contains the editor where you will be writing your code while the bottom left is your console where you can execute commands line by line. The top right corner will contain information about your datasets and the bottom right corner will contain information about the packages that you have loaded and the plots that you visualize.

Next we want to import a dataset. There are multiple ways to import a dataset into RStudio and Rodeo using R and Python respectively, and you will want to use different techniques to import different types of datasets like CSV, Excel and the like. Below is an example of importing a CSV file into RStudio:
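
A minimal sketch, assuming the Human Resource Analytics CSV has been saved in the Downloads folder (the file name is hypothetical):

hr_data <- read.csv("~/Downloads/hr_analytics.csv", stringsAsFactors = FALSE)
head(hr_data)
str(hr_data)  # column names and types at a glance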

Once we have the dataset the next thing we want to do is clean it for any obvious faults that it may have. Some of the most common faults in any dataset are:

  • Invalid Column names
  • Two types of data under one column
  • Row data stored as column variables
  • Column names stored as row data
  • A single observational unit (e.g. data related to people only) stored in two different data frames
  • Multiple observational units (e.g. data related to people and aliens) stored in the same data frame

Below I have identified that my column names are invalid, and I proceed to clean them using the code shown below:
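
A sketch of one common clean-up, applied to the hypothetical hr_data frame from above; the rename shown here is purely illustrative:

names(hr_data)                                              # inspect the current column names
names(hr_data) <- make.names(names(hr_data))                # make the names syntactically valid
names(hr_data)[names(hr_data) == "sales"] <- "department"   # rename a misleading column (hypothetical)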

Next you want to use the dplyr package in R to manipulate your dataset and get valuable insights from the dataset under study. If you’re a Python user you would want to work with the pandas package. Drawing insights using the two packages mentioned above will give you an idea about the key aspects of your data, such as which factors actually have an influence on your variable of interest. They also give you an idea about what you need to visualize. Manipulating data using these tools is a course in itself and would probably require another blog post in detail. So let’s get onto the next step.

Once you know what to look for, thanks to your awesome manipulation skills, we proceed to data visualization. For some, this is the most enjoyable part of being a data scientist, and for good reason: you get to portray yourself as an artist at this stage.

In R you would want to go for the ggplot2 package, while in Python you would want to pick the Matplotlib package. Below is an example of a visualization that I ran to figure out who would default on a loan based on his/her account balance:
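
A sketch of the kind of plot described below, assuming a data frame loans with a 0/1 default column and a numeric balance column (hypothetical names):

library(ggplot2)

ggplot(loans, aes(x = balance, fill = factor(default))) +
  geom_histogram(bins = 30) +
  labs(x = "Account balance", y = "Count", fill = "Default")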

 

The 0 indicates that the person has not defaulted while the 1 indicates that a person has defaulted on their loan. The count indicates the number of people that are inclusive in each category. This kind of visualization is vital in any predictive modeling project.

STEP 2:  Building your machine learning model!

This step requires a clear understanding of how the various kinds of machine learning models work, so that you can make an informed choice about which model to pick for the given problem. There are a large number of machine learning algorithms out there, such as linear regression, logistic regression, random forests and the like.

For a person setting his/her foot into the world of data, you would want to use the caret package in R as it’s very simple to implement. scikit-learn is what you would want to use if you’re a Python user. Below is a snippet of the code that I’ve used to build a logistic regression model using the caret package in R.
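
A sketch of such a model, reusing the hypothetical loans data frame from the visualization above:

library(caret)

set.seed(42)
model <- train(factor(default) ~ balance + income,  # income is another hypothetical predictor
               data = loans,
               method = "glm",
               family = "binomial",                 # logistic regression
               trControl = trainControl(method = "cv", number = 5))
model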

 

STEP 3: Feature Selection

Feature selection is an important element in any predictive analytics project, as you want to determine the features that affect the decision variable the most. If you have 20+ factors that contribute to or affect your decision variable, you would want to reduce that to 10 or 8 key variables and build your model around those. Below is the Recursive Feature Elimination that I have applied using the caret package in R to select the features that affect my decision variable the most:
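
A sketch of Recursive Feature Elimination with caret; the predictors and outcome below are hypothetical (rfFuncs uses random forests under the hood):

library(caret)

set.seed(42)
control <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
results <- rfe(x = loans[, c("balance", "income", "age")],
               y = factor(loans$default),
               sizes = 1:3,         # try subsets of 1, 2 and 3 features
               rfeControl = control)
predictors(results)                 # the selected features
plot(results)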

STEP 4: Applying Transforms to your model

There are a multitude of transformations out there. Transformations fundamentally change a certain aspect of your dataset to give you a better prediction and a better-fitting model. Sometimes transformations can lead to overfitting your model, which is something you must avoid. Transformations can be applied on a trial-and-error basis, and you can see the results by comparing how your model’s accuracy improves.

Below is a quick summary of all of the transform methods supported in the method argument of the preProcess() function in caret; a usage sketch follows the list.

  • “BoxCox”: apply a Box–Cox transform; values must be non-zero and positive.
  • “YeoJohnson”: apply a Yeo–Johnson transform, like a Box–Cox, but values can be negative.
  • “expoTrans”: apply a power transform like Box–Cox and Yeo–Johnson.
  • “zv”: remove attributes with zero variance (all the same value).
  • “nzv”: remove attributes with near-zero variance (close to the same value).
  • “center”: subtract the mean from values.
  • “scale”: divide values by the standard deviation.
  • “range”: normalize values to a fixed range.
  • “pca”: transform data to the principal components.
  • “ica”: transform data to the independent components.
  • “spatialSign”: project data onto a unit circle.
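
A sketch of applying a few of these transforms through the preProcess argument of train(), again with the hypothetical loans data:

library(caret)

set.seed(42)
model_t <- train(factor(default) ~ balance + income,
                 data = loans,
                 method = "glm",
                 family = "binomial",
                 preProcess = c("center", "scale", "BoxCox"),  # transforms applied before fitting
                 trControl = trainControl(method = "cv", number = 5))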

 

STEP 5: Validating your model

Once you’ve built a couple of models using different machine learning algorithms, or you’ve built models with the same algorithm but with different transformations, you will want to compare all your models and see which model works best for your given problem. Fortunately for us we have a metric that lets us do just this, and it’s called ROC (Receiver Operating Characteristic), specifically the area under the ROC curve. Models with a higher average ROC are better. Below is a snippet of how I have used the caretEnsemble package to compare the ROC between two models:
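
A sketch of such a comparison; it assumes the outcome in loans has been recoded as a factor default_status with levels "no" and "yes" (class probabilities require valid factor level names):

library(caret)
library(caretEnsemble)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = "final")

model_list <- caretList(default_status ~ balance + income, data = loans,
                        trControl = ctrl,
                        metric = "ROC",
                        methodList = c("glm", "rpart"))  # two different algorithms to compare

results <- resamples(model_list)
summary(results)                  # average ROC for each model
dotplot(results, metric = "ROC")  # a visual comparison like the plot described below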

We can clearly see that model 2 is better than model 1, because it has a higher average ROC, indicated by the black dot, along with a lower variance in its data.

That’s it!

If you’ve come this far into this guide, I’m sure you can get started with your very first predictive analytics project, which you can showcase in your portfolio. I hope this guide helps anyone who’s passionate about breaking into the world of data get their first project out for the world to see.

Be sure to summarise the results of your work, and always use LaTeX or R Markdown to report your work for potential employers to see.

Using GitHub to publish the code that you have used is also another way to gain the interest of employers.

Happy Coding!

How do you choose the right laptop for Data Science?

Data Analysis, Machine Learning model training and the like require some serious processing power. If you’re someone who’s just entered the world of data or if you’re a veteran data scientist that needs an upgrade on his/her local machine this post will provide you with the comprehensive guide that is necessary to make the right choice when it comes to buying a machine that is capable of handling your data-sets.

When it comes to choosing the right machine, you usually have to choose between two factors:

  1. Portability
  2. Processing Power

The higher the processing power, the heavier the laptop gets, and hence its portability is reduced, and vice versa. The next thing to note is that with higher power the battery life also shrinks, and as a result you lose out on portability yet again. Huge datasets these days have outgrown the processing power of a single machine and will require you to access the cloud for processing, in which case portability is going to be of value to you.

With that said, let’s identify the minimum requirements that you would require when it comes to a laptop worthy of being called a data scientist’s weapon of choice.

RAM

The minimum RAM that you would require on your machine is 8 GB. However, 16 GB of RAM is recommended for faster processing of neural networks and other heavy machine learning algorithms, as it would significantly speed up computation time. Personally, 8 gigs of RAM works just fine if you build your algorithms efficiently and you can put your machine on sleep mode while it takes its time to compute.

GPU

I cannot stress enough the importance of an NVIDIA GPU when it comes to choosing your machine. This is because most deep learning libraries (Theano, Torch, TensorFlow) use CUDA, which runs only on NVIDIA GPUs. If you want to use a machine that is powered by an AMD or Intel HD GPU, you need to be prepared to write a lot of low-level code in OpenCL. With that being said, you can opt for the NVIDIA 960 series and above.

PROCESSOR 

Once you have the RAM and GPU in check, the processor should come right along with the machine you are selecting. However, for the purpose of this guide, an Intel i5 (7th generation) would be the minimum requirement, while an i7 (7th generation) would be the ideal recommendation.

STORAGE

SSDs make your machine incredibly fast. However, getting a machine with a good amount of SSD would burn a hole in your wallet. Keeping this in mind, 1 TB of Hard Disk would be the minimum requirement as data sets tend to only get bigger by the day. If you’re opting to go for a machine with an SSD, ensure that there is 256 GB of SSD storage available on the machine. You might have to purchase an external HD in the case of the latter.

With the minimum requirements out of the way, let’s find out what the best laptops are in today’s market, both in terms of portability and processing power.

OPERATING SYSTEM

As a developer you always want to go with Linux. Luckily most Mac or Windows machines can run Linux either as a virtual machine or at startup using software like Boot Camp on the Mac. Additionally, Parallels is software that you can use to run two operating systems side by side on your machine.

 

THE BEST MACHINE BUILT FOR PORTABILITY

 

Apple MacBook Pro – £1399/$1429/INR 139,000

 

The MacBook Pro is an incredible device for data analysis that is light and has an exceptionally good battery life of 7 hours. It comes with a 2.5 GHz quad-core Intel i7 processor, along with 16 gigs of RAM and an NVIDIA 760M GPU. It has a beautiful 15-inch display as well. The device comes well equipped with 512 GB of storage.

Link to buy the machine in the USA: https://goo.gl/ep0Tdu

Link to buy the machine in the UK: https://goo.gl/i8klpK

Link to buy the machine in India:  https://goo.gl/UqfCJu

THE BEST MACHINE BUILT FOR PROCESSING POWER

MSI gl62 – £849/$939/INR 108,002

The MSI is a pure beast when it comes to processing power because it comes with 16 gigs of RAM, an NVIDIA 960M and an Intel i7. Apart from these it also has 256 gigs of SSD and a 1 TB hard disk. The only downside is that it’s relatively heavy to lug around, weighing in at a little over 5.2 pounds. The battery life is not the greatest, with only 4 hours when running normal applications; this reduces to 2 hours when you run data-intensive applications and programs, which means that you will always need a power cord at hand. The build quality is good, however you are not going to get the premium feel that comes with a MacBook. Overall, it’s a powerful machine.

Link to buy the machine in the US: https://goo.gl/uDXHTu

Link to buy the machine in the UK: https://goo.gl/vqs58M

Link to buy the machine in India: https://goo.gl/TfSkQ9

THE BEST MACHINE BUILT FOR WORKING ON THE CLOUD

If you’re looking for a cheap machine, or amazing portability plus battery life, but still want to run neural networks, there’s a solution: work on the cloud. Amazon AWS EC2 is a virtual machine service that lets you run any operating system you want and modify it to your preference and requirements. You can set up a web-based IDE for R (RStudio) which essentially runs on a remote machine that is powerful enough to run your algorithms while you work from your own computer. All processing is done on Amazon’s servers, so all you need is an internet connection. Amazon AWS EC2 comes with a year of free trial, after which you pay according to your RAM/processing power requirements.

The only downside is that it takes some time to learn how to set up and configure the AWS and you need an internet connection at all times to work on your datasets. Barring this, it’s an exceptional way to buy any system of your liking and configure it for the AWS.

In this regard, the MacBook Air is an excellent machine. Windows machines that cost as little as $250-350 can also be configured for AWS.

In conclusion, buying a machine for data science can be a daunting task, but this guide should have made things easier for you, and you now know what to look for. Below is an infographic that shows you the ideal specifications for a laptop built for data science applications.

 

 

How do you get back to Data Science after a long break?

You’ve spent months and months cleaning and manipulating data, built extensive and comprehensive machine learning models, explored the most complex data sets you could get your hands on, and now you want a break from the life of data. Breaks are almost always good for you. They breathe new life into you and serve to generate new ideas once you’re back. The downside? You almost always have a hard time getting back into the state of flow you were in back when you analysed one new data set every other day.

Recently I was on a break from the world of data science thanks to extensive commitments that required a lot of my attention. The aftermath of these commitments was a brief vacation of a week. This put me away from RStudio (my platform of choice) for over two weeks. When I opened RStudio again to analyse a data set for work, the task was daunting to say the least. To counter this feeling I did a couple of things to get myself prepped up to get my hands dirty with data again. The following tips are sure to get you revved up and enthusiastic about your data after a long break!

1. Do a free online course

Online courses are a great way to get your mind back in the game. They have the obvious benefit of updating your knowledge as well as setting you up to get you back into the world of data. If you’d love to brush up on your programming skills in R and Python as well as other data science skills such as Machine Learning, Data Visualization, Data Cleaning or Statistics – DataCamp is a good place to start! They offer a number of free introductory courses. The first chapter of every course is free. This means you could decide to buy their monthly subscription service if you really like what they have to offer. Here’s a link to DataCamp: https://www.datacamp.com/

Another option is to audit free courses on Coursera. They have a wide array of data science based courses that are sure to get your spirits up about data. Take “A Crash Course in Data Science” by Johns Hopkins University. It’s a quick and easy course that serves as a great refresher for anyone coming back from a long break. The link to the course is here: https://www.coursera.org/learn/data-science-course

Sometimes learning something really deep and complex is the right key to getting you excited about data science. Udacity has a free course in Deep Learning that will take you from the fundamentals of deep learning to making your very own live camera application. Give it a shot! The link to the course is: https://www.udacity.com/course/deep-learning–ud730

2. Take part in a Kaggle competition

No, I am not asking you to take part in the next 100K USD prize money whopper of a contest. Take part in a contest that has an intermediate or easy level of difficulty. These are highlighted by the blue and green lines next to the competition.

These competitions are relatively straightforward and easy to do. The time investment is low, and seeing yourself rank high with a good score is sure to get your blood pumping!

3. Watch a few TED Talks about data

TED talks are a great way to watch some of the greatest minds in the world of data talk about the significance and role that data plays in the 21st century. They serve as a great source of inspiration and will motivate you into taking action yourself. Some of the best TED talks that have inspired me are listed below:

The best stats you’ve ever seen – Hans Rosling

Hans Rosling was a legendary statistician. Watching how he visualizes his data in his video is sure to get you installing ggplot2 or matplotlib.

Philip Evans: How data will transform Business 

This talk by Philip Evans will help shed light on the role you play as a data scientist in shaping businesses today.

Conclusion

Life is always hard after a long vacation or a break. The truth of the matter is that we can’t do without them. No matter how much we love data, we always need a break away from it to rejuvenate our minds. The process of taking a break is important to avoid burning out. Breaks also serve to fuel idea generation. Below is an infographic that you could share with your employees and co-workers if they ever need to recover and get back to data after a long break!

 

Machine Learning

How do you classify your Machine Learning Problem? 

– Kevin Jolly (Predicting the future using data is what I do in order to afford the next Fallout game) 

Machine learning is a vast topic, and it’s one of the most essential tool-sets in the tool-belt of a data scientist. However deep a machine learning problem takes you in your analysis of a large data set, the fundamental application is to create predictions with good accuracy and minimal error.

This being said, many people get confused about what a machine learning problem is. Calculating the mean or standard deviation using R or Python is NOT a machine learning problem. Note: the key word here is prediction. Without prediction, a machine learning problem loses its meaning.

At its core, we are teaching the machine (the CPU) how to predict future outcomes based on past historical data.

There are 3 types of machine learning problems:

  1.  Classification
  2. Regression
  3. Clustering

Classification:

In the classification type of machine learning problem, we want to classify data into a particular category. For example, classifying whether or not an incoming email belongs in the spam folder is a problem of this type. Classification problems form the largest pool of machine learning problems out there, and learning how to implement solutions to such problems is essential when it comes to improving your skills as a data scientist.

Classification types of machine learning problems commonly make use of decision trees and neural networks.

Those of you who use R can also use the predict() function, with the appropriate model and data in the parentheses, in order to predict the outcome for a classification of your choice. Just make sure to set type = "class" as an argument.
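
A minimal sketch using a classification tree from the rpart package on the built-in iris data:

library(rpart)

tree_model <- rpart(Species ~ ., data = iris, method = "class")
predict(tree_model, newdata = iris[1:5, ], type = "class")  # predicted class labels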

The performance measures for classification are: Accuracy and Error.

Regression:

Everybody loves regression, because you’re the fortune teller who predicts the future, or the stock market guru who predicts which stocks will make your client the next million. These predictions are based on trends that you can observe in your dataset. Fundamentally you have two variables in a regression: the response and the explanatory variable. The response variable is usually plotted on the Y-axis while the explanatory variable is plotted on the X-axis, and using the plot() function in R you can hopefully observe a positive or a negative trend in your data. Examples of regression machine learning problems include predicting the future value of a stock, or predicting the performance of an athlete based on his weight, height or other parameters.

Again, in order to implement a regression machine learning problem in R you can make use of the predict() function, but this time the arguments are going to be the model created by the lm() function and the unseen data you want to make a prediction on.

The performance measure of your regression model is: Root Mean Squared Error (RMSE)
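
A minimal sketch using the built-in mtcars data: fit a model with lm(), predict with predict(), and compute the RMSE:

lm_model <- lm(mpg ~ wt + hp, data = mtcars)
predictions <- predict(lm_model, newdata = mtcars)  # here we predict on the training data itself
rmse <- sqrt(mean((mtcars$mpg - predictions)^2))    # Root Mean Squared Error
rmse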

Clustering: 

Clustering is a bit more unsupervised. This means that you don’t need labelled outcomes or trends in your data, nor are you going to be heavily dependent on neural networks. Using the kmeans() function in R, you take your data set and divide it into 1, 2, ..., n clusters based on the function’s second argument. The clusters are, more or less, groups of data having similar traits and properties, using which we can carry out further analysis.

One way to measure how similar the data within a cluster is, is to calculate the diameter of the cluster. If the diameter is small, you can be pretty sure that the data within the cluster are quite similar. Again, appropriate visualization tools can be used to look at your clusters. Calculating the distance between two clusters is another way of finding out whether the two clusters are dissimilar from each other: the larger the distance between two clusters, the higher the dissimilarity, which is a good thing.

The performance measure for clustering is the Dunn index = minimal inter-cluster distance / maximal cluster diameter.
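
A minimal sketch of k-means on the built-in iris measurements, followed by the Dunn index from the clValid package:

set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3)  # divide the data into 3 clusters
km$size                                 # observations per cluster
table(km$cluster, iris$Species)         # clusters versus the known species

library(clValid)
dunn(clusters = km$cluster, Data = iris[, 1:4])  # Dunn index: inter-cluster distance / max diameter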