
How do you import data into RStudio?

Importing data into RStudio is the starting point for data analysis and machine learning. This guide provides the tools you need to import all the common file types into RStudio, with code-based examples.

Data can exist in multiple formats. Some of the most common formats in which you will find your data are listed below:

  • Flat Files – .CSV
  • Tab Delimited Files – .TXT
  • Excel Files – .XLSX
  • Files from the Web
  • SAS files
  • Stata files
  • SPSS files

This article will cover how you can import data into RStudio from all of these formats, with simple, easy-to-understand code snippets that I have run in RStudio.

Let’s start with flat files, commonly known as .csv files. CSV stands for Comma-Separated Values. This file type is commonly found in Kaggle datasets, and hence it’s very useful to know how to import these if you plan on becoming a Kaggle Grandmaster.

My files are usually stored in the “Downloads” folder on my Mac. It’s important to know where your file is stored on your computer. The code snippet below shows you how you can import your csv file once you know the location of your file:
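A minimal sketch using base R’s read.csv(); the file name data.csv and the Downloads path are placeholders, so substitute your own:

# base R: import a csv file from a known location
df <- read.csv("~/Downloads/data.csv", stringsAsFactors = TRUE)
head(df)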

Note how we have an argument called stringsAsFactors. This argument will convert all the strings in your dataset to factors if set to TRUE.

The next way you can import csv files into RStudio is by using the readr package. This is illustrated below:
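A sketch using readr, again with the hypothetical data.csv path:

# readr: a fast, tidyverse-friendly csv reader
library(readr)
df <- read_csv("~/Downloads/data.csv")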

Note: Unlike read.csv(), readr’s read_csv() never converts strings to factors – it has no stringsAsFactors argument – so if you want factors you will have to convert the relevant columns yourself after importing, for example with as.factor().

Another way you could import csv files into RStudio is by using the data.table package as shown below:
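A sketch using data.table’s fread(); the age and salary columns are hypothetical stand-ins for your variables of interest:

# data.table: fread() is very fast and auto-detects the separator
library(data.table)
df <- fread("~/Downloads/data.csv", select = c("age", "salary"))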

The neat aspect of the fread() function is that you can select or drop specific variables of interest (via its select and drop arguments) while leaving out the rest.

The next type of file that we want to import is the tab-delimited file, or the .txt type of file. In tab-delimited files the contents of the file are separated by tabs. We can import a file of this type by following the code snippet shown below:
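A minimal sketch, assuming a hypothetical data.txt in the Downloads folder:

# base R: read.delim() expects tab-separated values by default
df <- read.delim("~/Downloads/data.txt", header = TRUE)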

One of the most common types of files that you might encounter on your data science journey is the Excel file. Excel files are particularly common in the fields of business analytics and business intelligence. Luckily there is a package in R, known as readxl, that lets you import these Excel files. In order to import an Excel file using the readxl package, we follow code similar to that displayed below:
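A sketch assuming a hypothetical report.xlsx whose data sits on the first sheet:

# readxl: import an Excel sheet into a data frame
library(readxl)
df <- read_excel("~/Downloads/report.xlsx", sheet = 1, col_names = TRUE, skip = 0)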

In the above code the sheet argument specifies which sheet of the Excel file you would like to import (by position or by name). The col_names = TRUE argument imports the column names into your data frame, while the skip argument specifies how many rows of the Excel sheet to skip; for example, if skip = 5, read_excel() skips the first 5 rows of the sheet.

The next type of data that we come across is everyone’s favorite – SQL, or Structured Query Language, based databases. SQL databases are plentiful and widely used by businesses, simply because they let you store large volumes of relational data in separate tables without taking up a lot of space. These tables can be linked to one another by a specific column, and queries can be made to extract the data that we want.

The code snippet below illustrates how we can use R to connect to a SQL database:
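A minimal sketch using the DBI and RMySQL packages; the database name, host and credentials below are all placeholders:

# open a connection to a remote MySQL database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "company_db",    # placeholder database
                 host = "db.example.com",  # placeholder host
                 user = "analyst",         # placeholder user
                 password = "secret")      # placeholder password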

Once we connect to an SQL database, the next step is to list the tables present within it and import a table of interest, or write queries that extract exactly the data we want:
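Continuing the sketch above (the customers table and the query are hypothetical):

dbListTables(con)                           # list all tables in the database
customers <- dbReadTable(con, "customers")  # import an entire table
recent <- dbGetQuery(con,
  "SELECT * FROM customers WHERE year >= 2017")  # or query just what we need
dbDisconnect(con)                           # close the connection when done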

On occasion we might come across flat files on the web that are hosted on Amazon S3 servers. Some of these files might be located on websites without a download link. In the event that we want to import data from the web, we can use the code snippet shown below:
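The base import functions accept URLs directly; the address below is a placeholder:

# read a csv file straight from the web
df <- read.csv("https://example.com/data.csv")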

R also enables us to download files straight from the web using the download.file() function as shown below: 
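A sketch with the same placeholder URL:

# download the file to disk first, then import it
download.file("https://example.com/data.csv", destfile = "~/Downloads/data.csv")
df <- read.csv("~/Downloads/data.csv")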

The last kinds of files that are very common in the world of business and analytics are SAS, Stata and SPSS files. These formats come from statistical packages that are both powerful and widely used by firms for serious analytics, so knowing how to import them is a critical asset in the toolbelt of any data scientist.

You can import SAS, Stata and SPSS files using the foreign package in R, as shown below:
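A sketch with placeholder file names; foreign reads SPSS .sav and Stata .dta files directly, and SAS files in the XPORT (.xpt) format:

library(foreign)
spss_data  <- read.spss("~/Downloads/survey.sav", to.data.frame = TRUE)
stata_data <- read.dta("~/Downloads/panel.dta")
sas_data   <- read.xport("~/Downloads/trial.xpt")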

All the importing functions that R houses have a wide array of arguments that can be manipulated to your requirements. To look up the various arguments that come with an importing function, you just need to type the code snippet shown below:

?read.csv

RStudio then displays the function’s help page in the bottom-right pane.

In conclusion, one should never neglect how crucial it is to import your data in the right way; sometimes you simply get stuck, unable to bring datasets into your platform of choice for analysis. Hopefully this guide has covered everything you need to know about importing data into R.

Happy coding!


Getting started with text mining in R – a complete guide

Text based data is all around us – we find text on blogs, reviews, articles, social media, social networks like LinkedIn, e-mails and in surveys. Therefore it is critical that companies and firms use this data to their advantage and gain valuable insights. This article provides you with a comprehensive guide that will help you get started with text mining using R.

Before heading into the technical details that encompass the world of text mining, let’s try and understand what your workflow should look like when it comes to text mining.

The package that makes text mining possible in R is the qdap package. QDAP (Quantitative Discourse Analysis Package) is an R package designed to assist in quantitative discourse analysis. The package stands as a bridge between qualitative transcripts of dialogue and statistical analysis & visualization. Below I will showcase the techniques and tools that you can utilize for effective text mining using the qdap package.

The qdap package in R offers us a wide array of text mining tools. Assume we have a paragraph of text and we want to count the most frequently used words in that text – we can use the qdap package as shown below:
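A minimal sketch using qdap’s freq_terms() on a made-up sentence:

library(qdap)
text <- "Text mining in R is fun and text mining in R is useful"
# the five most frequently used words in the text
freq_terms(text, top = 5)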


One of the most important parts of text mining is cleaning your messy text data. The tm package, which comes along with the qdap package in R, lets you do just that. The tm package essentially allows R to interpret text elements in vectors or data frames as documents. It does this by first converting a text element into a source object using the VectorSource() or DataframeSource() functions, and then converting these source objects into a corpus. Corpora can be manipulated and cleaned to our requirements. Let me illustrate how R does this with an example.

Let’s consider the following dataset from Netflix:

We are going to isolate the ratingLevel column and use it for text mining.
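A sketch, assuming the dataset has been imported into a data frame called netflix with a ratingLevel column:

library(tm)
# turn the ratingLevel column into a source, then into a corpus
rating_source <- VectorSource(netflix$ratingLevel)
rating_corpus <- VCorpus(rating_source)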

Once you have your corpus ready you can then proceed to pre-process your text data using the tm_map() function. Let’s illustrate how you can use the tm_map() function with an example:
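A sketch; note that in current versions of tm, base R functions such as tolower have to be wrapped in content_transformer():

# make every document in the corpus lowercase
rating_corpus <- tm_map(rating_corpus, content_transformer(tolower))
content(rating_corpus[[1]])  # inspect the first cleaned document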

As you can see in the code above, passing tolower to the tm_map() function has made all the words lowercase.

The various pre-processing functions that you can use with tm_map() are given below; a combined sketch follows the list:

  • removePunctuation() – removes all punctuation like periods and exclamation marks
  • removeNumbers() – removes all numeric values from your text
  • removeWords() – removes words like “is” and “and” that are defined by you
  • stripWhitespace() – collapses repeated tabs and spaces into a single space
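A combined sketch applying these transformations to the corpus; the custom words passed to removeWords() are just examples:

rating_corpus <- tm_map(rating_corpus, removePunctuation)
rating_corpus <- tm_map(rating_corpus, removeNumbers)
rating_corpus <- tm_map(rating_corpus, removeWords, c("is", "and"))
rating_corpus <- tm_map(rating_corpus, stripWhitespace)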

Word stemming is another pre-processing technique, used to reduce words to their common root. Assume you have 4 words – “complicate”, “complicated”, “complication” and “complicates”. If you apply the stemDocument() function to these 4 words, you extract “complic” as the common stem between them. This is illustrated by the code snippet shown below:
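A sketch using tm’s stemDocument(), which applies the Porter stemming algorithm:

library(tm)
words <- c("complicate", "complicated", "complication", "complicates")
stemDocument(words)  # each word is reduced to the common stem "complic"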

The qdap package also offers other powerful cleaning functions, demonstrated in the sketch after this list:

  • bracketX() – This will remove all text in brackets – “Data (Science)” becomes “Data”
  • replace_number() – 10 becomes “ten”
  • replace_abbreviation() – “Dr.” becomes “Doctor”
  • replace_symbol() – “%” becomes “percent”
  • replace_contraction() – “don’t” becomes “do not”
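A quick sketch of these qdap helpers on small examples:

library(qdap)
bracketX("Data (Science)")           # removes the bracketed text
replace_number("I have 10 cats")     # writes the number out in words
replace_abbreviation("Dr. Smith")    # expands the abbreviation
replace_symbol("100% sure")          # replaces the symbol with a word
replace_contraction("I don't know")  # expands the contraction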

Sometimes you will want to remove very common words in text, such as “is”, “and”, “to”, “the” and the like. You might also want to remove words that you think might not have any significant impact on your analysis. For example, if you downloaded a dataset titled “World Bank”, it might be useful to remove the words “World” and “Bank”, as they are likely to be repeated many times with no significant impact. You can implement this in R using the stopwords() function as shown below:
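A sketch reconstructing the example discussed below, using tm’s stopwords() and removeWords() on the netflix ratings text:

library(tm)
# the standard English stop words, plus a custom word of our own
words_gone <- c("Parental", stopwords("en"))
# remove them all from the text
words_gone_forever <- removeWords(netflix$ratingLevel, words_gone)
head(words_gone_forever)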

Notice how the word “Parental” that was added to the words_gone vector has disappeared from words_gone_forever. stopwords(“en”) contains a list of stop words such as “not” and “be” that were also eliminated in the words_gone_forever vector.

There are two types of matrices that can tell you how many times a particular term occurs in a piece of text: the Term Document Matrix (TDM) and the Document Term Matrix (DTM). The structure and code needed to produce these 2 types of matrices are illustrated below:
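A sketch building both matrices from the cleaned corpus created earlier:

# TDM: terms in the rows, documents (ratings) in the columns
rating_tdm <- TermDocumentMatrix(rating_corpus)
# DTM: documents (ratings) in the rows, terms in the columns
rating_dtm <- DocumentTermMatrix(rating_corpus)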

In the TermDocumentMatrix we can see how the words are listed along the rows while each of the Ratings is in the columns.

In the DocumentTermMatrix we can see how the words are listed in the columns while the Ratings of shows are listed in the rows.

Now that you know how to clean your text based data and pre-process it to your requirements, you need tools to visualize the clean text so that you can display your insights to the board, the CEO, a manager or your audience of interest. There are many tools that can be used to visualize your text based data.

The first visualization tool that you would want to use is the bar plot. We can use the bar plot to plot the most frequent words that occur in our text based data as shown below
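A sketch that totals term counts from the TDM built earlier and plots the ten most frequent words:

# sum each term's count across all documents
term_freq <- rowSums(as.matrix(rating_tdm))
term_freq <- sort(term_freq, decreasing = TRUE)
barplot(term_freq[1:10], col = "tan", las = 2)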

The next visualization tool is the word cloud. Word clouds are super useful because they instantly showcase how frequently a word appears, signalling its significance through its size in the cloud. We can implement word clouds in R using the code shown below.
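A sketch using the wordcloud package on the term frequencies computed above:

library(wordcloud)
# word size reflects frequency; cap the cloud at 50 words
wordcloud(names(term_freq), term_freq, max.words = 50, colors = "blue")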

The neat thing about word clouds is that you can use them to compare the words between two different sets of text, or to find the common words between two different sets of text. Word clouds can be created using the wordcloud package in R.


Another useful tool for data visualization of text is word networks. Word networks show you the relationship of a particular word with another word. Take a look at an example of a word network below:
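A sketch using qdap’s word_network_plot(), which draws co-occurring words as a network; the first ten ratings are an arbitrary sample:

library(qdap)
word_network_plot(netflix$ratingLevel[1:10])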

With the tools you learnt above you are now ready to tackle your first text mining dataset. The world of text mining is huge, and there is a vast number of concepts and tools still left for you to explore.

Never stop learning and happy text mining!

Machine Learning

How do you classify your Machine Learning Problem? 

– Kevin Jolly (Predicting the future using data is what I do in order to afford the next Fallout game) 

Machine learning is a vast topic, and it’s one of the most essential tool-sets in the tool-belt of a data scientist. However deep a machine learning problem takes you into the analysis of a large data set, the fundamental application is to create predictions with good accuracy and minimal error.

This being said, many people do get confused about what a machine learning problem is. Calculating the mean or standard deviation using R or Python is NOT a machine learning problem. Note: the key word here is prediction. Without prediction, a machine learning problem loses its meaning.

At its core, we are teaching the machine (the CPU) how to predict future outcomes based on past historical data.

There are 3 types of machine learning problems:

  1. Classification
  2. Regression
  3. Clustering

Classification:

In the classification type of machine learning problem, we want to classify data into a particular category. For example: classifying whether an incoming mail belongs to the spam folder or not is a problem of this type. Classification problems form the largest pool of machine learning problems out there, and learning how to implement solutions to them is essential to improving your skills as a data scientist.

Classification problems make use of techniques such as Decision Trees and Neural Networks.

Those of you who use R can also use the predict() function, with the appropriate model and data as arguments, in order to predict the outcome based on a classifier of your choice. Just make sure to set type = “class” as an argument.

The performance measures for classification are: Accuracy and Error.
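A minimal sketch using the rpart package (one common decision-tree implementation) and the built-in iris data set:

library(rpart)
# fit a decision tree that predicts the species of an iris flower
model <- rpart(Species ~ ., data = iris)
# type = "class" returns the predicted class labels
predictions <- predict(model, newdata = iris, type = "class")
# accuracy: the proportion of correctly classified observations
mean(predictions == iris$Species)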

Regression:

Everybody loves regression, because you’re the fortune teller who predicts the future or the stock market guru who predicts which stocks will make your client the next million. These predictions are based on trends that you can observe in your data-set. Fundamentally you have two variables in a regression: the response and the explanatory variable. The response variable is usually plotted on the Y-axis while the explanatory variable is plotted on the X-axis, and using the plot() function in R you can hopefully observe a positive or a negative trend in your data. Examples of regression machine learning problems include predicting the future value of a stock, or predicting the performance of an athlete based on weight, height or other parameters.

Again, in order to implement a regression machine learning problem in R you can make use of the predict() function, but this time the arguments are a model created with the lm() function and the unseen data you want to make a prediction on.

The performance measure of your regression model is: Root Mean Squared Error (RMSE)
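A minimal sketch using the built-in cars data set:

# fit a linear model: stopping distance as a function of speed
model <- lm(dist ~ speed, data = cars)
# predict stopping distances for unseen speeds
predict(model, newdata = data.frame(speed = c(10, 15, 20)))
# RMSE: the root of the mean squared residuals
sqrt(mean(residuals(model)^2))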

Clustering: 

Clustering is unsupervised. This means that you don’t need labelled outcomes or visible trends in your data, nor are you going to be heavily dependent on neural networks. Using the kmeans() function in R, you take your data-set and divide it into 1, 2, …, n clusters based on the second argument of the function. The clusters are groups of data having similar traits and properties, using which we can carry out further analysis.
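A minimal sketch using kmeans() on the numeric columns of iris:

set.seed(42)  # kmeans starts from random centers, so fix the seed
# divide the measurements into 3 clusters (centers is the second argument)
clusters <- kmeans(iris[, 1:4], centers = 3)
clusters$cluster   # the cluster assignment for each observation
clusters$withinss  # within-cluster sum of squares: a measure of tightness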

One way to measure how similar the data within a cluster is, is to calculate the diameter of the cluster. If the diameter is small, you can be fairly sure that the data within the cluster is similar. Again, appropriate visualization tools can be used to look at your clusters. Calculating the distance between two clusters is another way of finding out whether the two clusters are dissimilar to each other: the larger the distance between two clusters, the higher the dissimilarity, which is a good thing.

The performance measure for clustering is the Dunn index: minimal inter-cluster distance divided by maximal cluster diameter.