Text-based data is all around us: we find it in blogs, reviews, articles, social media, professional networks like LinkedIn, e-mails and surveys. Companies that can mine this data stand to gain valuable insights from it. This article is a guide that will help you get started with text mining using R.
Before heading into the technical details of text mining, let’s try and understand what your workflow should look like.
One package that makes text mining approachable in R is qdap. qdap (Quantitative Discourse Analysis Package) is an R package designed to assist in quantitative discourse analysis. It stands as a bridge between qualitative transcripts of dialogue and statistical analysis and visualization. Below I will showcase the techniques and tools that you can use for effective text mining with the qdap package.
The qdap package in R offers a wide array of text mining tools. Assume we have a paragraph of text and we want to count the most frequently used words in that text. We can use qdap's freq_terms() function, as shown below:
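A minimal sketch of counting frequent words with qdap's freq_terms(). The paragraph of text is made up for illustration, and the choice of the top 3 words is arbitrary.

```r
library(qdap)

# Hypothetical paragraph, used only for illustration
text <- paste(
  "Text mining turns text into data.",
  "Clean text gives better data, and better data gives better insights."
)

# Ask for the 3 most frequent words; the result is a table of words and counts
top_words <- freq_terms(text, top = 3)
print(top_words)
```

The result has one row per word, so it can be inspected or plotted like any other data frame.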
One of the most important parts of text mining is cleaning your messy text data. The tm package, which is installed alongside qdap as one of its dependencies, lets you do just that. tm allows R to treat the text elements of a vector or data frame as documents. It does this by first wrapping the text in a source object, using VectorSource() or DataframeSource(), and then converting that source into a corpus with VCorpus(). A corpus can then be manipulated and cleaned to suit our requirements. Let me illustrate how R does this with an example.
Let’s consider the following dataset from Netflix:
We are going to isolate the ratingLevel column and use it for text mining.
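Turning the column into a corpus might look like the sketch below. The netflix data frame here is a small hypothetical stand-in for the real dataset; only the ratingLevel column name comes from the article.

```r
library(tm)

# Hypothetical stand-in for the Netflix data (values invented for illustration)
netflix <- data.frame(
  title = c("Show A", "Show B", "Show C"),
  ratingLevel = c(
    "Parents strongly cautioned. May be unsuitable for children ages 14 and under.",
    "Suitable for all ages.",
    "Parental guidance suggested. May not be suitable for children."
  ),
  stringsAsFactors = FALSE
)

# Isolate the column, wrap it as a source, then build a corpus from the source
rating_source <- VectorSource(netflix$ratingLevel)
rating_corpus <- VCorpus(rating_source)

# Each row is now one document; content() returns the text of a document
content(rating_corpus[[2]])
```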
Once you have your corpus ready you can then proceed to pre-process your text data using the tm_map() function. Let’s illustrate how you can use the tm_map() function with an example:
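A minimal sketch, assuming two made-up rating descriptions rather than the real Netflix column. Because tolower is a base R function rather than a tm transformation, it is wrapped in content_transformer() here.

```r
library(tm)

# Hypothetical rating descriptions with mixed case
ratings <- c("Parental Guidance Suggested.", "Suitable for ALL Ages.")
corpus <- VCorpus(VectorSource(ratings))

# Apply tolower to every document in the corpus
lower_corpus <- tm_map(corpus, content_transformer(tolower))

content(lower_corpus[[1]])
content(lower_corpus[[2]])
```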
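As you can see in the code executed above, passing tolower to tm_map() has made all the words lowercase. In recent versions of tm, base R functions such as tolower should be wrapped in content_transformer() so that the result remains a valid document.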
The other pre-processing functions that you can pass to tm_map() are given below:
- removePunctuation() – removes all punctuation, such as periods and exclamation marks
- removeNumbers() – removes all numeric values from your text
- removeWords() – removes words that you specify, such as “is” and “and”
- stripWhitespace() – collapses runs of spaces and tabs into a single space
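The functions in the list above can be chained, one tm_map() call per step. The sketch below assumes a single made-up sentence; the words passed to removeWords() are arbitrary.

```r
library(tm)

# Hypothetical messy text: extra spaces, a number and punctuation
ratings <- c("Suitable   for ages 14 and  under!!")
corpus <- VCorpus(VectorSource(ratings))

corpus <- tm_map(corpus, removePunctuation)             # drop "!!"
corpus <- tm_map(corpus, removeNumbers)                 # drop "14"
corpus <- tm_map(corpus, removeWords, c("and", "for"))  # drop the chosen words
corpus <- tm_map(corpus, stripWhitespace)               # collapse leftover gaps

cleaned <- content(corpus[[1]])
cleaned
```

Running stripWhitespace() last matters, because the earlier steps leave gaps where the removed text used to be.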
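Word stemming is another pre-processing technique. It reduces related words to a common root by stripping suffixes, so that variants of the same word are counted together. Assume you have 3 words: complicated, complication and complicatedly. If you apply the stemDocument() function to these 3 words, all of them are reduced to the same stem. Stemming follows the suffix rules of a language; it does not look for an arbitrary shared prefix, so made-up names such as Ludacris and Ludabite are not reduced to ‘Luda’. This is illustrated by the code snippet shown below: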
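A minimal sketch of stemming with ordinary English words, chosen here for illustration because a suffix-stripping stemmer handles them well. It assumes the SnowballC package is installed, since stemDocument() relies on it.

```r
library(tm)  # stemDocument() uses the SnowballC package under the hood

# Three variants of the same word
words <- c("complicated", "complication", "complicatedly")

# Reduce each word to its stem
stems <- stemDocument(words)
print(stems)
```

The stem is not necessarily a dictionary word; it only needs to be the same for all variants so they can be counted together.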
The qdap package also offers other powerful cleaning functions such as:
- bracketX() – This will remove all text in brackets – “Data (Science)” becomes “Data”
- replace_number() – 10 becomes “ten”
- replace_abbreviation() – “Dr.” becomes “Doctor”
- replace_symbol() – “%” becomes “percent”
- replace_contraction() – “don’t” becomes “do not”
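The cleaning functions listed above all take a character vector and return the cleaned text. A short sketch, with input strings invented for illustration:

```r
library(qdap)

# Each call returns a cleaned copy of its input
no_brackets            <- bracketX("Data (Science)")
numbers_spelled        <- replace_number("I have 10 apples")
abbreviations_expanded <- replace_abbreviation("Dr. Smith is in")
symbols_spelled        <- replace_symbol("growth of 5%")
contractions_expanded  <- replace_contraction("I don't know")

print(c(no_brackets, numbers_spelled, abbreviations_expanded,
        symbols_spelled, contractions_expanded))
```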
Sometimes you will want to remove very common words such as “is”, “and”, “to”, “the” and the like. You might also want to remove words that are unlikely to have any significant impact on your analysis. For example, if you downloaded a dataset titled “World Bank”, it might be useful to remove the words “World” and “Bank”, as they are likely to be repeated many times without adding information. You can implement this in R by passing the stopwords() list, together with any words of your own, to removeWords() as shown below:
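A minimal sketch of stop word removal. The variable names words_gone and words_gone_forever follow the article; the sentence itself is a made-up rating description.

```r
library(tm)

# Hypothetical rating description
text <- "Parental guidance is suggested and the show may not be suitable for children"

# Standard English stop words plus one custom word of our own
words_gone <- c(stopwords("en"), "Parental")

# Remove every word in words_gone from the text (matching is case-sensitive)
words_gone_forever <- removeWords(text, words_gone)
words_gone_forever
```

removeWords() leaves the surrounding spaces in place, so stripWhitespace() is a natural follow-up step.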
Notice how the word “Parental”, which was added to the words_gone vector, has disappeared from words_gone_forever. stopwords("en") returns a list of stop words such as “not” and “be”, which were also eliminated from words_gone_forever.
There are two types of matrices that can tell you how many times a particular term occurs in each piece of text: the Term Document Matrix (TDM) and the Document Term Matrix (DTM). The structure and code needed to produce these two types of matrices are illustrated below:
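A minimal sketch of building both matrices from the same corpus. The three rating descriptions are hypothetical; the default settings of tm are used, which lowercase the text and keep words of three or more characters.

```r
library(tm)

# Hypothetical rating descriptions, one document each
ratings <- c(
  "parents strongly cautioned",
  "suitable for all ages",
  "parental guidance suggested"
)
corpus <- VCorpus(VectorSource(ratings))

tdm <- TermDocumentMatrix(corpus)  # terms in rows, documents in columns
dtm <- DocumentTermMatrix(corpus)  # documents in rows, terms in columns

dim(tdm)
dim(dtm)

# Convert to an ordinary matrix to look at the counts
as.matrix(tdm)
```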
In the TermDocumentMatrix we can see how the words are listed along the rows while each of the Ratings is in the columns.
In the DocumentTermMatrix we can see how the words are listed in the columns while the Ratings of shows are listed in the rows.
Now that you know how to clean and pre-process your text-based data, you need tools to visualize it so that you can present your insights to the board, the CEO, your manager or any other audience of interest. There are many tools that can be used to visualize text-based data.
The first visualization tool that you would want to use is the bar plot. We can use the bar plot to show the most frequent words that occur in our text-based data, as shown below:
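One way to get there, sketched under the assumption of four made-up rating descriptions: sum each row of the term document matrix to get total counts per word, sort them, and pass the top few to base R's barplot().

```r
library(tm)

# Hypothetical rating descriptions
ratings <- c(
  "parents strongly cautioned",
  "parental guidance suggested for children",
  "suitable for children",
  "children under seventeen need parents"
)
tdm <- TermDocumentMatrix(VCorpus(VectorSource(ratings)))

# Total count of each term across all documents, most frequent first
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Plot the five most frequent terms
barplot(head(term_freq, 5), las = 2, col = "steelblue",
        main = "Most frequent terms", ylab = "Frequency")
```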
The next visualization tool is the word cloud. Word clouds are useful because the size of each word instantly shows how frequently it appears, or how significant it is. We can implement word clouds in R using the code shown below.
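A minimal sketch using the wordcloud() function from the wordcloud package. The text is hypothetical, and min.freq is lowered to 1 only because the sample is tiny.

```r
library(tm)
library(wordcloud)

# Hypothetical rating descriptions
ratings <- c(
  "parents strongly cautioned",
  "parental guidance suggested for children",
  "suitable for children",
  "children under seventeen need parents"
)
tdm <- TermDocumentMatrix(VCorpus(VectorSource(ratings)))
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Word size is driven by frequency; the most frequent words sit in the center
wordcloud(words = names(term_freq), freq = term_freq,
          min.freq = 1, max.words = 50, random.order = FALSE)
```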
The neat thing about word clouds is that you can also use them to compare two different sets of text: a comparison cloud shows the words that distinguish each text, while a commonality cloud shows the words they share. All of these can be created using the wordcloud package in R, via its comparison.cloud() and commonality.cloud() functions.
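A rough sketch of both plots, assuming two made-up texts labelled kids and adults. Each text becomes one column of the term matrix, which is the shape that comparison.cloud() and commonality.cloud() expect.

```r
library(tm)
library(wordcloud)

# Two hypothetical texts to compare, one document each
kids   <- "suitable for children suitable for all ages family friendly"
adults <- "unsuitable for children mature content for adults only mature"

corpus <- VCorpus(VectorSource(c(kids, adults)))
term_matrix <- as.matrix(TermDocumentMatrix(corpus))
colnames(term_matrix) <- c("kids", "adults")

# Words that appear in both texts
commonality.cloud(term_matrix, max.words = 20)

# Words that set the two texts apart
comparison.cloud(term_matrix, max.words = 20)
```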
Another useful tool for data visualization of text is word networks. Word networks show you the relationship of a particular word with another word. Take a look at an example of a word network below:
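The sketch below does not draw a network. Instead it uses tm's findAssocs() to compute the word-to-word correlations that such a plot would show as links, which is a swapped-in technique for illustration. The text and the 0.5 correlation limit are assumptions.

```r
library(tm)

# Hypothetical rating descriptions
ratings <- c(
  "parental guidance suggested",
  "parental guidance advised",
  "suitable for children",
  "children need guidance"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(ratings)))

# Terms whose presence across documents correlates with "parental"
assocs <- findAssocs(dtm, "parental", corlimit = 0.5)
assocs
```

Each returned term is a candidate neighbor of the chosen word in a word network, with the correlation serving as the strength of the link.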
With the tools you have learnt above, you are now ready to tackle your first text mining dataset. The world of text mining is huge, and there are many more concepts and tools left for you to explore.
Never stop learning and happy text mining!