How do you classify your Machine Learning Problem?
– Kevin Jolly (Predicting the future using data is what I do in order to afford the next Fallout game)
Machine learning is a vast topic, and it’s one of the most essential tool-sets in the tool-belt of a data scientist. However deep a machine learning problem takes you into the analysis of a large data set, the fundamental goal stays the same: to create predictions with good accuracy and minimal error.
That being said, many people get confused about what a machine learning problem is. Calculating the mean or standard deviation using R or Python is NOT a machine learning problem. Note: The key word here is prediction. Without prediction, a machine learning problem loses its meaning.
At its core, we are teaching the machine (the CPU) how to predict future outcomes based on historical data.
There are 3 types of machine learning problems: classification, regression and clustering.
In the classification type of machine learning problem, we want to classify data into a particular category. For example, deciding whether an incoming mail belongs in the spam folder or not is a classification problem. Classification problems form the largest pool of machine learning problems out there, and learning how to implement a solution to such problems is essential when it comes to improving your skills as a data scientist.
Classification problems are commonly solved with models such as decision trees and neural networks.
Those of you who use R can pass a fitted classification model and your new data to the predict() function in order to predict the class of each observation. For models such as rpart decision trees, just make sure to set type = "class" as an argument so that you get class labels back.
The performance measures for classification are: Accuracy (the share of predictions that are correct) and Error (the share that are wrong, i.e. 1 minus the accuracy).
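To make accuracy and error concrete, here is a minimal sketch in plain Python (not R). The spam labels are made up for illustration, and the accuracy() helper is a hypothetical name, not a function from any library.

```python
# Hypothetical spam-filter results: 1 = spam, 0 = not spam.
# Both lists are invented for illustration.
actual    = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]


def accuracy(actual, predicted):
    """Return the fraction of predictions that match the true labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


acc = accuracy(actual, predicted)
err = 1 - acc  # error is simply the share of wrong predictions
print(f"accuracy = {acc:.2f}, error = {err:.2f}")
```

In practice you would compute these on data the model has not seen during training, otherwise the accuracy tends to look better than it really is.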
Everybody loves regression because you’re the fortune teller who predicts the future or the stock market guru who predicts which stocks will make your client the next million. These predictions are based on trends that you can observe in your data-set. In the simplest case you have two variables in a regression: the response variable and the explanatory variable. The response variable is usually plotted on the Y-axis while the explanatory variable is plotted on the X-axis, and using the plot() function in R you can hopefully observe a positive or a negative trend in your data. Examples of regression machine learning problems include predicting the future value of a stock, or predicting the performance of an athlete based on weight, height or other parameters.
Again, in order to implement the regression machine learning problem in R you can make use of the predict() function, but this time the arguments are going to be a regression model created with the lm() function and the unseen data you want to make a prediction on.
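The performance measure of your regression model is: Root Mean Squared Error (RMSE), the square root of the average squared difference between the predicted and the actual values. The lower the RMSE, the better.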
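The fit-then-predict-then-score workflow can be sketched in plain Python (not R) as follows. It assumes a single explanatory variable and an ordinary least-squares line. The data points and the helper names fit_line() and rmse() are made up for illustration, so treat it as a rough stand-in for the idea behind lm() and predict(), not as their implementation.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x for one explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented training data: explanatory variable x, response variable y.
train_x = [1, 2, 3, 4, 5]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(train_x, train_y)

# Invented unseen data to predict on and score against.
test_x = [6, 7]
test_y = [12.0, 14.1]
predictions = [intercept + slope * x for x in test_x]

print("predictions:", predictions)
print("RMSE:", rmse(test_y, predictions))
```

Scoring on the unseen points rather than the training points is the important part: an RMSE measured on the data used for fitting says little about how well the model predicts the future.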
Clustering is an unsupervised problem. This means that your data has no labelled outcome to learn from, so you don’t need trends in your data, nor are you going to be heavily dependent on neural networks. Using the kmeans() function in R, you are going to take your data-set and divide it into 1, 2, …, n clusters based on the second argument (centers) of the function. Each cluster is more or less a group of data points with similar traits and properties, which we can use to carry out further analysis.
One way to measure how similar the data within a cluster is, is to calculate the diameter of the cluster. If the diameter is small, you can be pretty sure that the data points within the cluster are similar to each other. Again, appropriate visualization tools can be used to look at your clusters. Calculating the distance between two clusters is another way of finding out whether the two clusters are dissimilar to each other. The larger the distance between the two clusters, the higher the dissimilarity, which is a good thing.
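To give a feel for what a k-means style function does under the hood, here is a minimal sketch in plain Python (not R). It is a simple Lloyd-style loop that starts from the first k points, which is an assumption made to keep the example deterministic. R’s kmeans() uses the Hartigan-Wong algorithm by default and random starting centers, so this is not a reimplementation of it. The points and the k_means() name are made up for illustration.

```python
import math


def k_means(points, k, max_iterations=100):
    """Lloyd-style k-means sketch; the first k points serve as starting centers."""
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep a center that lost all points
        if new_labels == labels and new_centers == centers:
            break  # nothing changed, so the clustering has converged
        labels, centers = new_labels, new_centers
    return labels, centers


# Invented 2-D data with two obvious groups, one near (1, 1) and one near (5, 5).
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.8, 1.1), (4.8, 5.1)]
labels, centers = k_means(points, k=2)
print("labels:", labels)
print("centers:", centers)
```

Note that no labels go in: the function only sees the points and the number of clusters, which is exactly what makes the problem unsupervised.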
The performance measure for clustering is: Dunn’s index = Minimal inter-cluster distance / Maximum diameter. A higher value is better, since it means compact clusters that sit far apart.
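Here is a minimal sketch of Dunn’s index in plain Python (not R). It assumes that the diameter of a cluster is the largest distance between any two of its points, and that the distance between two clusters is the smallest distance between a point of one and a point of the other. Other definitions exist, so take these as one reasonable choice. The clusters and helper names are made up for illustration.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Largest distance between any two points of one cluster (0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def inter_cluster_distance(a, b):
    """Smallest distance between a point in cluster a and a point in cluster b."""
    return min(math.dist(p, q) for p in a for q in b)


def dunn_index(clusters):
    """Minimal inter-cluster distance divided by maximum cluster diameter.

    Assumes at least two clusters and at least one cluster with two distinct points.
    """
    min_separation = min(
        inter_cluster_distance(a, b) for a, b in combinations(clusters, 2)
    )
    max_diameter = max(diameter(c) for c in clusters)
    return min_separation / max_diameter


# Invented clusterings: the second has one more spread-out cluster.
compact = [[(0, 0), (0, 1)], [(3, 0), (3, 1)]]
spread = [[(0, 0), (0, 2)], [(3, 0), (3, 1)]]

print("Dunn index, compact clusters:", dunn_index(compact))
print("Dunn index, spread clusters:", dunn_index(spread))
```

The compact clustering scores higher because its clusters are just as far apart but tighter, which matches the intuition that small diameters and large distances between clusters are what you want.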