How do you import data into Python?

Python is growing in popularity thanks to the large number of packages that cater to the needs of the data scientist. Importing data into Python is thus the starting point for any data science project that you will undertake. This guide gives you a comprehensive introduction to the world of importing data into Python.

There are a number of file formats available that serve as sources of structured and unstructured data.

The various sources of structured data are:

  • .CSV files
  • .TXT files
  • Excel files
  • SAS and STATA files
  • HDF5 files
  • Matlab files

The various sources of unstructured data are:

  • Data from the web in the form of HTML pages

This guide will teach you the fundamentals of importing data from all these sources straight into your Python workspace of choice with minimal effort. So let’s get coding!

1) CSV files

CSV files usually contain mixed data types, so it’s best to import them as a data frame using the pandas package.

Import the pandas package and store the path of the file of interest in a variable called ‘filename’. Then call the function pd.read_csv() to read the file into Python and save the result in the variable ‘data’. Calling data.head() displays the first 5 rows of the dataset along with the column names.
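The steps above can be sketched as follows. This is a minimal example, not the original snippet: it writes a tiny CSV to a temporary folder first so that it runs anywhere, and the file name and column names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV so the example is self-contained; in practice
# 'filename' would point at your own file.
filename = os.path.join(tempfile.mkdtemp(), "example.csv")  # hypothetical name
with open(filename, "w") as f:
    f.write("name,age,score\nAda,36,91.5\nLin,29,78.0\n")

# Read the CSV into a DataFrame and preview it.
data = pd.read_csv(filename)
print(data.head())  # up to the first 5 rows, with the column names
```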

2) TXT files

The next type of file that we might encounter on our quest to becoming a master data scientist is the .TXT file. Importing these files into Python is as easy as importing a CSV file, using the built-in open() function inside a ‘with’ statement.

The ‘with’ statement relies on what Python calls a context manager, which closes the file automatically once the block ends. The open() function opens the file ‘file.txt’ as a read-only document when given the argument ‘r’. We then read the file using the myfile.read() method and print out the result. If you want to write to the .txt file instead, pass ‘w’ to open() in place of ‘r’, but note that ‘w’ erases the existing contents first; use ‘a’ if you only want to append to the end of the file.
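A minimal sketch of reading a text file this way is shown here. It creates ‘file.txt’ in a temporary folder first so that there is something to read; the file contents are invented for illustration.

```python
import os
import tempfile

# Create a small text file so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as myfile:   # 'w' creates the file or overwrites it
    myfile.write("first line\nsecond line\n")

# The context manager closes the file automatically when the block ends.
with open(path, "r") as myfile:   # 'r' opens the file read-only
    contents = myfile.read()      # the whole file as one string

print(contents)
```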

3) Excel Files

Excel files are a huge part of any business operation, so a data scientist needs to know exactly how to import them into Python for analysis. This can be done with the pandas package.

First import pandas. Then store the path of the Excel file in a variable called ‘file’ and load it into Python using the pd.ExcelFile() function. The ‘.sheet_names’ attribute lists the sheet names present in the Excel file, and the ‘.parse()’ method extracts the contents of a sheet, such as the first one, as a dataframe.
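The steps above might look like the sketch below. It builds a small two-sheet workbook first so that it is self-contained; the file name, sheet names and values are hypothetical, and it assumes the openpyxl package is installed, which pandas uses for .xlsx files.

```python
import os
import tempfile

import pandas as pd

# Build a two-sheet workbook so there is something to import.
# Reading and writing .xlsx files assumes openpyxl is installed.
file = os.path.join(tempfile.mkdtemp(), "example.xlsx")  # hypothetical name
with pd.ExcelWriter(file) as writer:
    pd.DataFrame({"region": ["North", "South"], "sales": [120, 95]}).to_excel(
        writer, sheet_name="Sales", index=False
    )
    pd.DataFrame({"region": ["North"], "costs": [80]}).to_excel(
        writer, sheet_name="Costs", index=False
    )

# Open the workbook, list its sheets, and parse the first one.
with pd.ExcelFile(file) as xl:
    sheets = xl.sheet_names
    df = xl.parse(sheets[0])

print(sheets)
print(df.head())
```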

4) SAS and STATA files

Statistical analysis software is widespread in the business analytics space and deserves due attention. Let’s take a close look at how we can get its files into Python for further analysis.

Importing SAS files can be done with the sas7bdat package, while importing STATA files requires only the pandas package.
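A rough sketch of both imports follows. The SAS part is wrapped in a helper function (load_sas is a name chosen here, not part of any library) because it needs the third-party sas7bdat package and a real .sas7bdat file. The STATA part writes a small .dta file with pandas and reads it back, with made-up column names.

```python
import os
import tempfile

import pandas as pd


def load_sas(path):
    """Read a .sas7bdat file into a DataFrame using the sas7bdat package."""
    from sas7bdat import SAS7BDAT  # third-party; imported only when needed

    with SAS7BDAT(path) as f:
        return f.to_data_frame()


# STATA needs only pandas. Write a small .dta file, then read it back.
dta_path = os.path.join(tempfile.mkdtemp(), "example.dta")  # hypothetical name
pd.DataFrame({"height": [170.0, 182.5], "weight": [65.0, 80.0]}).to_stata(
    dta_path, write_index=False
)
stata_df = pd.read_stata(dta_path)
print(stata_df.head())
```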

5) HDF5

HDF5 stands for Hierarchical Data Format version 5. The format is very popular for storing large quantities of numerical data, from a few gigabytes up to exabytes, and is widely used in the scientific community for storing experimental data. Fortunately for us, these files can be imported into Python quite easily.

The package used to import an HDF5 file is the h5py package. The function h5py.File() can open the file in both read-only ‘r’ and write ‘w’ modes.
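As an illustration, the sketch below uses both modes: it writes a small dataset in ‘w’ mode and then reads it back in ‘r’ mode. The file name and the dataset name ‘measurements’ are invented for the example, and NumPy is assumed to be installed alongside h5py.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.hdf5")  # hypothetical name

# 'w' creates the file (or overwrites it) for writing.
with h5py.File(path, "w") as f:
    f.create_dataset("measurements", data=np.arange(10.0))

# 'r' opens the file read-only.
with h5py.File(path, "r") as f:
    print(list(f.keys()))          # names of the top-level datasets and groups
    values = f["measurements"][:]  # read the whole dataset into a NumPy array

print(values)
```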

6) Matlab files

Matlab files are used quite extensively by electronics and computer engineers for designing various electrical and electronic systems. Matlab is built around linear algebra, and its files can store a lot of numerical data that we could use for analysis.

Matlab files can be imported using the scipy.io package and the scipy.io.loadmat() function that comes along with the package. A Matlab file is imported into Python as a dictionary containing key:value pairs, where each key is a Matlab variable name and each value is the data stored under it.
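A small sketch of this is given below. Since no .mat file ships with this post, it first saves one with scipy.io.savemat(); the file name and the variable name ‘signal’ are made up for illustration.

```python
import os
import tempfile

import numpy as np
import scipy.io

path = os.path.join(tempfile.mkdtemp(), "example.mat")  # hypothetical name

# Save a small array so there is a .mat file to load.
scipy.io.savemat(path, {"signal": np.array([1.0, 2.0, 3.0])})

mat = scipy.io.loadmat(path)
print(type(mat))           # a dict
print(sorted(mat.keys()))  # variable names plus metadata keys like '__header__'

signal = mat["signal"]     # loaded as a 2-D array, as Matlab stores matrices
print(signal.shape)
```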

7) Data from the web

Data from the web is usually unstructured, with no order or linearity to it. However, we can find structured data on some websites like Kaggle and the UCI machine learning repository. Such files can be downloaded directly into Python from the web.

To download a CSV file from a website, we can use the urlretrieve function from urllib.request, which saves the file locally. We can then load the saved file as a dataframe using the pandas package.
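The download step can be sketched as follows. To keep the example runnable offline, a local file exposed through a file:// URL stands in for the remote CSV; in practice ‘url’ would be the http(s) address of the file you want. All file names here are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

# Stand-in for a remote file: a local CSV addressed by a file:// URL.
work_dir = tempfile.mkdtemp()
source = Path(work_dir) / "remote.csv"
source.write_text("x,y\n1,2\n3,4\n")
url = source.as_uri()

# Download the file to a local path, then load it as a DataFrame.
local_path = os.path.join(work_dir, "downloaded.csv")
urlretrieve(url, local_path)
df = pd.read_csv(local_path)
print(df.head())
```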

In order to import HTML pages into Python we can make use of the ‘requests’ package and a couple of lines of code.

The requests.get() function sends a request to the server for the webpage and returns a response object, while the response’s .text attribute gives the content of the webpage as a string.
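A minimal sketch of this request is shown below. The helper name fetch_html and the timeout value are choices made for this example, and the URL in the commented-out call is only a placeholder.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the content of the page at 'url' as a string."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text         # the decoded body of the response


# Example call (needs network access; the URL is a placeholder):
# html = fetch_html("https://www.example.com")
# print(html[:200])
```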

Most of the time, raw data from webpages doesn’t make a lot of sense. It’s usually in the form of jumbled-up text and a lot of markup that is hard for anybody to read. In order to make sense of the data that we import from the web, we can make use of the BeautifulSoup package, a third-party package available for Python.

The .prettify() method returns your HTML as a neatly indented string, while the .title attribute gives you the title tag of your HTML page. For more information about the various functions and the in-depth documentation of the BeautifulSoup package please visit the link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
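Here is a short sketch of those two features. A small inline HTML string stands in for text fetched from the web, and the page content is invented for illustration.

```python
from bs4 import BeautifulSoup

# A small inline page stands in for HTML downloaded from the web.
html = (
    "<html><head><title>Sample page</title></head>"
    "<body><p>Hello, <b>world</b>!</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")  # parser bundled with Python

print(soup.prettify())    # the parse tree as an indented string
print(soup.title)         # the <title> tag itself
print(soup.title.string)  # just the text inside the tag
print(soup.get_text())    # all text on the page with the tags stripped
```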

Now that you have a pretty good idea about how you can import data into Python, you can finally start your next big Hackathon/Kaggle competition! Be sure to keep exploring the various ways you can import all kinds of data, and all the packages described in the Python documentation pages found on the web. There’s no end to the knowledge you can acquire.

Happy coding!


Machine Learning

How do you classify your Machine Learning Problem? 

– Kevin Jolly (Predicting the future using data is what I do in order to afford the next Fallout game) 

Machine learning is a vast topic, and it’s one of the most essential tool-sets in the tool-belt of a data scientist. However deep a machine learning problem takes you in your analysis of a large data set, the fundamental goal is to create predictions with good accuracy and minimal error.

That being said, many people get confused about what a machine learning problem is. Calculating the mean or standard deviation using R or Python is NOT a machine learning problem. Note: The key word here is prediction. Without prediction a machine learning problem loses its meaning.

At its core, we are teaching the machine (the CPU) how to predict future outcomes based on historical data.

There are 3 types of machine learning problems:

  1. Classification
  2. Regression
  3. Clustering

Classification:

In the classification type of machine learning problem, we want to classify data into a particular category. For example, deciding whether an incoming mail belongs in the spam folder or not is a problem of this type. Classification problems form the largest pool of machine learning problems out there, and learning how to implement a solution to them is essential when it comes to improving your skills as a data scientist.

Classification problems commonly make use of Decision Trees and Neural Networks.

Those of you who use R can use the predict() function, with the fitted model and the new data as arguments, in order to predict the class of each observation. Just make sure to set type = "class" as an argument.

The performance measures for classification are accuracy (the share of predictions that are correct) and error (the share that are wrong).
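To make the two measures concrete, here is a small Python sketch. The labels are invented for illustration, and the accuracy helper is written for this example rather than taken from a library.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


actual = ["spam", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "ham"]

acc = accuracy(actual, predicted)  # 4 of the 5 predictions match
err = 1 - acc                      # error is the remaining share

print("accuracy:", acc)
print("error:", err)
```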

Regression:

Everybody loves regression because you’re the fortune teller who predicts the future or the stock market guru who predicts which stocks will make your client the next million. These predictions are based on trends that you can observe in your data-set. Fundamentally you have two variables in a regression: the response variable and the explanatory variable. The response variable is usually plotted on the Y-axis while the explanatory variable is plotted on the X-axis, and using the plot() function in R you can hopefully observe a positive or a negative trend in your data. Examples of regression machine learning problems include predicting the future value of a stock, or predicting the performance of an athlete based on his weight, height or other parameters.

Again, in order to implement the regression machine learning problem in R you can make use of the predict() function, but this time the arguments are the regression model created by the lm() function and the unseen data you want to make a prediction on.

The performance measure of your regression model is the Root Mean Squared Error (RMSE).
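RMSE is the square root of the average squared difference between the actual and predicted values. A small Python sketch, with made-up numbers and a helper written for this example:

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

# Squared errors are 1, 0, 1 and 4, so the mean squared error is 1.5.
score = rmse(actual, predicted)
print("RMSE:", score)
```

The lower the RMSE, the closer the predictions are to the actual values.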

Clustering: 

Clustering is unsupervised: the data has no labels to learn from. This means that you don’t need trends in your data, nor are you going to be heavily dependent on neural networks. Using the kmeans() function in R, you take your data-set and divide it into a number of clusters set by the second argument of the function. Each cluster groups data with similar traits and properties, on which we can carry out further analysis.

One way to measure how similar the data within a cluster is, is to calculate the diameter of the cluster. If the diameter is small, you can be pretty sure that the data within the cluster is similar. Again, appropriate visualization tools can be used to look at your clusters. Calculating the distance between two clusters is another way of finding out whether the two clusters differ from each other. The larger the distance between two clusters, the higher the dissimilarity, which is a good thing.

The performance measure for clustering is Dunn’s index = minimal inter-cluster distance / maximum diameter. A higher value indicates compact, well-separated clusters.
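The formula can be sketched in Python as follows. This is one possible reading, under stated assumptions: distances are Euclidean, the diameter of a cluster is the largest distance between two of its points, and the inter-cluster distance is the smallest distance between points of two different clusters. Other definitions exist, and the points are invented for illustration.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Largest distance between any two points in one cluster."""
    pairs = combinations(cluster, 2)
    return max((math.dist(p, q) for p, q in pairs), default=0.0)


def inter_cluster_distance(a, b):
    """Smallest distance between a point in 'a' and a point in 'b'."""
    return min(math.dist(p, q) for p in a for q in b)


def dunn_index(clusters):
    """Minimal inter-cluster distance divided by maximum diameter."""
    min_separation = min(
        inter_cluster_distance(a, b) for a, b in combinations(clusters, 2)
    )
    max_diameter = max(diameter(c) for c in clusters)
    return min_separation / max_diameter


clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(5.0, 5.0), (5.0, 6.0)],
]

dunn = dunn_index(clusters)
print("Dunn index:", dunn)
```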