Importing data into RStudio is the starting point for data analysis and machine learning. This guide gives you the tools you need to import the most common file types into RStudio, with code-based examples.
Data can exist in multiple formats. Some of the most common formats in which you will find your data are listed below:
- Flat Files – .CSV
- Tab Delimited Files – .TXT
- Excel Files – .XLSX
- SQL Databases
- Files from the Web
- SAS files
- STATA files
- SPSS files
This article covers how you can import data into RStudio from all these formats, with simple, easy-to-understand code snippets that I have run in RStudio.
Let’s start with flat files, commonly known as .csv files. CSV stands for Comma Separated Values. This file type is commonly found in Kaggle datasets, so it’s very useful to know how to import it if you plan on being a Kaggle Grandmaster.
My files are usually stored in the “Downloads” folder on my Mac. It’s important to know where your file is stored on your computer. The code snippet below shows you how you can import your csv file once you know the location of your file:
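A minimal sketch of this import with base R's read.csv(). The file name and its columns are made up for illustration, and the sample file is written to a temporary folder so that the example runs anywhere; in practice you would point csv_path at your own file.

```r
# Create a small sample CSV so the example is self-contained.
# In practice, set csv_path to your own file, e.g. "~/Downloads/my_data.csv" (hypothetical path).
csv_path <- file.path(tempdir(), "sample.csv")
writeLines(c("name,score", "Ana,90", "Ben,85"), csv_path)

# Import the file with base R.
# header = TRUE treats the first row as column names;
# stringsAsFactors = TRUE converts text columns to factors.
df <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

# Inspect the column types of the imported data frame
str(df)
```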
Note how we have an argument called stringsAsFactors. This argument converts all the strings in your dataset to factors if set to TRUE. Since R 4.0.0 it defaults to FALSE, so strings stay as character vectors unless you ask otherwise.
The next way you can import csv files into RStudio is by using the readr package. This is illustrated below:
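A minimal sketch of the readr approach, assuming the readr package is installed. The file and column names are illustrative. The second call shows one way to get a factor column, by declaring it through col_types.

```r
library(readr)

# Create a small sample CSV so the example is self-contained.
# In practice, set csv_path to your own file (hypothetical path: "~/Downloads/my_data.csv").
csv_path <- file.path(tempdir(), "sample_readr.csv")
writeLines(c("name,score", "Ana,90", "Ben,85"), csv_path)

# read_csv() returns a tibble and keeps text columns as character
df <- read_csv(csv_path)

# To read a column as a factor, declare it in col_types
df_factor <- read_csv(csv_path, col_types = cols(name = col_factor()))
```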
Note: The read_csv() function in the readr package has no stringsAsFactors argument. It always reads strings as character columns. If you need factors, declare the column type with col_factor() in the col_types argument, or convert the column after importing.
Another way you could import csv files into RStudio is by using the data.table package as shown below:
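A minimal sketch using fread() from data.table, assuming the package is installed. The file and its three columns are made up for illustration. The select and drop arguments are included to show how to keep or leave out specific columns.

```r
library(data.table)

# Create a small sample CSV so the example is self-contained.
csv_path <- file.path(tempdir(), "sample_fread.csv")
writeLines(c("name,score,city", "Ana,90,Lima", "Ben,85,Oslo"), csv_path)

# Import every column
dt_all <- fread(csv_path)

# Import only the columns you name in `select`
dt_selected <- fread(csv_path, select = c("name", "score"))

# Import everything except the columns you name in `drop`
dt_dropped <- fread(csv_path, drop = "city")
```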
The neat aspect of the fread() function is that you can select or drop specific variables through its select and drop arguments, importing only the columns you care about.
The next type of file that we want to import is the tab delimited file, usually saved with a .txt extension. In a tab delimited file the values are separated by tab characters rather than commas. We can import a file of this type by following the code snippet shown below:
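A minimal sketch for tab delimited files using base R. The file name and contents are illustrative. Setting sep = "\t" tells read.table() that the separator is a tab, and read.delim() is a shortcut with that default.

```r
# Create a small tab delimited sample file so the example is self-contained.
txt_path <- file.path(tempdir(), "sample.txt")
writeLines(c("name\tscore", "Ana\t90", "Ben\t85"), txt_path)

# sep = "\t" marks the tab character as the separator
df <- read.table(txt_path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# read.delim() uses sep = "\t" and header = TRUE by default
df_delim <- read.delim(txt_path, stringsAsFactors = FALSE)
```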
One of the most common types of files that you might encounter on your data science journey is the Excel file. Excel files are particularly common in business analytics and business intelligence. Luckily, there is an R package called readxl that lets you import them. To import an Excel file using the readxl package, we use code similar to the one displayed below:
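A minimal sketch with readxl, assuming the package is installed. To stay self-contained it reads datasets.xlsx, an example workbook that ships with readxl; in practice you would pass the path to your own .xlsx file.

```r
library(readxl)

# Path to an example workbook bundled with readxl.
# Replace with your own file, e.g. "~/Downloads/my_data.xlsx" (hypothetical path).
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets available in the workbook
sheet_names <- excel_sheets(xlsx_path)

# Read one sheet: use the first row as column names and skip no rows
iris_df <- read_excel(xlsx_path, sheet = "iris", col_names = TRUE, skip = 0)
```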
In the above code the sheet argument specifies which sheet to import from the Excel file, given either by name or by position. The col_names = TRUE argument uses the first row that is read as the column names of your data frame, while the skip argument specifies how many rows to skip before reading; for example, if skip = 5, read_excel() skips the first 5 rows of the sheet.
The next type of data that we come across is everyone’s favorite – SQL, or Structured Query Language, based databases. SQL databases are plentiful and widely used by businesses because they let you store large volumes of relational data efficiently in separate tables. These tables can be linked to one another through a shared column, and queries can be written to extract the data that we want.
The code snippet below illustrates how we can use R to connect to a SQL database:
Once we connect to a SQL database, the next step is to list the tables in the database and import a table of interest, so that we can write queries to extract the data we want.
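A minimal sketch of opening a connection with the DBI package, assuming DBI and RSQLite are installed. An in-memory SQLite database stands in for a real server here so the example runs anywhere. For a server database you would pass a different driver (for example RMariaDB::MariaDB() or RPostgres::Postgres()) along with your own host, user and password.

```r
library(DBI)

# Open a connection. ":memory:" creates a temporary SQLite database
# that is used here only as a stand-in for a real database.
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Confirm that the connection is open
connected <- dbIsValid(con)
print(connected)

# Always close the connection when you are done
dbDisconnect(con)
```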
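These steps can be sketched as follows, again assuming DBI and RSQLite are installed. The employees table and its contents are made up for illustration and are written into an in-memory database first so there is something to list and query.

```r
library(DBI)

# Stand-in database: an in-memory SQLite instance with one illustrative table
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "employees",
             data.frame(name = c("Ana", "Ben"), salary = c(5000, 4200)))

# List the tables present in the database
tables <- dbListTables(con)

# Import a whole table as a data frame
employees <- dbReadTable(con, "employees")

# Or run a query and import only the rows of interest
high_paid <- dbGetQuery(con, "SELECT name FROM employees WHERE salary > 4500")

dbDisconnect(con)
```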
On occasion we might come across flat files on the web that are hosted on Amazon S3 servers. Some of these files might be located on websites without a download link. When we want to import data from the web, we can use the code snippet shown below:
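A minimal sketch of reading a CSV straight from a URL. read.csv() accepts a URL in place of a file path. So that the example runs without a network connection, a local file:// URL stands in for a web address; with real data you would pass the https:// address of the file instead.

```r
# Create a small sample file and build a file:// URL that points to it.
# This URL is only a stand-in for a real web address.
csv_path <- file.path(tempdir(), "web_sample.csv")
writeLines(c("name,score", "Ana,90", "Ben,85"), csv_path)
local_path <- normalizePath(csv_path, winslash = "/")
data_url <- paste0("file:///", sub("^/", "", local_path))

# read.csv() reads from the URL exactly as it would from a path on disk
web_df <- read.csv(data_url, stringsAsFactors = FALSE)
```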
R also enables us to download files straight from the web using the download.file() function as shown below:
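A minimal sketch of download.file(), which copies the file at a URL to a destination path on your computer. A local file:// URL again stands in for a real web address so the example runs offline, and the file names are illustrative.

```r
# Create a source file and a file:// URL that stands in for a web address
source_path <- file.path(tempdir(), "remote_sample.csv")
writeLines(c("name,score", "Ana,90", "Ben,85"), source_path)
source_url <- paste0("file:///",
                     sub("^/", "", normalizePath(source_path, winslash = "/")))

# Download the file to a chosen destination.
# mode = "wb" keeps binary files such as .xlsx intact on Windows.
dest_path <- file.path(tempdir(), "downloaded_sample.csv")
status <- download.file(source_url, destfile = dest_path, mode = "wb")

# Import the downloaded copy like any other local file
downloaded_df <- read.csv(dest_path, stringsAsFactors = FALSE)
```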
The last kind of files that are very common in the world of business and analytics are SAS, STATA and SPSS files. These files are produced by widely used statistical packages that firms rely on for powerful analytics. Knowing how to import them is a valuable addition to the toolbelt of any data scientist.
You can import these files using the foreign package in R, which provides read.spss() for SPSS files, read.dta() for STATA files and read.xport() for SAS transport (XPORT) files, as shown below:
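A minimal sketch with the foreign package, assuming it is installed. To stay self-contained, the example first writes a small STATA file with write.dta() and then reads it back. The SPSS and SAS calls are shown as comments because they need real files, and their paths are hypothetical.

```r
library(foreign)

# Write a small STATA file so there is something to import
dta_path <- file.path(tempdir(), "sample.dta")
write.dta(data.frame(id = c(1, 2), score = c(90, 85)), dta_path)

# Import a STATA file
stata_df <- read.dta(dta_path)

# Import an SPSS file as a data frame (hypothetical path):
# spss_df <- read.spss("~/Downloads/my_data.sav", to.data.frame = TRUE)

# Import a SAS transport (XPORT) file (hypothetical path):
# sas_df <- read.xport("~/Downloads/my_data.xpt")
```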
All the important import functions in R have a wide array of arguments that you can adjust to your requirements. To look up the arguments that come with an import function, you just need to type in the code snippet shown below:
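A minimal sketch of the lookup, using read.csv() as the example function. Any other import function name can be substituted.

```r
# Open the documentation page for read.csv()
?read.csv

# Print only the argument list and defaults to the console
args(read.csv)

# Collect the argument names programmatically
arg_names <- names(formals(read.csv))
```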
RStudio then displays the function's documentation in the Help pane in the bottom right corner.
In conclusion, never neglect how crucial it is to import your data in the right way. Sometimes you simply get stuck, unable to import a dataset into your platform of choice for analysis. Hopefully this guide has covered everything you need to know about importing data into R.