Data Structures in R – Factors and Data Frames

In the earlier tutorials we learnt about vectors, lists, arrays and matrices. In this tutorial we will look at factors and data frames.

Factors

Factors in R store categorical data. Categorical data takes discrete values. For example, a vector holding the birth months of the students in a class contains discrete values, i.e. categorical data.

  • Their class is ‘factor’
  • They have an attribute called levels of ‘character’ mode. The levels have to be unique
  • If the factor is ordered then it has an additional class called ‘ordered’
  • A factor cannot contain values that are not in its levels

Let's start by looking at some examples.
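As a minimal sketch, assuming a made-up vector of birth months (the name birth_months and its values are illustrative and not from the original tutorial), a factor and an ordered factor can be created like this:

```r
# Hypothetical data: birth months of five students
birth_months <- c("january", "march", "february", "march", "april")

# Let R work out the levels (unique values, sorted alphabetically)
f <- factor(birth_months)
class(f)    # "factor"
levels(f)   # "april" "february" "january" "march"

# An ordered factor carries the additional class "ordered"
f_ord <- factor(birth_months,
                levels = c("january", "february", "march", "april"),
                ordered = TRUE)
class(f_ord)        # "ordered" "factor"
f_ord[1] < f_ord[2] # TRUE, because january comes before march in the levels
```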

By default, factor() determines the levels from the data. We can also give it the levels explicitly.

For instance, if we give it three levels but the data contains four distinct values, R assigns NA to every value that is not present in the levels (say, ‘april’).
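A short sketch of this behaviour, again assuming an illustrative birth_months vector:

```r
# Hypothetical data with four distinct values
birth_months <- c("january", "march", "february", "march", "april")

# Only three levels are supplied, so "april" has no matching level
f3 <- factor(birth_months, levels = c("january", "february", "march"))
f3          # the last element is printed as <NA>
levels(f3)  # "january" "february" "march"
is.na(f3)   # FALSE FALSE FALSE FALSE TRUE
```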

We can also let R determine the levels on its own while excluding some values.
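One way to do this is the exclude argument of factor(). The sketch below assumes the same illustrative birth_months vector:

```r
# Hypothetical data
birth_months <- c("january", "march", "february", "march", "april")

# Let R compute the levels, but leave "april" out of them
f_ex <- factor(birth_months, exclude = "april")
levels(f_ex)  # "february" "january" "march"
f_ex          # the excluded value shows up as <NA>
```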

To make the factor values more meaningful, we can assign a label to each level.
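A minimal sketch, assuming illustrative three-letter labels; the labels are matched to the levels by position:

```r
# Hypothetical data
birth_months <- c("january", "march", "february", "march", "april")

# labels[i] replaces levels[i]
f_lab <- factor(birth_months,
                levels = c("january", "february", "march", "april"),
                labels = c("Jan", "Feb", "Mar", "Apr"))
levels(f_lab)        # "Jan" "Feb" "Mar" "Apr"
as.character(f_lab)  # "Jan" "Mar" "Feb" "Mar" "Apr"
```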

If you want to allow NA as a level, pass exclude = NULL to factor() or use addNA().
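The sketch below shows both options on a small made-up vector (the name x and its values are assumptions for illustration):

```r
# Hypothetical data containing a missing value
x <- c("january", NA, "march")

f_default <- factor(x)             # NA is not a level by default
nlevels(f_default)                 # 2

f_na <- factor(x, exclude = NULL)  # exclude nothing, so NA becomes a level
levels(f_na)                       # "january" "march" NA

f_na2 <- addNA(f_default)          # adds NA as a level to an existing factor
nlevels(f_na2)                     # 3
```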

If you want to drop the levels that do not occur in the data, pass the factor through the factor() function again.
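A short sketch of dropping unused levels, assuming an illustrative factor f; droplevels() is shown as an equivalent alternative:

```r
# Hypothetical factor with four levels
f <- factor(c("january", "march", "february", "march", "april"))

# A subset keeps all four levels, even those that no longer occur
f_sub <- f[1:2]
levels(f_sub)              # "april" "february" "january" "march"

# Passing the subset through factor() again drops the unused levels
levels(factor(f_sub))      # "january" "march"

# droplevels() does the same thing
levels(droplevels(f_sub))  # "january" "march"
```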

Let's now look at ways to extract or replace parts of a factor.
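The following sketch, assuming an illustrative factor f, shows extraction with square brackets and replacement with values that are, and are not, among the levels:

```r
# Hypothetical factor
f <- factor(c("january", "march", "february", "march", "april"))

f[2]                 # second element; all levels are kept
f[c(1, 3)]           # first and third elements
f[-1]                # everything except the first element
f[2, drop = TRUE]    # second element, unused levels dropped

# Replacing with an existing level works
f[2] <- "april"

# Replacing with a value that is not a level gives NA (with a warning)
g <- f
g[3] <- "may"        # warning: invalid factor level, NA generated

# Add the level first, then the replacement succeeds
levels(f) <- c(levels(f), "may")
f[3] <- "may"
```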

Data Frames

The data frame is the data structure most commonly used by R modeling packages. These are the characteristics of a data frame:

  • A data frame is a matrix-like data structure, i.e. it has rows and columns. However, unlike a matrix, a data frame can contain columns with different types of values (integer, character, etc.)
  • A data frame has unique row names.
  • A data frame has the class “data.frame”
  • A data frame can be thought of as a list of equal-length vectors, one per column. Therefore all values in a column are of the same type, whereas a row can hold values of different types
  • A data frame can have non-unique column names, although data.frame() makes them unique by default (check.names = TRUE)
  • Character vectors passed to data.frame() are converted to factors by default in R versions before 4.0.0. Since R 4.0.0 the default is stringsAsFactors = FALSE, so pass stringsAsFactors = TRUE to get the conversion described in this tutorial.

Let's look at some ways to create a data frame.

  • data frame from two numeric vectors

    When a data frame is created from two numeric vectors, the names of the rows and columns are assigned automatically. We can also specify the column names.

    The row names can be specified as well.
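
    The three variants above can be sketched as follows. The vectors v1 and v2 and the names height, weight, a, b and c are assumptions for illustration:

    ```r
    # Hypothetical numeric vectors
    v1 <- c(1, 2, 3)
    v2 <- c(10, 20, 30)

    # Names assigned automatically: columns v1 and v2, rows 1 to 3
    df <- data.frame(v1, v2)

    # Specify the column names
    df_cols <- data.frame(height = v1, weight = v2)

    # Specify the row names as well
    df_rows <- data.frame(height = v1, weight = v2,
                          row.names = c("a", "b", "c"))
    df_rows
    ```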

  • data frame from a numeric vector and a character vector

    The result looks obvious; however, if you inspect the structure of the data frame with str(), you may find that v2 does not contain characters.

    In R versions before 4.0.0 (or with stringsAsFactors = TRUE), the second column is a factor and not a character vector.
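
    A minimal sketch of the difference, with stringsAsFactors set explicitly so that the result does not depend on the R version (v1 and v2 are illustrative):

    ```r
    # Hypothetical vectors: one numeric, one character
    v1 <- c(1, 2, 3)
    v2 <- c("a", "b", "c")

    # Old default behaviour: strings become factors
    df_fac <- data.frame(v1, v2, stringsAsFactors = TRUE)
    str(df_fac)        # v2 is shown as a Factor with 3 levels
    class(df_fac$v2)   # "factor"

    # Strings kept as strings (the default since R 4.0.0)
    df_chr <- data.frame(v1, v2, stringsAsFactors = FALSE)
    class(df_chr$v2)   # "character"
    ```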

  • Other ways to create data frames

    Let's see what constructing a data frame from a list looks like.

    Each element of the list becomes a column, so the list reads like a row, and it is recycled to fill in all the rows. A data frame can also be created from another data frame and a vector.
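
    Both cases can be sketched as below. The list contents and the column names x, y and z are assumptions, since the original example is not available:

    ```r
    # Hypothetical numeric vector and a list with one value per column
    v1 <- c(1, 2, 3)
    row_values <- list(x = 10, y = "a")

    # The list elements become columns and are recycled to three rows
    df_list <- data.frame(v1, row_values, stringsAsFactors = FALSE)
    df_list

    # A data frame built from another data frame and a vector
    df_more <- data.frame(df_list, z = c(TRUE, FALSE, TRUE))
    df_more
    ```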

We now turn our attention to subsetting a data frame. Subsetting is the act of selecting specific values or a range of values from the data frame. Since a data frame is a two-dimensional structure, the easiest way to select a single value is to specify its row and column inside square brackets.
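A small sketch of selecting a single value, assuming an illustrative data frame df with columns v1 and v2:

```r
# Hypothetical data frame
df <- data.frame(v1 = c(1, 2, 3),
                 v2 = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

df[2, 1]      # row 2, column 1, gives 2
df[3, "v2"]   # row 3, column named v2, gives "c"
```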

It's also possible to select multiple elements.
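A few common forms, sketched on the same kind of illustrative data frame:

```r
# Hypothetical data frame
df <- data.frame(v1 = c(1, 2, 3),
                 v2 = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

first_two <- df[1:2, ]          # rows 1 and 2, all columns
col_v2    <- df[, "v2"]         # a whole column, returned as a vector
picked    <- df[c(1, 3), "v1"]  # rows 1 and 3 of column v1
large     <- df[df$v1 > 1, ]    # rows that satisfy a condition
```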

Another way to access a column in a data frame is to use the variable name with the ‘$’ symbol.
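A minimal sketch of the ‘$’ operator on an illustrative data frame:

```r
# Hypothetical data frame
df <- data.frame(v1 = c(1, 2, 3),
                 v2 = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

df$v2        # the whole v2 column: "a" "b" "c"
df$v1[2]     # second value of the v1 column: 2

# The same syntax can be used to replace a column
df$v1 <- df$v1 * 10
```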

Here are some useful functions for working with data frames:

  • dim() – This function gives the dimensions of the data frame
  • names() – This function can be used to get the names of the variables (columns) of a data frame. It can also be used to change their names
  • I() – Use the I() function to create a data frame with characters instead of factors

    Wrapping the second variable in I() when creating a data frame with two variables specifies that it should keep its characters and should not be converted to a factor, which is the default behaviour of data.frame() in R versions before 4.0.0.
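
    A short sketch of I() in action. stringsAsFactors = TRUE is set explicitly to reproduce the old default, and the column names are illustrative:

    ```r
    # Protect the second variable from factor conversion with I()
    df_i <- data.frame(v1 = c(1, 2, 3),
                       v2 = I(c("a", "b", "c")),
                       stringsAsFactors = TRUE)

    is.factor(df_i$v2)      # FALSE
    is.character(df_i$v2)   # TRUE
    class(df_i$v2)          # "AsIs", the marker class added by I()
    ```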

  • head() – Use the head function to get the first n rows of a data frame
  • tail() – Similar to the head function, the tail function can be used to get the last n rows of a data frame.
  • rbind() – This function can be used to add a row to the data frame. If the new row contains a string that is not a level of the corresponding factor column, R complains. It is better to create the data frame so that R treats strings as strings (stringsAsFactors = FALSE, which is the default since R 4.0.0).

    rbind() adds the row to the end of the data frame. To add the row at a particular index, two solutions have been suggested on StackOverflow.
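
    The sketch below appends a row and then inserts one at a given position. The splitting approach is one common idiom and is not necessarily either of the StackOverflow solutions mentioned above; the names df, new_row and k are assumptions:

    ```r
    # Hypothetical data frame with strings kept as strings
    df <- data.frame(v1 = c(1, 2, 3),
                     v2 = c("a", "b", "c"),
                     stringsAsFactors = FALSE)
    new_row <- data.frame(v1 = 4, v2 = "d", stringsAsFactors = FALSE)

    # Append the row at the end
    df_end <- rbind(df, new_row)

    # Insert the row after position k by splitting the data frame
    k <- 1
    df_mid <- rbind(df[seq_len(k), ], new_row, df[-seq_len(k), ])
    rownames(df_mid) <- NULL   # renumber the rows 1 to 4
    df_mid
    ```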

  • cbind() – This function can be used to add a new variable or column
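
The inspection and binding functions listed above can be sketched together on a small made-up data frame (all names and values here are assumptions for illustration):

```r
# Hypothetical data frame with ten rows
df <- data.frame(v1 = 1:10,
                 v2 = letters[1:10],
                 stringsAsFactors = FALSE)

dim(df)                          # 10 rows, 2 columns
names(df)                        # "v1" "v2"
names(df) <- c("id", "letter")   # rename the columns

first_rows <- head(df, 3)        # first 3 rows
last_rows  <- tail(df, 2)        # last 2 rows

# Add a new column with cbind()
df_wide <- cbind(df, even = df$id %% 2 == 0)
names(df_wide)                   # "id" "letter" "even"
```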

In the next tutorial we will look at some more functions that can be applied to data frames.
