Data Structures in R – Factors and Data Frames

In the earlier tutorials we covered vectors, lists, arrays and matrices. In this tutorial we will look at factors and data frames.

Factors

Factors in R store categorical data. Categorical data takes discrete values. For example, a vector of the birth months of students in a class contains discrete values, that is, categorical data.

Their class is ‘factor’.
They have an attribute called levels, of ‘character’ mode. The levels have to be unique.
If the factor is ordered then it has an additional class called ‘ordered’.
A factor cannot contain values that are not in its levels; such values become NA.
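
The characteristics above can be checked directly. The following is a small sketch on made-up data (the variable names months, f and g are only for illustration); it also shows that a factor stores each element as an integer index into its levels.

```r
# Hypothetical example data: birth months of four students
months <- c("jan", "jan", "march", "april")
f <- factor(months)

class(f)        # "factor"
levels(f)       # "april" "jan" "march": unique, sorted, of character mode
as.integer(f)   # 2 2 3 1: each element is an index into levels(f)

# Assigning a value that is not a level yields NA (R also emits a warning)
g <- f
suppressWarnings(g[1] <- "may")
is.na(g[1])     # TRUE
```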

Let's start by looking at some examples


> x=c('jan','jan','march','april')
# the levels of the factor are the unique values in x (april, jan, march), sorted by name
> factor(x)
[1] jan   jan   march april
Levels: april jan march
# the class 'factor'
> f=factor(x)
> class(f)
[1] "factor"
#the levels are
> levels(f)
[1] "april" "jan"   "march"
#if you order the levels (this orders by name)
> o=ordered(f)
> o
[1] jan   jan   march april
Levels: april < jan < march
> class(o)
[1] "ordered" "factor"
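
Ordering by name puts april before jan, which is rarely what you want for months. As a sketch, assuming we want calendar order, the levels can be passed explicitly together with ordered=TRUE; comparisons then follow that order. The name cal is only illustrative.

```r
x <- c("jan", "jan", "march", "april")

# Give the levels in calendar order and mark the factor as ordered
cal <- factor(x, levels = c("jan", "march", "april"), ordered = TRUE)

levels(cal)              # "jan" "march" "april"
cal[1] < cal[4]          # TRUE: jan comes before april
as.character(max(cal))   # "april"
```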

In the above example we let R determine the levels. We can also give it the levels explicitly

> factor(x,levels=c('jan','feb','march'))
[1] jan   jan   march <NA> 
Levels: jan feb march

We gave it three levels, but the data contains a value (‘april’) that is not among them, so R assigned NA to that element. The unused level ‘feb’ is still kept.
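
Because values outside the levels silently turn into NA, it can help to check for them first. A minimal sketch, where allowed is a hypothetical name for the level set:

```r
x <- c("jan", "jan", "march", "april")
allowed <- c("jan", "feb", "march")

# Values present in the data but missing from the intended levels
setdiff(unique(x), allowed)   # "april"

f <- factor(x, levels = allowed)
sum(is.na(f))                 # 1: the 'april' element became NA
```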

We can tell R to calculate levels on its own, but exclude some values

# we want to exclude "jan"
> factor(x,exclude=c("jan"))
[1] <NA>  <NA>  march april
Levels: april march

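To make factor names more meaningful we can assign labels to each level. The labels are matched to the levels in order, and the default levels are sorted by name (april, jan, march), so the labels must be given in that same order

> factor(x,labels=c("April","January","March"))
[1] January January March   April  
Levels: April January March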
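
Since labels are matched to levels purely by position, it is easy to attach a label to the wrong level. One way to avoid this, sketched below, is to pass levels and labels together so the pairing is visible in the call.

```r
x <- c("jan", "jan", "march", "april")

# levels[i] is relabelled as labels[i]
f <- factor(x,
            levels = c("jan", "march", "april"),
            labels = c("January", "March", "April"))

as.character(f)   # "January" "January" "March" "April"
levels(f)         # "January" "March" "April"
```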

If you want to allow NA as a level

# NA is NOT a level here
> x=c('jan','jan','march','april',NA)
> factor(x)
[1] jan   jan   march april <NA> 
Levels: april jan march
#NA is a level here
> x=c('jan','jan','march','april',NA)
> factor(x,exclude=NULL)
[1] jan   jan   march april <NA> 
Levels: april jan march <NA>
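
Base R also has addNA(), which adds NA as a level to an existing factor. A short sketch:

```r
x <- c("jan", "jan", "march", "april", NA)

f <- addNA(factor(x))
levels(f)    # "april" "jan" "march" NA
nlevels(f)   # 4
table(f)     # the NA level gets its own count
```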

If you want to drop the levels that do not occur then pass the factor again through the factor function

# the level feb is not used
> factor(x,levels=c('jan','feb','march'))
[1] jan   jan   march <NA>   <NA>  
Levels: jan feb march
# the level feb is dropped
> factor(factor(x,levels=c('jan','feb','march')))
[1] jan   jan   march <NA>   <NA>  
Levels: jan march
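
The same result can be had with droplevels(), which states the intent more directly. A sketch using the same data:

```r
x <- c("jan", "jan", "march", "april", NA)
f <- factor(x, levels = c("jan", "feb", "march"))

levels(f)               # "jan" "feb" "march"
levels(droplevels(f))   # "jan" "march": the unused level feb is gone
```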

Let's now look at ways to extract or replace parts of a factor

> x=c('jan','jan','march','april')
> f=factor(x)
> f
[1] jan   jan   march april
Levels: april jan march
#let's extract the first two elements
> f[1:2]
[1] jan jan
Levels: april jan march
#also drop unused levels
> f[1:2,drop=TRUE]
[1] jan jan
Levels: jan
#replace an element with one of the existing levels
> f[1] <- 'march'
> f
[1] march jan   march april
Levels: april jan march
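
Replacement only works with values that are already levels. To store a new value, one approach is to extend the levels first, as in this sketch ("may" is just an example value):

```r
f <- factor(c("jan", "jan", "march", "april"))

# Add the new level, then assign it
levels(f) <- c(levels(f), "may")
f[1] <- "may"

as.character(f)   # "may" "jan" "march" "april"
levels(f)         # "april" "jan" "march" "may"
```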

Data Frames

The data frame is the most commonly used data structure in R modeling packages. These are the characteristics of a data frame:

A data frame is a matrix-like data structure, i.e. it has rows and columns. However, unlike a matrix, a data frame can contain columns with different types of values (integer, character etc).
A data frame has unique row names.
A data frame has the class “data.frame”.
A data frame can be thought of as a list of equal-length vectors, one per column. Therefore all values in a column are of the same type, whereas a row can hold values of different types.
A data frame can have non-unique column names.
Character vectors passed to data.frame() are converted to factors by default in R versions before 4.0.0, and the outputs in this tutorial reflect that older default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so character vectors stay as characters unless you ask for factors.
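
These properties can be inspected on a small, made-up data frame. The sketch below sets stringsAsFactors explicitly so that it behaves the same on old and new R versions.

```r
d <- data.frame(n = c(1, 2, 3),
                s = c("x", "y", "z"),
                stringsAsFactors = FALSE)

class(d)           # "data.frame"
is.list(d)         # TRUE: a data frame is a list of columns
length(d)          # 2: one list element per column
sapply(d, class)   # n is "numeric", s is "character"
row.names(d)       # "1" "2" "3": automatic, unique row names
```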

Let's look at some ways to create a data frame

Data frame from two numeric vectors

> a=c(1,2,3,4)
> b=c(5,6,7,8)
> d=data.frame(a,b)
> d
  a b
1 1 5
2 2 6
3 3 7
4 4 8

In the example above we created a data frame from two numeric vectors. The names of the rows and columns were assigned automatically. Let's specify the column names

> d=data.frame(v1=a,v2=b)
> d
  v1 v2
1  1  5
2  2  6
3  3  7
4  4  8

Let's also specify the row names

> d=data.frame(v1=a,v2=b,row.names=c("r1","r2","r3","r4"))
> d
   v1 v2
r1  1  5
r2  2  6
r3  3  7
r4  4  8

Data frame from a numeric vector and a character vector

> a=c(1,2,3,4)
> b=c("b1","b2","b3","b4")
> d=data.frame(v1=a,v2=b,row.names=c("r1","r2","r3","r4"))
> d
   v1 v2
r1  1 b1
r2  2 b2
r3  3 b3
r4  4 b4

This looks obvious; however, if you look at the structure of the data frame you will see that v2 does not contain characters

> str(d)
'data.frame':	4 obs. of  2 variables:
 $ v1: num  1 2 3 4
 $ v2: Factor w/ 4 levels "b1","b2","b3",..: 1 2 3 4

The second column is a factor, not a character vector. (This is the default before R 4.0.0; newer versions report chr here.)
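
If a column has already ended up as a factor, it can be converted back with as.character(). In this sketch stringsAsFactors = TRUE is passed explicitly to reproduce the behaviour above on any R version.

```r
d <- data.frame(v1 = c(1, 2, 3, 4),
                v2 = c("b1", "b2", "b3", "b4"),
                stringsAsFactors = TRUE)
is.factor(d$v2)      # TRUE

# Convert the factor column back to plain characters
d$v2 <- as.character(d$v2)
is.character(d$v2)   # TRUE
d$v2                 # "b1" "b2" "b3" "b4"
```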

Other ways to create data frames

Let's see what constructing a data frame from a list looks like

> a=c(1,2,3,4)
> b=list("one","two")
> data.frame(a=a,b=b)
  a b..one. b..two.
1 1     one     two
2 2     one     two
3 3     one     two
4 4     one     two

Each element of the list becomes its own column, and its single value is recycled to fill all the rows. A data frame can also be created from another data frame and a vector
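
When the list elements are named vectors of equal length, each one simply becomes a column and no recycling is needed. A sketch, with cols as an illustrative name:

```r
# A named list of equal-length vectors
cols <- list(a = c(1, 2, 3, 4),
             b = c("one", "two", "three", "four"))

d <- data.frame(cols, stringsAsFactors = FALSE)
dim(d)     # 4 2
names(d)   # "a" "b"
d$b[3]     # "three"
```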

> a=c(1,2,3,4)
> b=c("one","two","three","four")
> d=data.frame(a=a,b=b)
> d
  a     b
1 1   one
2 2   two
3 3 three
4 4  four
#Now we create a data frame from 'd' and a logical vector
> m=c(TRUE,FALSE,FALSE,TRUE)
> e=data.frame(d,m)
> e
  a     b     m
1 1   one  TRUE
2 2   two FALSE
3 3 three FALSE
4 4  four  TRUE

Subsetting

We now turn our attention to subsetting a data frame. Subsetting is the act of selecting specific values or a range of values from the data frame. Since the data frame is a two-dimensional structure, the easiest way to select a single value is to specify its row and column. Here’s an example.

> d=data.frame(a=c(1,2,3,4,5),b=c('a','b','c','d','e'),c=c(10L,20L,30L,40L,50L))
> d
  a b  c
1 1 a 10
2 2 b 20
3 3 c 30
4 4 d 40
5 5 e 50
# we first retrieve the element at first row,second column
> d[1,2]
[1] a
Levels: a b c d e
#recall that by default (before R 4.0.0) R stores characters as factors in a data frame
# Another way to select is to use the row and column name
> d["1","a"]
[1] 1
#It will be good to give rows useful names. 
> rownames(d)=c("r1","r2","r3","r4","r5")
> d
   a b  c
r1 1 a 10
r2 2 b 20
r3 3 c 30
r4 4 d 40
r5 5 e 50
> d["r1","a"]
[1] 1
#maybe assign names to columns as well
> colnames(d)=c("c1","c2","c3")
> d
   c1 c2 c3
r1  1  a 10
r2  2  b 20
r3  3  c 30
r4  4  d 40
r5  5  e 50
# note that column names need not be unique. we could have done this
> colnames(d)=c("c1","c1","c1")
> d
   c1 c1 c1
r1  1  a 10
r2  2  b 20
r3  3  c 30
r4  4  d 40
r5  5  e 50
#although there are few reasons to do this, so we restore the unique names
> colnames(d)=c("c1","c2","c3")
# We can't do this for rows though
> rownames(d)=c("r1","r2","r3","r4","r4")
Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘r4’
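
If duplicate column names do creep in, base R's make.unique() can repair them. The following is only a sketch on a small made-up data frame:

```r
# Assumed example: a data frame whose column names were set to duplicates
d <- data.frame(x = c(1, 2), y = c("a", "b"), z = c(10L, 20L))
colnames(d) <- c("c1", "c1", "c1")

anyDuplicated(colnames(d)) > 0   # TRUE: at least one name is repeated

# make.unique() appends .1, .2, ... to the repeats
colnames(d) <- make.unique(colnames(d))
colnames(d)                      # "c1" "c1.1" "c1.2"
```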

It's also possible to select multiple elements

#To select a complete row
> d["r1",]
   c1 c2 c3
r1  1  a 10
# To select a complete column
> d[,"c1"]
[1] 1 2 3 4 5
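#Note that a single column selected this way is returned as a plain vector, while the row selection above returns a data.frame
#another way to select a column is to use a subscript: d[1] returns a data.frame, d[,1] returns a vector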
> d[1]
   c1
r1  1
r2  2
r3  3
r4  4
r5  5
> d[,1]
[1] 1 2 3 4 5
# To select multiple columns or multiple rows do this
> d[1:2]
   c1 c2
r1  1  a
r2  2  b
r3  3  c
r4  4  d
r5  5  e
> d[1:2,]
   c1 c2 c3
r1  1  a 10
r2  2  b 20
# another way to select multiple columns
> d[,1:2]
   c1 c2
r1  1  a
r2  2  b
r3  3  c
r4  4  d
r5  5  e
# to get a subset of rows and columns do this
> d[1:2,1:2]
   c1 c2
r1  1  a
r2  2  b
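
Two further subsetting idioms are worth knowing: drop=FALSE keeps a single column as a data frame, and a logical condition selects the rows that satisfy it. A sketch, assuming the same columns as above (big is an illustrative name):

```r
d <- data.frame(c1 = c(1, 2, 3, 4, 5),
                c2 = c("a", "b", "c", "d", "e"),
                c3 = c(10L, 20L, 30L, 40L, 50L),
                stringsAsFactors = FALSE)

class(d[, "c1"])                 # "numeric": a plain vector
class(d[, "c1", drop = FALSE])   # "data.frame": the structure is kept

# Rows where c3 is greater than 20
big <- d[d$c3 > 20, ]
big$c2                           # "c" "d" "e"
```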

Another way to access a column in a data frame is to use the variable name with the symbol ‘$’

> d$c1
[1] 1 2 3 4 5
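
The [[ ]] operator does the same job as ‘$’ but also accepts a column name held in a variable, and ‘$’ can be used on the left-hand side to create a column. A brief sketch (col and total are illustrative names):

```r
d <- data.frame(c1 = c(1, 2, 3), c3 = c(10L, 20L, 30L))

identical(d$c1, d[["c1"]])   # TRUE

# [[ ]] works with a name stored in a variable
col <- "c3"
d[[col]]                     # 10 20 30

# Assigning through $ adds a new column
d$total <- d$c1 + d$c3
d$total                      # 11 22 33
```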

Other useful functions

dim() – This function gives the dimensions of the data frame
```
> dim(d)
[1] 5 3
```

names() – This function can be used to get the names of the variables (columns) of a data frame. It can also be used to change their names

> names(d) = c("col1","col2","col3")
> d
   col1 col2 col3
r1    1    a   10
r2    2    b   20
r3    3    c   30
r4    4    d   40
r5    5    e   50
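
names() can also be indexed, which allows renaming a single column without retyping the others. A sketch in which "letter" is a made-up new name:

```r
d <- data.frame(col1 = c(1, 2), col2 = c("a", "b"), col3 = c(10L, 20L))

# Rename only the column currently called col2
names(d)[names(d) == "col2"] <- "letter"
names(d)   # "col1" "letter" "col3"
```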

I() – use the I() function to create a data frame with characters instead of factors
```
> str(data.frame(x=c(1,2,3),y=I(c("a","b","c"))))
'data.frame': 3 obs. of 2 variables: 
$ x: num 1 2 3 
$ y:Class 'AsIs' chr [1:3] "a" "b" "c"
```
In the example above we create a data frame with two variables; however, we specify that the second variable should contain characters and should not be converted to a factor, which is the default behaviour for data frames (before R 4.0.0).

head() – Use the head function to get the first n rows of a data frame
```
> head(d,n=2L)
   col1 col2 col3
r1    1    a   10
r2    2    b   20
```
tail() – Similar to the head function, the tail function can be used to get the last n rows of a data frame.
```
> tail(d,n=2L)
   col1 col2 col3
r4    4    d   40
r5    5    e   50
```
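
Both functions also accept a negative n, meaning "all rows except that many". A small sketch:

```r
d <- data.frame(col1 = 1:5, col3 = c(10L, 20L, 30L, 40L, 50L))

head(d, n = -2)$col1   # 1 2 3: everything except the last two rows
tail(d, n = -2)$col1   # 3 4 5: everything except the first two rows
```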

rbind() – This function can be used to add a row to the data frame

> rbind(d,I(c(4,"f",60)))
   col1 col2 col3
r1    1    a   10
r2    2    b   20
r3    3    c   30
r4    4    d   40
r5    5    e   50
6     4 <NA>   60
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "f") :
  invalid factor level, NA generated

We tried to add a row containing a string ("f") that is not one of the levels of the factor column, so R complained and inserted NA. We should rather create the data frame so that R treats strings as strings.

> d=data.frame(a=c(1,2,3,4,5),b=c('a','b','c','d','e'),c=c(10L,20L,30L,40L,50L),stringsAsFactors=FALSE)
> d
  a b  c
1 1 a 10
2 2 b 20
3 3 c 30
4 4 d 40
5 5 e 50
> rbind(d,I(c(4,"f",60)))
  a b  c
1 1 a 10
2 2 b 20
3 3 c 30
4 4 d 40
5 5 e 50
6 4 f 60
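
One caveat: c(4,"f",60) is a character vector, so binding it as a row quietly turns the numeric columns into character columns. A sketch of the difference, where the new row is instead passed as a one-row data frame (coerced and typed are illustrative names):

```r
d <- data.frame(a = c(1, 2), b = c("a", "b"), c = c(10L, 20L),
                stringsAsFactors = FALSE)

# Binding a character vector coerces column a to character
coerced <- rbind(d, c(4, "f", 60))
class(coerced$a)   # "character"

# Binding a one-row data frame keeps each column's type
typed <- rbind(d, data.frame(a = 4, b = "f", c = 60L,
                             stringsAsFactors = FALSE))
class(typed$a)     # "numeric"
class(typed$c)     # "integer"
```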

This adds the row to the end of the data frame. To add the row at a particular index, two solutions have been suggested on StackOverflow

# Solution 1: shift rows r..n down by one position, then overwrite row r
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
  existingDF[r,] <- newrow
  existingDF
}

# Solution 2: append the row at the end, then reorder so that it lands at position r
insertRow2 <- function(existingDF, newrow, r) {
  existingDF <- rbind(existingDF,newrow)
  existingDF <- existingDF[order(c(1:(nrow(existingDF)-1),r-0.5)),]
  row.names(existingDF) <- 1:nrow(existingDF)
  return(existingDF)  
}
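
The same idea can also be written by splitting the data frame at the insertion point and binding the pieces back together. This is only a sketch, not one of the StackOverflow solutions; insert_row_at is a hypothetical helper name and it assumes newrow is a one-row data frame with the same columns.

```r
# Hypothetical helper: insert `newrow` so that it becomes row r of the result
insert_row_at <- function(df, newrow, r) {
  stopifnot(r >= 1, r <= nrow(df) + 1)
  top    <- df[seq_len(r - 1), , drop = FALSE]                     # rows before r
  bottom <- df[seq_len(nrow(df) - r + 1) + r - 1, , drop = FALSE]  # rows r..n
  out <- rbind(top, newrow, bottom)
  row.names(out) <- NULL   # renumber rows 1..n
  out
}

d <- data.frame(a = c(1, 2, 3), b = c("a", "b", "c"),
                stringsAsFactors = FALSE)
new <- data.frame(a = 99, b = "new", stringsAsFactors = FALSE)

res <- insert_row_at(d, new, 2)
res$a   # 1 99 2 3
```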

cbind() – This function can be used to add a new variable or column

> cbind(d,k=c(TRUE,TRUE,FALSE,TRUE,FALSE))
  a b  c     k
1 1 a 10  TRUE
2 2 b 20  TRUE
3 3 c 30 FALSE
4 4 d 40  TRUE
5 5 e 50 FALSE
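
Note that cbind() returns a new data frame and leaves d itself unchanged unless the result is assigned. Assigning through ‘$’ is a common alternative that modifies d in place. A sketch (wider and before are illustrative names):

```r
d <- data.frame(a = c(1, 2, 3), c = c(10L, 20L, 30L))

wider <- cbind(d, k = c(TRUE, FALSE, TRUE))
before <- names(d)
before         # "a" "c": d was not modified by cbind()
names(wider)   # "a" "c" "k"

# Adding the column directly to d
d$k <- c(TRUE, FALSE, TRUE)
names(d)       # "a" "c" "k"
```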

In the next tutorial we look at some more functions that can be applied to data frames.
