# Data Structures in R – Factors and Data Frames

In the earlier tutorials we learnt about vectors, lists, arrays and matrices. In this tutorial we will look at factors and data frames.

### Factors

Factors in R store categorical data. Categorical data takes a limited set of discrete values. For example, a vector holding the month of birth of each student in a class contains discrete values, i.e. categorical data.

• Their class is ‘factor’
• They have an attribute called levels of ‘character’ mode. The levels have to be unique
• If the factor is ordered then it has an additional class called ‘ordered’
• A factor cannot contain values that are not in its levels

Let's start by looking at some examples.

```
> x=c('jan','jan','march','april')
# the levels of the factor are the unique values in x: april, jan, march
> factor(x)
[1] jan   jan   march april
Levels: april jan march
# the class is 'factor'
> f=factor(x)
> class(f)
[1] "factor"
# the levels are
> levels(f)
[1] "april" "jan"   "march"
# make the factor ordered (by default the levels are ordered alphabetically)
> o=ordered(f)
> o
[1] jan   jan   march april
Levels: april < jan < march
> class(o)
[1] "ordered" "factor"
```
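
The characteristics listed earlier can be checked directly. The following is a minimal sketch, not part of the original examples, that looks at how a factor is stored: as integer codes that index into its levels. The variable names `codes`, `lv` and `g` are chosen here purely for illustration.

```r
# Build a factor from the same month vector used above
x <- c("jan", "jan", "march", "april")
f <- factor(x)

# Internally a factor is an integer vector of codes plus a 'levels' attribute
codes <- as.integer(f)   # position of each value within levels(f)
lv <- levels(f)          # sorted unique values, as character strings

print(codes)
print(lv)

# Indexing the levels by the codes reconstructs the original values
print(identical(lv[codes], x))

# Assigning a value that is not a level stores NA (R also emits a warning,
# which is suppressed here to keep the output short)
g <- f
suppressWarnings(g[2] <- "feb")
print(is.na(g[2]))
```

This is why a factor cannot hold a value outside its levels: there is no code that could point to it.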

In the above example we let R determine the levels. We can also give it the levels explicitly.

```
> factor(x,levels=c('jan','feb','march'))
[1] jan   jan   march <NA>
Levels: jan feb march
```

We gave it three levels, but the data contains a value ('april') that is not among them, so R assigned NA to that value.

We can also tell R to calculate the levels on its own but exclude some values.

```
# we want to exclude "jan"
> factor(x,exclude=c("jan"))
[1] <NA>  <NA>  march april
Levels: april march
```

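To make factor names more meaningful we can assign a label to each level. Labels are matched to the levels by position, and the default levels are sorted alphabetically (april, jan, march), so the labels must be given in that order.

```
> factor(x,labels=c("April","January","March"))
[1] January January March   April
Levels: April January March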
```
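
Because labels are paired with levels purely by position, it is easy to attach the wrong label when relying on the default alphabetical level order. A safer pattern, sketched here as a suggestion rather than something from the original tutorial, is to pass `levels` and `labels` together so the pairing is visible in the call.

```r
x <- c("jan", "jan", "march", "april")

# Give levels and labels together so that each label is matched
# to its level by position, independent of alphabetical sorting
f <- factor(x,
            levels = c("jan", "march", "april"),
            labels = c("January", "March", "April"))

print(f)
print(levels(f))
```

With this form the levels also appear in calendar order rather than alphabetical order.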

If you want to allow NA as a level:

```
# NA is NOT a level here
> x=c('jan','jan','march','april',NA)
> factor(x)
[1] jan   jan   march april <NA>
Levels: april jan march
# NA is a level here
> x=c('jan','jan','march','april',NA)
> factor(x,exclude=NULL)
[1] jan   jan   march april <NA>
Levels: april jan march <NA>
```

If you want to drop the levels that do not occur in the data, pass the factor through the factor function again.

```
# the level feb is not used
> factor(x,levels=c('jan','feb','march'))
[1] jan   jan   march <NA>  <NA>
Levels: jan feb march
# the level feb is dropped
> factor(factor(x,levels=c('jan','feb','march')))
[1] jan   jan   march <NA>  <NA>
Levels: jan march
```
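
Base R also provides `droplevels()` for the same purpose. The sketch below is an addition to the original examples and shows it removing the unused level; the names `f` and `g` are illustrative.

```r
x <- c("jan", "jan", "march", "april", NA)
f <- factor(x, levels = c("jan", "feb", "march"))

# droplevels() removes levels that have no matching values in the factor
g <- droplevels(f)

print(levels(f))
print(levels(g))
```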

Let's now look at ways to extract or replace parts of a factor.

```
> x=c('jan','jan','march','april')
> f=factor(x)
> f
[1] jan   jan   march april
Levels: april jan march
# let's extract the first two elements
> f[1:2]
[1] jan jan
Levels: april jan march
# also drop unused levels
> f[1:2,drop=TRUE]
[1] jan jan
Levels: jan
# replace an element with one of the levels
> f[1] <- 'march'
> f
[1] march jan   march april
Levels: april jan march
```

### Data Frames

The data frame is one of the most widely used data structures in R, and most modeling packages expect their input as one. These are the characteristics of a data frame:

• A data frame is a matrix-like data structure, i.e. it has rows and columns. However, unlike a matrix, a data frame can contain columns with different types of values (integer, character etc.).
• A data frame has unique row names.
• A data frame has the class “data.frame”.
• A data frame can be thought of as a list of equal-length vectors, one per column. Therefore all values in a column are of the same type, whereas a row can hold values of different types.
• A data frame can have non-unique column names.
• Character vectors passed to `data.frame()` are converted to factors by default in R versions before 4.0.0. From R 4.0.0 onwards the default is `stringsAsFactors=FALSE`, so they stay as characters; the outputs in this tutorial show the older behaviour.

Let's look at some ways to create a data frame.

• data frame from two numeric vectors

```
> a=c(1,2,3,4)
> b=c(5,6,7,8)
> d=data.frame(a,b)
> d
  a b
1 1 5
2 2 6
3 3 7
4 4 8
```

In the example above we created a data frame from two numeric vectors. The names of the rows and columns were assigned automatically. Let's specify the column names.

```
> d=data.frame(v1=a,v2=b)
> d
  v1 v2
1  1  5
2  2  6
3  3  7
4  4  8
```

Let's also specify the row names.

```
> d=data.frame(v1=a,v2=b,row.names=c("r1","r2","r3","r4"))
> d
   v1 v2
r1  1  5
r2  2  6
r3  3  7
r4  4  8
```
• data frame from a numeric vector and a character vector

```
> a=c(1,2,3,4)
> b=c("b1","b2","b3","b4")
> d=data.frame(v1=a,v2=b,row.names=c("r1","r2","r3","r4"))
> d
   v1 v2
r1  1 b1
r2  2 b2
r3  3 b3
r4  4 b4
```

This looks obvious; however, if you look at the structure of the data frame you will see that v2 does not contain characters.

```
> str(d)
'data.frame':	4 obs. of  2 variables:
 $ v1: num  1 2 3 4
 $ v2: Factor w/ 4 levels "b1","b2","b3",..: 1 2 3 4
```

The second column contains factors and not characters
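
Whether a character column ends up as a factor depends on the `stringsAsFactors` argument, whose default changed in R 4.0.0. The sketch below, added here as an illustration, sets the argument explicitly in both directions and checks the resulting column classes; the names `d_fac` and `d_chr` are made up for this example. It also confirms that a data frame is a list with one element per column.

```r
a <- c(1, 2, 3, 4)
b <- c("b1", "b2", "b3", "b4")

# Request each behaviour explicitly rather than relying on the
# version-dependent default
d_fac <- data.frame(v1 = a, v2 = b, stringsAsFactors = TRUE)
d_chr <- data.frame(v1 = a, v2 = b, stringsAsFactors = FALSE)

print(sapply(d_fac, class))
print(sapply(d_chr, class))

# A data frame is a list whose elements are the columns
print(is.list(d_chr))
print(length(d_chr))   # number of columns
```

Stating the argument explicitly keeps a script's behaviour the same across R versions.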

• Other ways to create data frames

Let's see what constructing a data frame from a list looks like.

```
> a=c(1,2,3,4)
> b=list("one","two")
> data.frame(a=a,b=b)
  a b..one. b..two.
1 1     one     two
2 2     one     two
3 3     one     two
4 4     one     two
```

Each element of the list becomes a separate column, and its single value is recycled to fill all the rows. A data frame can also be created from another data frame and a vector.

```
> a=c(1,2,3,4)
> b=c("one","two","three","four")
> d=data.frame(a=a,b=b)
> d
  a     b
1 1   one
2 2   two
3 3 three
4 4  four
# now we create a data frame from 'd' and a logical vector
> m=c(TRUE,FALSE,FALSE,TRUE)
> e=data.frame(d,m)
> e
  a     b     m
1 1   one  TRUE
2 2   two FALSE
3 3 three FALSE
4 4  four  TRUE
```

We now turn our attention to subsetting a data frame. Subsetting is the act of selecting specific values or a range of values from the data frame. Since the data frame is a two-dimensional structure, the easiest way to select a single value is to specify its row and column. Here's an example.

```
> d=data.frame(a=c(1,2,3,4,5),b=c('a','b','c','d','e'),c=c(10L,20L,30L,40L,50L))
> d
  a b  c
1 1 a 10
2 2 b 20
3 3 c 30
4 4 d 40
5 5 e 50
# we first retrieve the element at the first row, second column
> d[1,2]
[1] a
Levels: a b c d e
# recall that (before R 4.0.0) R stores characters as factors in a data frame by default
# another way to select is to use the row and column names
> d["1","a"]
[1] 1
# it is good to give rows useful names
> rownames(d)=c("r1","r2","r3","r4","r5")
> d
   a b  c
r1 1 a 10
r2 2 b 20
r3 3 c 30
r4 4 d 40
r5 5 e 50
> d["r1","a"]
[1] 1
# maybe assign names to columns as well
> colnames(d)=c("c1","c2","c3")
> d
   c1 c2 c3
r1  1  a 10
r2  2  b 20
r3  3  c 30
r4  4  d 40
r5  5  e 50
# note that column names need not be unique; we could have done this
> colnames(d)=c("c1","c1","c1")
> d
   c1 c1 c1
r1  1  a 10
r2  2  b 20
r3  3  c 30
r4  4  d 40
r5  5  e 50
# although there wouldn't be many reasons to do this
# restore the unique column names before continuing
> colnames(d)=c("c1","c2","c3")
# we can't do this for rows though
> rownames(d)=c("r1","r2","r3","r4","r4")
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
non-unique value when setting 'row.names': ‘r4’
```

It's also possible to select multiple elements.

```
# to select a complete row
> d["r1",]
   c1 c2 c3
r1  1  a 10
# note that the object returned is of class data.frame
# to select a complete column
> d[,"c1"]
[1] 1 2 3 4 5
# a single column selected this way is returned as a vector
# another way to select a row or a column is to use a numeric subscript
> d[1]
   c1
r1  1
r2  2
r3  3
r4  4
r5  5
> d[,1]
[1] 1 2 3 4 5
# to select multiple rows/columns do this
> d[1:2]
   c1 c2
r1  1  a
r2  2  b
r3  3  c
r4  4  d
r5  5  e
> d[1:2,]
   c1 c2 c3
r1  1  a 10
r2  2  b 20
# another way to select multiple columns
> d[,1:2]
   c1 c2
r1  1  a
r2  2  b
r3  3  c
r4  4  d
r5  5  e
# to get a subset of the data frame do this
> d[1:2,1:2]
   c1 c2
r1  1  a
r2  2  b
```
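
The different indexing forms return different kinds of objects, which matters when the result is passed on to other functions. The following sketch is an addition to the original examples. It rebuilds a data frame like `d` with `stringsAsFactors=FALSE` (an assumption made here to keep the types simple) and compares the forms; `drop=FALSE` is the standard argument for keeping the data frame structure when a single column is selected.

```r
d <- data.frame(c1 = c(1, 2, 3, 4, 5),
                c2 = c("a", "b", "c", "d", "e"),
                c3 = c(10L, 20L, 30L, 40L, 50L),
                row.names = c("r1", "r2", "r3", "r4", "r5"),
                stringsAsFactors = FALSE)

v  <- d[, "c1"]                 # matrix-style, one column: simplified to a vector
df <- d[, "c1", drop = FALSE]   # drop = FALSE keeps the data.frame structure
l  <- d["c1"]                   # list-style single bracket: always a data.frame
e  <- d[["c1"]]                 # list-style double bracket: the column itself

print(class(v))
print(class(df))
print(class(l))
print(identical(v, e))
```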

Another way to access a column in a data frame is to use the variable name with the `$` symbol.

```
> d$c1
[1] 1 2 3 4 5
```

• `dim()` – This function gives the dimensions of the data frame
```
> dim(d)
[1] 5 3
```
• `names()` – This function can be used to get the names of the variables (columns) of a data frame. It can also be used to change their names
```
> names(d) = c("col1","col2","col3")
> d
   col1 col2 col3
r1    1    a   10
r2    2    b   20
r3    3    c   30
r4    4    d   40
r5    5    e   50
```
• `I()` – Use the I() function to create a data frame with characters instead of factors
```
> str(data.frame(x=c(1,2,3),y=I(c("a","b","c"))))
'data.frame':	3 obs. of  2 variables:
 $ x: num  1 2 3
 $ y:Class 'AsIs'  chr [1:3] "a" "b" "c"
```

In the example above we create a data frame with two variables; however, we specify that the second variable should contain characters and should not be converted to a factor, which is the default behaviour of `data.frame()` before R 4.0.0.

• `head()` – Use the head function to get the first n rows of a data frame
```
> head(d,n=2L)
   col1 col2 col3
r1    1    a   10
r2    2    b   20
```
• `tail()` – Similar to the head function, the tail function can be used to get the last n rows of a data frame.
```
> tail(d,n=2L)
   col1 col2 col3
r4    4    d   40
r5    5    e   50
```
• `rbind()` – This function can be used to add a row to the data frame
```
> rbind(d,I(c(4,"f",60)))
   col1 col2 col3
r1    1    a   10
r2    2    b   20
r3    3    c   30
r4    4    d   40
r5    5    e   50
6     4 <NA>   60
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "f") :
invalid factor level, NA generated
```

We tried to add a row containing a value ('f') that is not one of the levels of the factor column, so R complained and stored NA instead. We should rather create the data frame so that R treats strings as strings.

```
> d=data.frame(a=c(1,2,3,4,5),b=c('a','b','c','d','e'),c=c(10L,20L,30L,40L,50L),stringsAsFactors=FALSE)
> d
  a b  c
1 1 a 10
2 2 b 20
3 3 c 30
4 4 d 40
5 5 e 50
> rbind(d,I(c(4,"f",60)))
  a b  c
1 1 a 10
2 2 b 20
3 3 c 30
4 4 d 40
5 5 e 50
6 4 f 60
```
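
One caveat with this approach: `c(4,"f",60)` is a character vector, so all three values are strings, and the numeric columns of the result may end up stored as characters even though the printed output looks the same. A hedged alternative, not taken from the original tutorial, is to pass the new row as a one-row data frame so that each value keeps its own type. The name `new_row` is illustrative.

```r
d <- data.frame(a = c(1, 2, 3, 4, 5),
                b = c("a", "b", "c", "d", "e"),
                c = c(10L, 20L, 30L, 40L, 50L),
                stringsAsFactors = FALSE)

# The new row is itself a data frame, so each value keeps its own type
new_row <- data.frame(a = 4, b = "f", c = 60L, stringsAsFactors = FALSE)
d2 <- rbind(d, new_row)

print(d2)
print(sapply(d2, class))
```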

This adds the row to the end of the data frame. To add the row at a particular index, two solutions have been suggested on StackOverflow:

```
# Shift rows r..n down by one position, then write the new row at index r
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
  existingDF[r,] <- newrow
  existingDF
}

# Append the new row, then reorder the rows so that it lands at index r
insertRow2 <- function(existingDF, newrow, r) {
  existingDF <- rbind(existingDF,newrow)
  existingDF <- existingDF[order(c(1:(nrow(existingDF)-1),r-0.5)),]
  row.names(existingDF) <- 1:nrow(existingDF)
  return(existingDF)
}
```
• `cbind()` – This function can be used to add a new variable or column
```
> cbind(d,k=c(TRUE,TRUE,FALSE,TRUE,FALSE))
  a b  c     k
1 1 a 10  TRUE
2 2 b 20  TRUE
3 3 c 30 FALSE
4 4 d 40  TRUE
5 5 e 50 FALSE
```

In the next tutorial we look at some more functions that can be applied to data frames.