Data Structures in R, Data Frame Operations

In the previous tutorial we saw an introduction to factors and data frames in R. In this tutorial we look at some more functions that can be used on data frames.

Multiple columns into single column – stack()

The first function that we look at today is the
stack
function. It can be used to combine multiple columns of a dataframe into a single column. The stack operation applies to only vector columns and columns containing factors will be ignored. For columns containing different types of vectors, the “unlist” function determines the type of the resultant column. (NULL < raw < logical < integer < double < complex < character < list < expression)

> d=data.frame(a=c(1,2,3,4),b=c(5,6,7,8))
> d
  a b
1 1 5
2 2 6
3 3 7
4 4 8
> stack(d)
  values ind
1      1   a
2      2   a
3      3   a
4      4   a
5      5   b
6      6   b
7      7   b
8      8   b

stack function produces two columns. The first column contains the values from all the vectors and the second column shows which vector the particular column is from. The unstack function reverses the operation

> k=stack(d)
> unstack(k)
  a b
1 1 5
2 2 6
3 3 7
4 4 8

access data frame variables directly – attach()

It seems to be a considerable effort accessing a variable by using the DataFrameName$VariableName. Would’nt it be really helpful if we could access the variable name directly? This can be done by using the
attach()
function. Here’s an example

# we first create a data frame
> d=data.frame(x1=c(1,2,3,4),x2=c(10,11,12,13))
> d
  x1 x2
1  1 10
2  2 11
3  3 12
4  4 13
# get the x1 column
> d$x1
[1] 1 2 3 4
#attach the x1 data frame
> attach(d)
# now we can access the elements directly
> x1
[1] 1 2 3 4

So how does it all work? When a data frame is attached to the R environment, R puts the variables in the environment search path. What that means is when we call a variable, R searches not only its environment but all all the columns in the attached data frame. If there is already a variable of that name in the workspace then R throws a warning.

> x1=c("a","b","c","d")
> attach(d)
The following object is masked _by_ .GlobalEnv:
    x1
>
#Calling x1 gives us the x1 variable in the user's environment and not the column from the data frame. 
> x1
[1] "a" "b" "c" "d"

If you attach multiple data frames then the data frame attached last is put on top of the search path.

> d=data.frame(x1=c(1,2,3,4),x2=c(101,111,121,131))
> d2=data.frame(x3=c(1,2,3,4),x2=c(10,11,12,13))
> attach(d)
> attach(d2)
The following object is masked from d:
    x2
> x2
[1] 10 11 12 13

In the example above we created two data frames with a common variable x2. When we attach the second dataframe, x2 from that data frame is set on top of the search path. However, you can change this default behaviour by explicity specifying the position of the data frame in the search path

> attach(d2,pos=3L)
The following object is masked _by_ d:

    x2

> x2
[1] 101 111 121 131

The function
detach()
can be used to detach a data frame from the environment.

R expression inside a data frame environment with() and within()

with(data,expr,…) function allows creating an environment for the data frame (data) and then evaluates a given R expression (expr) within that environment. The parent of the created environment is the environment from which the call is made. All assignments made in the expression are in the local environment and not in the user’s workspace; the assignments are therefore not available in the user’s workspace once call completes. The function returns the value from the expression

# we first create the data frame.
d=data.frame(x1=c(1,2,3,4),x2=c(101,111,121,131))
# Next we evaluate the expression x3=x1*x2. 
# Note that by default R returns the value of the last variable and does not need an explicit return. 
> with(d,{x3=x1*x2;x3})
[1] 101 222 363 524
# variable x3 is not available in user's environment.
> x3
Error: object 'x3' not found
>

within()
function also creates a local environment from the data frame and evaluates an expression, however the difference is that it returns a modified copy of the data frame.

> d2=within(d,{x3=x1*x2})
# variable x3 is now added to the data frame.
# a copy of the data frame is returned
> str(d2)
'data.frame':	4 obs. of  3 variables:
 $ x1: num  1 2 3 4
 $ x2: num  101 111 121 131
 $ x3: num  101 222 363 524
# The original data frame is still available
> str(d)
'data.frame':	4 obs. of  2 variables:
 $ x1: num  1 2 3 4
 $ x2: num  101 111 121 131
# adding x3 to data frame fails if it cannot be added
> d2=within(d,{x3=str(x2)})
 num [1:4] 101 111 121 131
> str(d2)
'data.frame':	4 obs. of  2 variables:
 $ x1: num  1 2 3 4
 $ x2: num  101 111 121 131

modify/add a variable to a data frame – transform()

The transform function can be used to change a variable or add a new variable to a data frame. Note that its generally recommended to use the subsetting operations for modification and this function should be used with caution.

> d=data.frame(x1=c(1,2,3,4),x2=c(101,111,121,131))
# variable x1 is changed. A copy of the original data frame is returned.
> transform(d,x1=c(11,12,13,14))
  x1  x2
1 11 101
2 12 111
3 13 121
4 14 131
#We can also add a new variable
> transform(d,x4=c(11,12,13,14))
  x1  x2 x4
1  1 101 11
2  2 111 12
3  3 121 13
4  4 131 14

Subset of a data frame that meet specific criteria – subset()

The subset function can be used to get a subset of the data frame such that only certain rows based on a given criteria. Here’s an example

> d=data.frame(x1=c(1,2,3,4),x2=c(101,111,121,131))
# we create a subset of d with rows that have x1 less than 3
> subset(d,x1<3)
  x1  x2
1  1 101
2  2 111
# if we want only the second column
> subset(d,x1<3,select=x2)
   x2
1 101
2 111

For Character/factors we can use grep to subset rows. Note that for factors unused levels are not automatically removed

> d1=data.frame(x1=c("a","b","c","d"),x2=c(101,102,103,104))
> d1subset=subset(d1,grepl("[a,b]",x1))
> str(d1subset)
'data.frame':	2 obs. of  2 variables:
 $ x1: Factor w/ 4 levels "a","b","c","d": 1 2
 $ x2: num  101 102

Convert observations in multiple columns to multiple rows – reshape()

We look at the reshape function using an example. Consider an experiment that measures the tide height every 6 hours in a day (at 00:00, 6:00, 12:00 and 18:00 ). This is how we represent the data

> tides=data.frame(dates=c("01-01-2016","01-02-2016","01-03-2016"),
obs1=c(1,2,1),obs2=c(2,2,3),obs3=c(1,1,2),obs4=c(2,2,3))
> tides
       dates obs1 obs2 obs3 obs4
1 01-01-2016    1    2    1    2
2 01-02-2016    2    2    1    2
3 01-03-2016    1    3    2    3

It would be good to convert the observations to separate rows. i.e. we want one row per observation per date, which means there will be four rows for a day. We can use the reshape function to accomplish this.

>reshape(tides,direction="long",varying=c("obs1","obs2","obs3","obs4"),v.names="obs")
         dates time obs id
1.1 01-01-2016    1   1  1
2.1 01-02-2016    1   2  2
3.1 01-03-2016    1   1  3
1.2 01-01-2016    2   2  1
2.2 01-02-2016    2   2  2
3.2 01-03-2016    2   3  3
1.3 01-01-2016    3   1  1
2.3 01-02-2016    3   1  2
3.3 01-03-2016    3   2  3
1.4 01-01-2016    4   2  1
2.4 01-02-2016    4   2  2
3.4 01-03-2016    4   3  3

Here’s an explanation of the arguments to the function

the first argument is the data frame that contains multiple observations.

The four variables obs1-obs4 are time varying variables. They get converted to a single variable in the long format.

v.names gives the name of the variable in the long format that correspond to the time varying variables in the wide format

The wide format contains rows that have multiple time values (0,6,12 and 18). The wide format uses the variable called “time” to denote which time the record comes from (so 00:00 is 1, 06:00 is 2, 12:00 is 3 and 18:00 is 4). The first three rows contain time=1 which means all the three rows come from the time at 00:00

The wide format also needs to identify the id of the row. In our case id is the date. (01-01-2016 is assigned id 1, 01-02-2106 is assigned id 2 and so on)

What if you have 2 sets of variables? i.e. you want to create two columns in the long format and each of the two variables get values from some columns of the wide format. Lets see how its done:

# tides data frame has four variables. We want to create two columns in the long format. 
 tides=data.frame(dates=c("01-01-2016","01-02-2016","01-03-2016"),
 obsa1=c(1,2,1),obsa2=c(2,2,3),obsb1=c(1,1,2),obsb2=c(2,2,3))
 > tides
       dates obsa1 obsa2 obsb1 obsb2
1 01-01-2016     1     2     1     2
2 01-02-2016     2     2     1     2
3 01-03-2016     1     3     2     3

 # obsa gets data from obsa1 and obsa2. obsb gets data from obsb1 and obsb2
 > reshape(tides,direction="long",varying=list(c("obs1","obs2"),
 c("obs3","obs4")),v.names=c("obsa","obsb"))
         dates time obsa obsb id
1.1 01-01-2016    1    1    1  1
2.1 01-02-2016    1    2    1  2
3.1 01-03-2016    1    1    2  3
1.2 01-01-2016    2    2    2  1
2.2 01-02-2016    2    2    2  2
3.2 01-03-2016    2    3    3  3
# reshape is quite smart though. If we had named the variables in a better way,
#  we can do this
> tides=data.frame(dates=c("01-01-2016","01-02-2016","01-03-2016"),
obsa_1=c(1,2,1),obsa_2=c(2,2,3),obsb_1=c(1,1,2),obsb_2=c(2,2,3))
> reshape(tides,dir="long",varying=2:5,sep='_')
         dates time obsa obsb id
1.1 01-01-2016    1    1    1  1
2.1 01-02-2016    1    2    1  2
3.1 01-03-2016    1    1    2  3
1.2 01-01-2016    2    2    2  1
2.2 01-02-2016    2    2    2  2
3.2 01-03-2016    2    3    3  3
 
 # if you call reshape again on the reshaped data frame
 # you get the original data frame
> a=reshape(tides,dir="long",varying=2:5,sep='_')
> reshape(a)
         dates id obsa_1 obsb_1 obsa_2 obsb_2
1.1 01-01-2016  1      1      1      2      2
2.1 01-02-2016  2      2      1      2      2
3.1 01-03-2016  3      1      2      3      3

Merge Data frames using names of columns or rows – merge()

The last function that we will look in this tutorial is the merge function. It does a join operation on the data frame. Here’s an example

> x=data.frame(col1=c(1,2,3,4,5,6,7,8),col2=c("a","b","c","d","e","f","g","h"))
> y=data.frame(col1=c(11,12,13,14,16,17,18),col2=c("a","b","c","d","f","g","h"))
# merge or join the data frame x and y by column col2
> merge(x,y,by="col2")
  col2 col1.x col1.y
1    a      1     11
2    b      2     12
3    c      3     13
4    d      4     14
5    f      6     16
6    g      7     17
7    h      8     18

Merge has a lot of other features which we will look at in a later tutorial or blog entry

This completes this article on the data frame. We will look at some of the functions from other packages later on. In the next tutorial lets look at how to write looping structures in R.