Data Structures in R, Data Frame Operations

In the previous tutorial we saw an introduction to factors and data frames in R. In this tutorial we look at some more functions that can be used on data frames.

Multiple columns into single column – stack()

The first function that we look at today is the stack() function. It combines multiple columns of a data frame into a single column. The stack operation applies only to vector columns; non-vector columns such as factors are ignored. When the selected columns hold different types of vectors, the “unlist” function determines the type of the resulting column, using the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.

The stack function produces a data frame with two columns. The first column (values) contains the values from all the vectors, and the second column (ind) shows which vector each value came from. The unstack function reverses the operation.
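As a minimal sketch of stack() and unstack(), the snippet below uses a small made-up data frame (the name scores and its values are assumptions for illustration only).

```r
# Hypothetical data: two numeric columns of equal length
scores <- data.frame(art = c(60, 75, 95), math = c(90, 85, 70))

# stack() returns a data frame with the columns 'values' and 'ind'
stacked <- stack(scores)
print(stacked)

# unstack() rebuilds one column per level of 'ind'
restored <- unstack(stacked)
print(restored)
```

Because both columns have the same length, unstack() can return a data frame again rather than a list.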

Access data frame variables directly – attach()

Accessing a variable as DataFrameName$VariableName every time takes considerable effort. Wouldn’t it be helpful if we could access the variable by name directly? This can be done by using the attach() function.
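A short sketch of attach() in action, assuming a made-up data frame called people (the name and values are illustrative, not from the original example):

```r
# Hypothetical data frame used only for illustration
people <- data.frame(age = c(23, 35, 41), height_cm = c(160, 175, 182))

# The usual, fully qualified access
mean(people$age)

# After attach(), the columns are visible by name
attach(people)
avg_age <- mean(age)
detach(people)   # remove the data frame from the search path when done

print(avg_age)
```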

So how does it all work? When a data frame is attached, R puts it on the search path. That means when we refer to a variable, R searches not only the workspace but also the columns of the attached data frame. If a variable of that name already exists in the workspace, R reports that the attached column is masked, and the workspace variable is the one that is found first.

If you attach multiple data frames, then by default the data frame attached last is placed ahead of the others on the search path (position 2, directly after the workspace).
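To make the ordering concrete, here is a small sketch with two made-up data frames, df1 and df2, that share a variable x2 (names and values are assumptions for illustration):

```r
# Two hypothetical data frames with a common variable x2
df1 <- data.frame(x1 = 1:3, x2 = c(10, 20, 30))
df2 <- data.frame(x2 = c(100, 200, 300), x3 = 4:6)

attach(df1)
attach(df2)                # R reports that x2 from df1 is now masked

top_two <- search()[2:3]   # the two most recently attached entries
top_x2  <- x2              # resolves to the x2 of df2

detach(df2)
detach(df1)

print(top_two)
print(top_x2)
```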

Suppose we create two data frames with a common variable x2. When we attach the second data frame, x2 from that data frame is placed ahead of the first one on the search path and masks it. However, you can change this default behaviour by explicitly specifying the position of the data frame in the search path with the pos argument.
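The following sketch shows the pos argument of attach(), again with two made-up data frames that share x2. Attaching the second one at position 3 places it below the first, so x2 still resolves to the first data frame.

```r
# Two hypothetical data frames with a common variable x2
df1 <- data.frame(x1 = 1:3, x2 = c(10, 20, 30))
df2 <- data.frame(x2 = c(100, 200, 300), x3 = 4:6)

attach(df1)             # goes to position 2 by default
attach(df2, pos = 3)    # placed below df1 instead of ahead of it

x2_seen <- x2           # resolves to the x2 of df1
x3_seen <- x3           # x3 exists only in df2, so it is still found

detach(df2)
detach(df1)
```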

The detach() function can be used to remove an attached data frame from the search path.

R expression inside a data frame environment – with() and within()

The with(data, expr, …) function creates an environment from the data frame (data) and then evaluates a given R expression (expr) within that environment. The parent of the created environment is the environment from which the call is made. All assignments made in the expression happen in the local environment and not in the user’s workspace, so they are not available once the call completes. The function returns the value of the expression.
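A minimal sketch of with(), using a made-up data frame named sales (an assumption for illustration). The intermediate variable exists only while the expression is evaluated.

```r
# Hypothetical data frame used only for illustration
sales <- data.frame(price = c(2.5, 4, 1.5), qty = c(10, 3, 8))

revenue <- with(sales, {
  line_total <- price * qty   # local to the with() call
  sum(line_total)             # value of the last expression is returned
})

# The local assignment did not reach the workspace
leaked <- exists("line_total")

print(revenue)
print(leaked)
```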

The within() function also creates a local environment from the data frame and evaluates an expression. The difference is that it returns a modified copy of the data frame, in which the assignments made by the expression appear as columns.
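For comparison, here is a sketch of within() on the same kind of made-up data (the sales data frame is an assumption for illustration). The original data frame is left unchanged.

```r
# Hypothetical data frame used only for illustration
sales <- data.frame(price = c(2.5, 4, 1.5), qty = c(10, 3, 8))

# The assignment inside the braces becomes a new column in the copy
sales2 <- within(sales, {
  line_total <- price * qty
})

print(names(sales))    # original columns only
print(sales2)
```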

Modify/add a variable to a data frame – transform()

The transform function can be used to change a variable or add a new variable to a data frame. Note that it is a convenience function meant for interactive use; in programs it’s generally recommended to use the subsetting operations for modification, so this function should be used with caution.
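A small sketch of transform() next to the equivalent subsetting assignment, using a made-up data frame called temps (an assumption for illustration):

```r
# Hypothetical data frame used only for illustration
temps <- data.frame(city = c("A", "B"), celsius = c(0, 100))

# Add a new column with transform(); a modified copy is returned
temps2 <- transform(temps, fahrenheit = celsius * 9 / 5 + 32)

# The same result using ordinary subsetting assignment
temps3 <- temps
temps3$fahrenheit <- temps3$celsius * 9 / 5 + 32

print(temps2)
```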

Subset of a data frame that meet specific criteria – subset()

The subset function returns a subset of the data frame that contains only the rows meeting a given criterion.
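As an illustration, the sketch below filters a made-up data frame (the people data and the age threshold are assumptions). The optional select argument also restricts the columns that are returned.

```r
# Hypothetical data frame used only for illustration
people <- data.frame(
  name = c("Ann", "Bob", "Cid", "Dee"),
  age  = c(23, 35, 41, 29),
  city = c("Oslo", "Rome", "Oslo", "Lima")
)

# Keep only the rows where age is above 30
over_30 <- subset(people, age > 30)

# Same filter, but keep only two of the columns
over_30_small <- subset(people, age > 30, select = c(name, age))

print(over_30)
print(over_30_small)
```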

For character or factor columns we can use grep to select rows by pattern. Note that for factors, unused levels are not automatically removed from the result; droplevels() can be used to drop them.
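A sketch of pattern-based row selection on a factor column, using a made-up fruit data frame (an assumption for illustration). grep() returns row indices, so it is used with bracket indexing here; inside subset(), which expects a logical condition, grepl() is the matching choice.

```r
# Hypothetical data frame with a factor column
fruit <- data.frame(
  name = factor(c("apple", "apricot", "banana", "cherry")),
  qty  = c(3, 5, 2, 7)
)

# Rows whose name starts with "a", via grep() and bracket indexing
a_rows <- fruit[grep("^a", fruit$name), ]

# The same rows via subset() with a logical condition
a_rows_subset <- subset(fruit, grepl("^a", name))

levels_before <- nlevels(a_rows$name)   # unused levels are still present
a_rows <- droplevels(a_rows)            # drop the unused levels
levels_after <- nlevels(a_rows$name)

print(a_rows)
```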

Convert observations in multiple columns to multiple rows – reshape()

We look at the reshape function using an example. Consider an experiment that measures the tide height every 6 hours in a day (at 00:00, 06:00, 12:00 and 18:00). We represent the data in wide format: one row per date, with the four readings stored in the columns obs1 to obs4.

It would be useful to convert the observations to separate rows, i.e. one row per observation per date, which means there will be four rows for a day. We can use the reshape function to accomplish this.
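Here is a minimal sketch of such a call. The tide heights are made-up numbers, and the long-format column name height is an assumption chosen for illustration; timevar and idvar are spelled out with their default values to make the call easier to read.

```r
# Hypothetical wide-format data: one row per date, four readings per row
tides <- data.frame(
  date = c("01-01-2016", "01-02-2016", "01-03-2016"),
  obs1 = c(1.2, 1.4, 1.1),   # 00:00
  obs2 = c(2.8, 3.0, 2.7),   # 06:00
  obs3 = c(1.0, 1.3, 0.9),   # 12:00
  obs4 = c(2.9, 3.1, 2.6)    # 18:00
)

tides_long <- reshape(
  tides,
  varying   = c("obs1", "obs2", "obs3", "obs4"),  # time-varying columns
  v.names   = "height",   # name of the combined column in long format
  timevar   = "time",     # which reading (1 to 4) a row comes from
  idvar     = "id",       # which wide row a record comes from
  direction = "long"
)

print(tides_long)
```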

Here’s an explanation of the arguments to the function:

The first argument (data) is the data frame in wide format, where each row contains multiple observations.

The four variables obs1-obs4 are the time-varying variables, named in the varying argument. They get converted to a single variable in the long format.

v.names gives the name of the variable in the long format that corresponds to the time-varying variables in the wide format.

Each row of the wide format holds readings for multiple times (0, 6, 12 and 18). The long format uses a variable, called “time” by default (the timevar argument), to denote which time a record comes from (so 00:00 is 1, 06:00 is 2, 12:00 is 3 and 18:00 is 4). The output is ordered by time, so with three dates the first three rows have time=1, which means all three come from the reading at 00:00.

The long format also needs to identify which wide-format row each record came from (the idvar argument). In our case the id corresponds to the date (01-01-2016 is assigned id 1, 01-02-2016 is assigned id 2 and so on).

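What if you have two sets of variables? That is, you want to create two columns in the long format, and each of the two gets its values from some columns of the wide format. In that case, pass varying as a list with one vector of column names per long-format variable, and give v.names one name per set.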
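A sketch of the two-variable case, assuming a made-up data frame that records a tide height and a water temperature twice per day (all names and values are illustrative). Passing varying as a list keeps each set of columns explicitly tied to its long-format name.

```r
# Hypothetical wide-format data with two sets of time-varying columns
tides2 <- data.frame(
  date    = c("01-01-2016", "01-02-2016"),
  height1 = c(1.2, 1.4),
  height2 = c(2.8, 3.0),
  temp1   = c(11, 12),
  temp2   = c(15, 16)
)

long2 <- reshape(
  tides2,
  varying   = list(c("height1", "height2"), c("temp1", "temp2")),
  v.names   = c("height", "temp"),  # one name per element of 'varying'
  timevar   = "time",
  idvar     = "id",
  direction = "long"
)

print(long2)
```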

Merge Data frames using names of columns or rows – merge()

The last function that we will look at in this tutorial is the merge function. It performs a join operation on two data frames, matching rows on common columns or on row names.
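A brief sketch of merge() with two made-up data frames that share an emp_id column (names and values are assumptions for illustration). By default only matching rows are kept; all.x = TRUE keeps every row of the first data frame.

```r
# Two hypothetical data frames sharing the key column emp_id
employees <- data.frame(emp_id = c(1, 2, 3), name = c("Ann", "Bob", "Cid"))
salaries  <- data.frame(emp_id = c(2, 3, 4), salary = c(50000, 62000, 48000))

# Inner join: only ids present in both data frames
inner <- merge(employees, salaries, by = "emp_id")

# Left join: every employee, with NA where no salary is known
left <- merge(employees, salaries, by = "emp_id", all.x = TRUE)

print(inner)
print(left)
```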

Merge has a lot of other features, which we will look at in a later tutorial or blog entry.

This completes this article on the data frame. We will look at some of the functions from other packages later on. In the next tutorial, let’s look at how to write looping structures in R.
