Reshape R package, reshape2, melt and cast

In this tutorial we will look at Reshape R package, reshape2, melt and cast functions.

Reshape R package – Introduction and concepts

In this tutorial on data frame we saw the reshape function that can be used to convert multiple columns of a dataframe to multiple rows. In the current tutorial we look at the reshape package that was designed and developed by Hadley Wickham to overcome the shortcomings of the reshape and other related functions in base R. Before we look at the methods in the package, lets look at some concepts.

Why Reshape R Package?

The data obtained as a result of an experiment or study is generally different from data required as an input to the analytic functions. Generally the data from a study has one or more columns that uniquely identify a row followed by a number of columns that represent the values measured. The columns that uniquely identify the row can be thought of as composite key of a database column. For example, consider a data frame representing climate statistics. The unique columns identifying a row can be city name and month. (Each row gives monthly statistics for a particular city). The table can report multiple statistics such as min and max temperatures, max wind speed, max humidity and so on. Each statistic is a column in this table. In most cases we will need to convert this table so that it contains the id columns and ONE column representing the statistic. Here’s a diagrammatic representation of what we want.

reshape R package

Definitions of terms in reshape R package

Lets define some terms :

  • Identifier (id)

    – These variables can uniquely identify a row. In the example above city name and month are the identifiers for the first table and city name, month and Variable are the identifiers for the second table.

  • Measured Variables

    These are the variables or columns that contain the values. In the first table Min Temp, Max Temp and Max Wind are the measured variables. In the second table the column called “value” is the measured variable.

  • Melting

    – This is the process that produces table 2 from table 1.

  • Molten Table

    – A molten table has id columns (one of which is a column called ‘variable’) and one column called ‘value’.

Note that all the measured variables have to be of the same type (numeric, factor or date) since they will be saved in a single column called ‘value’ after melting.

melt() example

We haven’t yet looked at casting, but lets see an example of melting before looking at casting. We will use the airquality data frame in the ‘datasets’ package.

The data frame contains 6 variables and 153 observations. The variables Day and Month are the “id” variables and the others are the measured variables. We would like to melt this data frame. Here’s how it is done :

The function melt is use to melt the airquality data frame. The id variables are Month and day. The other variables are specified as measured variables. The resulting object contains 4 columns as expected. Lets look at 10 random rows from this object.

We notice that there are rows where the value is NA and we dont really need these rows. We can pass in na.rm=TRUE to remove NA values

The molten object now does not contain rows with NA values. There are only 568 rows now instead of 612 that we saw earlier. You can also omit measured.vars if you specify id.vars. You can also specify another name for the “variable” column

cast()

Once a data frame is molten you can then modify the molten data frame to whatever form that is required for analysis. This is known as casting. In its simplest form it gets back the original dataframe that was used in the melt function, however, that’s not much fun. Lets see why cast is so good. You can also pass in another argument to cast. This argument is a formula of type col_1+col_2+col_3~row_1+row_2+row_3. On the left side of the ~ operator you specify all variables that you want as columns and on the right side of ~ you specify all variables that you want as rows. If there are multiple variables on either side then separate them by +. Lets see how to get the original dataframe using a formula

We can also use the … operator. This operator instructs R to use all variables that have not been used in the formula.

Lets see what happens if we use only one of Month or Day

Here We specified only Month and not Day, The values that we get are for a particular month. Its quite obvious that we need to use a function that tell how to combine values for a month from multiple days. In our example we tell it to get a mean.

We can get a transposed version too.

Another useful feature is to use . (dot) which signifies no variables.

In the above example we get a mean for all months and Days. We can also do this

Of course, in this case it does not make sense to average out Ozone, Wind etc, but this just demonstrates what’s possible.

We can also create multi dimensional arrays. If you have two sets of hierarchies then this can help in create aggregates in multiple dimensions. For example if you had a data set that has sales with location hierarchy (state, cities etc) and time hierarchy (Months, days etc) then you can create arrays that give sales aggregated by state for a month and so on. Here an example

You can use the | operator to create lists (not all output is shown below)

You can create margins by setting margins=TRUE. Margins can be thought of as “Column totals” or “Row Totals” or “Grand Total”

Its also possible to pass in multiple functions (so far we have been using only mean)

I hope that at the end of this tutorial you are as amazed by this package as I am. Its truly quite powerful, but needless to say, be cautious while using it for huge amounts of data, for everything happens in memory.

Leave a Reply

Your email address will not be published. Required fields are marked *