Reshape R package, reshape2, melt and cast

In this tutorial we will look at Reshape R package, reshape2, melt and cast functions.

Reshape R package – Introduction and concepts

In this tutorial on data frame we saw the reshape function that can be used to convert multiple columns of a dataframe to multiple rows. In the current tutorial we look at the reshape package that was designed and developed by Hadley Wickham to overcome the shortcomings of the reshape and other related functions in base R. Before we look at the methods in the package, lets look at some concepts.

Why Reshape R Package?

The data obtained as a result of an experiment or study is generally different from data required as an input to the analytic functions. Generally the data from a study has one or more columns that uniquely identify a row followed by a number of columns that represent the values measured. The columns that uniquely identify the row can be thought of as composite key of a database column. For example, consider a data frame representing climate statistics. The unique columns identifying a row can be city name and month. (Each row gives monthly statistics for a particular city). The table can report multiple statistics such as min and max temperatures, max wind speed, max humidity and so on. Each statistic is a column in this table. In most cases we will need to convert this table so that it contains the id columns and ONE column representing the statistic. Here’s a diagrammatic representation of what we want.

reshape R package

Definitions of terms in reshape R package

Lets define some terms :

  • Identifier (id)

    – These variables can uniquely identify a row. In the example above city name and month are the identifiers for the first table and city name, month and Variable are the identifiers for the second table.

  • Measured Variables

    These are the variables or columns that contain the values. In the first table Min Temp, Max Temp and Max Wind are the measured variables. In the second table the column called “value” is the measured variable.

  • Melting

    – This is the process that produces table 2 from table 1.

  • Molten Table

    – A molten table has id columns (one of which is a column called ‘variable’) and one column called ‘value’.

Note that all the measured variables have to be of the same type (numeric, factor or date) since they will be saved in a single column called ‘value’ after melting.

melt() example

We haven’t yet looked at casting, but lets see an example of melting before looking at casting. We will use the airquality data frame in the ‘datasets’ package.

> library(datasets)
> str(airquality)
'data.frame':	153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

The data frame contains 6 variables and 153 observations. The variables Day and Month are the “id” variables and the others are the measured variables. We would like to melt this data frame. Here’s how it is done :

> molted = melt(airquality,id.vars=c("Month","Day"),measured.vars=c("Ozone","Solar.R","Wind","Temp"))
> str(molted)
'data.frame':	612 obs. of  4 variables:
 $ Month   : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ variable: Factor w/ 4 levels "Ozone","Solar.R",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  41 36 12 18 NA 28 23 19 8 NA ...

The function melt is use to melt the airquality data frame. The id variables are Month and day. The other variables are specified as measured variables. The resulting object contains 4 columns as expected. Lets look at 10 random rows from this object.

> molted[sample(nrow(molted),10),]
    Month Day variable value
32      6   1    Ozone    NA
353     6  16     Wind  14.9
548     7  28     Temp  88.0
85      7  24    Ozone  80.0
211     6  27  Solar.R  47.0
294     9  18  Solar.R  27.0
394     7  27     Wind  12.0
554     8   3     Temp  82.0
164     5  11  Solar.R    NA
3       5   3    Ozone  12.0

We notice that there are rows where the value is NA and we dont really need these rows. We can pass in na.rm=TRUE to remove NA values

> molted = melt(airquality,id.vars=c("Month","Day"),measured.vars=c("Ozone","Solar.R","Wind","Temp"),na.rm=TRUE)
> molted[sample(nrow(molted),10),]
    Month Day variable value
355     8   1     Wind   6.9
565     9  27     Temp  77.0
523     8  16     Temp  77.0
421     5   6     Temp  66.0
476     6  30     Temp  83.0
307     6  14     Wind  13.8
301     6   8     Wind   6.9
346     7  23     Wind  11.5
290     5  28     Wind  12.0
535     8  28     Temp  97.0
> str(molted)
'data.frame':	568 obs. of  4 variables:
 $ Month   : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day     : int  1 2 3 4 6 7 8 9 11 12 ...
 $ variable: Factor w/ 4 levels "Ozone","Solar.R",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  41 36 12 18 28 23 19 8 7 16 ...

The molten object now does not contain rows with NA values. There are only 568 rows now instead of 612 that we saw earlier. You can also omit measured.vars if you specify id.vars. You can also specify another name for the “variable” column

> molted = melt(airquality,id.vars=c("Month","Day"),measured.vars=c("Ozone","Solar.R","Wind"),variable_name="measured")
> str(molted)
'data.frame':	612 obs. of  4 variables:
 $ Month   : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ measured: Factor w/ 4 levels "Ozone","Solar.R",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  41 36 12 18 NA 28 23 19 8 NA ...

cast()

Once a data frame is molten you can then modify the molten data frame to whatever form that is required for analysis. This is known as casting. In its simplest form it gets back the original dataframe that was used in the melt function, however, that’s not much fun. Lets see why cast is so good. You can also pass in another argument to cast. This argument is a formula of type col_1+col_2+col_3~row_1+row_2+row_3. On the left side of the ~ operator you specify all variables that you want as columns and on the right side of ~ you specify all variables that you want as rows. If there are multiple variables on either side then separate them by +. Lets see how to get the original dataframe using a formula

> head(cast(molted,Month+Day~variable))
  Month Day Ozone Solar.R Wind Temp
1     5   1    41     190  7.4   67
2     5   2    36     118  8.0   72
3     5   3    12     149 12.6   74
4     5   4    18     313 11.5   62
5     5   5    NA      NA 14.3   56
6     5   6    28      NA 14.9   66

We can also use the … operator. This operator instructs R to use all variables that have not been used in the formula.

> head(cast(molted,...~variable))
  Month Day Ozone Solar.R Wind Temp
1     5   1    41     190  7.4   67
2     5   2    36     118  8.0   72
3     5   3    12     149 12.6   74

Lets see what happens if we use only one of Month or Day

> cast(molted,Month~variable,fun.aggregate=mean)
  Month    Ozone  Solar.R      Wind     Temp
1     5 23.61538 181.2963 11.622581 65.54839
2     6 29.44444 190.1667 10.266667 79.10000
3     7 59.11538 216.4839  8.941935 83.90323
4     8 59.96154 171.8571  8.793548 83.96774
5     9 31.44828 167.4333 10.180000 76.90000

Here We specified only Month and not Day, The values that we get are for a particular month. Its quite obvious that we need to use a function that tell how to combine values for a month from multiple days. In our example we tell it to get a mean.

We can get a transposed version too.

> cast(molted,variable~Month,fun.aggregate=mean)
  variable         5         6          7          8         9
1    Ozone  23.61538  29.44444  59.115385  59.961538  31.44828
2  Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
3     Wind  11.62258  10.26667   8.941935   8.793548  10.18000
4     Temp  65.54839  79.10000  83.903226  83.967742  76.90000

Another useful feature is to use . (dot) which signifies no variables.

> cast(molted,variable~.,fun.aggregate=mean)
  variable      (all)
1    Ozone  42.129310
2  Solar.R 185.931507
3     Wind   9.957516
4     Temp  77.882353

In the above example we get a mean for all months and Days. We can also do this

> cast(molted,Month~.,fun.aggregate=mean)
  Month    (all)
1     5 68.70696
2     6 87.38384
3     7 93.49748
4     8 79.71207
5     9 71.82689

Of course, in this case it does not make sense to average out Ozone, Wind etc, but this just demonstrates what’s possible.

We can also create multi dimensional arrays. If you have two sets of hierarchies then this can help in create aggregates in multiple dimensions. For example if you had a data set that has sales with location hierarchy (state, cities etc) and time hierarchy (Months, days etc) then you can create arrays that give sales aggregated by state for a month and so on. Here an example

> dim(cast(molted,Month~Day~variable,fun.aggregate=mean))
[1]  5 31  4

You can use the | operator to create lists (not all output is shown below)

> print(cast(molted,variable~Day|Month,fun.aggregate=mean))
$`5`
  variable     1   2     3     4    5    6     7    8    9    10   11    12
1    Ozone  41.0  36  12.0  18.0  NaN 28.0  23.0 19.0  8.0   NaN  7.0  16.0
2  Solar.R 190.0 118 149.0 313.0  NaN  NaN 299.0 99.0 19.0 194.0  NaN 256.0
3     Wind   7.4   8  12.6  11.5 14.3 14.9   8.6 13.8 20.1   8.6  6.9   9.7
4     Temp  67.0  72  74.0  62.0 56.0 66.0  65.0 59.0 61.0  69.0 74.0  69.0
....

$`6`
  variable     1     2     3     4     5     6     7     8     9    10    11
1    Ozone   NaN   NaN   NaN   NaN   NaN   NaN  29.0   NaN  71.0  39.0   NaN
2  Solar.R 286.0 287.0 242.0 186.0 220.0 264.0 127.0 273.0 291.0 323.0 259.0
3     Wind   8.6   9.7  16.1   9.2   8.6  14.3   9.7   6.9  13.8  11.5  10.9
4     Temp  78.0  74.0  67.0  84.0  85.0  79.0  82.0  87.0  90.0  87.0  93.0
..

You can create margins by setting margins=TRUE. Margins can be thought of as “Column totals” or “Row Totals” or “Grand Total”

> cast(molted,variable~Month,fun.aggregate=mean,margins=TRUE)
  variable         5         6          7          8         9      (all)
1    Ozone  23.61538  29.44444  59.115385  59.961538  31.44828  42.129310
2  Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333 185.931507
3     Wind  11.62258  10.26667   8.941935   8.793548  10.18000   9.957516
4     Temp  65.54839  79.10000  83.903226  83.967742  76.90000  77.882353
5    (all)  68.70696  87.38384  93.497479  79.712069  71.82689  80.057218

Its also possible to pass in multiple functions (so far we have been using only mean)

> cast(molted,variable~Month,fun.aggregate=c(mean,sum))
  variable    5_mean  5_sum    6_mean 6_sum     7_mean  7_sum     8_mean  8_sum
1    Ozone  23.61538  614.0  29.44444   265  59.115385 1537.0  59.961538 1559.0
2  Solar.R 181.29630 4895.0 190.16667  5705 216.483871 6711.0 171.857143 4812.0
3     Wind  11.62258  360.3  10.26667   308   8.941935  277.2   8.793548  272.6
4     Temp  65.54839 2032.0  79.10000  2373  83.903226 2601.0  83.967742 2603.0
     9_mean  9_sum
1  31.44828  912.0
2 167.43333 5023.0
3  10.18000  305.4
4  76.90000 2307.0

I hope that at the end of this tutorial you are as amazed by this package as I am. Its truly quite powerful, but needless to say, be cautious while using it for huge amounts of data, for everything happens in memory.

Leave a Reply

Your email address will not be published. Required fields are marked *