Some Useful Graphics

In this page we will have a look at some graphical displays that are useful for special types of data.

Trivariate Data

Generally we would need a three-dimensional graph to display three variables. But there are some useful ways to display such data that need only two dimensions.
Let's return to the Ethanol dataset. Previously we have ignored the compression ratio and only looked at the equivalence ratio. This was justified by a scatterplot of NOx by C, which shows no relationship between them, as well as the correlation coefficient of -0.03 which is highly non-significant (p=0.77), see ethanol1.fun(1). Now let's try and include C in our analysis.
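The check described above can be sketched as follows. This assumes that the ethanol data frame shipped with the lattice package (columns NOx, C and E) is the dataset used on this page; the body of ethanol1.fun is not shown here, so this is not necessarily how that routine does it.

```r
# Assumption: lattice's `ethanol` data frame is the Ethanol dataset of this page.
data(ethanol, package = "lattice")

# Scatterplot of NOx by the compression ratio C
plot(NOx ~ C, data = ethanol)

# Pearson correlation coefficient and the test of "correlation = 0"
ct <- cor.test(ethanol$NOx, ethanol$C)
print(ct$estimate)  # the correlation coefficient
print(ct$p.value)   # the p-value of the test
```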
How can we visualize the three variables NOx, C and E together? One way is an extension of the scatterplot called a scatterplot matrix. Here we draw the 6 pairwise scatterplots and arrange them in a "matrix". This is done using the command pairs in ethanol1.fun(2).

When we draw a scatterplot such as NOx vs. E it is often useful to add a local regression curve to the plot, say a loess curve. We can do the same in the pairs command by adding panel=panel.smooth, see ethanol1.fun(3).
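A minimal sketch of the two pairs calls, again assuming the ethanol data frame from the lattice package. Note that panel.smooth draws its curve with lowess, a close relative of loess.

```r
# Assumption: lattice's `ethanol` data frame is the Ethanol dataset of this page.
data(ethanol, package = "lattice")
vars <- ethanol[, c("NOx", "C", "E")]

# Plain scatterplot matrix
pairs(vars)

# Same matrix with a local regression (lowess) curve added to every panel
pairs(vars, panel = panel.smooth)
```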

Notice that the compression ratio C only takes 5 distinct values (7.5, 9.0, 12.0, 15.0 and 18.0) corresponding to the test engines available for this experiment. Another way to look at this dataset is as follows:

• Split the dataset into 5 parts, according to the 5 values of C
• Graph NOx vs E for each part, and add a loess fit.

This is done in ethanol1.fun(4). If there is any interaction between C and E we should see different loess fits, and indeed the first and the last look somewhat different.
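The two steps above might be coded like this. It is only a sketch: the span and degree of the loess fit are arbitrary choices made here, and the data is assumed to be the ethanol data frame from lattice.

```r
# Assumption: lattice's `ethanol` data frame is the Ethanol dataset of this page.
data(ethanol, package = "lattice")

# Step 1: split the data into one part per distinct value of C
parts <- split(ethanol, ethanol$C)

# Step 2: one scatterplot of NOx vs E per part, each with a loess curve
op <- par(mfrow = c(2, 3))
for (cval in names(parts)) {
  d <- parts[[cval]]
  # span = 1 and degree = 2 are illustrative choices for these small groups
  scatter.smooth(d$E, d$NOx, span = 1, degree = 2,
                 xlab = "E", ylab = "NOx", main = paste("C =", cval))
}
par(op)
```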

This type of graph is called a "coplot", short for conditioning plot: each panel shows the scatterplot of y vs x for a given value (or range of values) of z.

One problem here is that each "sub"graph is based on only a few observations (22, 17, 14, 19 and 16 to be precise). We can work on this by grouping together neighboring C values, see ethanol1.fun(5).

Actually, we don't need to do all this "by hand", R has the routine coplot to do the work for us. In ethanol1.fun(6) we draw the graph with the default choices.

A little bit of care is needed to properly read this graph. The panel that corresponds to the smallest range of C is the one on the lower left, then comes the one on its right etc.

Clearly by default coplot groups the data much more than we did above. This is controlled by the argument "overlap", whose default of 0.5 means that neighboring panels share roughly 50% of their observations.

We can "recover" the first of our coplots using the given.values argument, see ethanol1.fun(7).
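A sketch of the coplot calls discussed above, assuming the ethanol data frame from lattice. The interval half-width of 0.1 around each C value is an arbitrary choice made here so that every interval contains exactly one of the five values.

```r
# Assumption: lattice's `ethanol` data frame is the Ethanol dataset of this page.
data(ethanol, package = "lattice")

# Default coplot of NOx vs E given C, with a smooth curve in each panel
coplot(NOx ~ E | C, data = ethanol, panel = panel.smooth)

# The conditioning intervals that the defaults (number = 6, overlap = 0.5) produce
default_ints <- co.intervals(ethanol$C, number = 6, overlap = 0.5)
print(default_ints)

# One narrow interval per distinct C value: a two-column matrix (lower, upper)
c_vals <- sort(unique(ethanol$C))
ints <- cbind(c_vals - 0.1, c_vals + 0.1)
coplot(NOx ~ E | C, data = ethanol, given.values = ints, panel = panel.smooth)
```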

Of course we can switch the role of C and E, that is we can condition on E and use C in the panels, see ethanol1.fun(8). Here we use 9 panels (number=9), arrange them into two rows (row=2), and add the loess fit with span=2.
In this graph we find something interesting: while in all panels the relationship appears to be linear, for low values of E the slope of the line is positive whereas it becomes 0 as E gets larger.
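Conditioning on E instead might look like the sketch below (ethanol data from lattice assumed). The smoother here is panel.smooth, which uses lowess, so it is a stand-in for whatever fit ethanol1.fun(8) actually draws; rows is the name coplot documents for the number of panel rows.

```r
# Assumption: lattice's `ethanol` data frame is the Ethanol dataset of this page.
data(ethanol, package = "lattice")

# The 9 overlapping E intervals that coplot will use with these settings
e_ints <- co.intervals(ethanol$E, number = 9, overlap = 0.5)
print(e_ints)

# NOx vs C given E: 9 panels in 2 rows, with a very smooth curve (span = 2)
coplot(NOx ~ C | E, data = ethanol, number = 9, rows = 2,
       panel = function(x, y, ...) panel.smooth(x, y, span = 2, ...))
```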

We can also use the coplot to visualize the fit. In ethanol1.fun(9) we have the coplot of the loess fit g(C,E) to the ethanol data. In the (1,1) panel (lower left) E has been set to a specific value, 0.535. Then g(C,0.535) has been evaluated for equally spaced values of C, and g(C,0.535) is graphed in the panel.

Again we see that in each panel the fit appears quite linear, but with different slopes. So we should be able to use a fit that is linear in C for any given value of E but nonlinear in E. We can do this by using the loess command with the arguments parametric = "C", drop.square = "C", see ethanol1.fun(10).
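A sketch of such a conditionally parametric fit, assuming the ethanol data frame from lattice and leaving span and degree at the loess defaults (the settings used in ethanol1.fun(10) are not shown on this page). The fitted surface is then evaluated along C at the smallest observed E, as in the lower left panel described earlier.

```r
# Assumption: lattice's `ethanol` data frame is the Ethanol dataset of this page.
data(ethanol, package = "lattice")

# Fit that is linear in C for any fixed E, but smooth (nonparametric) in E
fit <- loess(NOx ~ C * E, data = ethanol,
             parametric = "C", drop.square = "C")

# Evaluate g(C, e0) on an equally spaced grid of C values at a fixed E
e0 <- min(ethanol$E)
c_grid <- seq(min(ethanol$C), max(ethanol$C), length.out = 50)
g <- predict(fit, newdata = data.frame(C = c_grid, E = e0))

plot(c_grid, g, type = "l", xlab = "C", ylab = "fitted NOx")
```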

Trellis Graphics

R also has a suite of advanced graphics called trellis graphics, implemented in the library lattice. The function corresponding to pairs is called splom and is used to draw the scatterplot matrix in trellis.fun(1). Trellis graphics use the formula notation we have seen many times now in R: y~x. In this case, though, there is no response y because in the scatterplot matrix all variables are treated equally. So we use this notation in a new way: ~ethanol
Notice also that trellis graphics are done a little differently (at least when used inside of a function): first we generate the graph and save the result in a trellis object (here trellispic), and then we actually draw the graph using print(trellispic).
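The two-step pattern (build the trellis object, then print it) can be sketched like this; trellispic is the object name used in the text.

```r
library(lattice)  # also makes the ethanol data frame available

# Step 1: build the scatterplot matrix and store it as a trellis object
trellispic <- splom(~ethanol)

# Step 2: actually draw it; inside a function this explicit print is required
print(trellispic)
```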

The trellis function that does the scatterplot is called xyplot and is used in trellis.fun(2) to draw the scatterplot of NOx by E. xyplot also does the coplot, and we draw the coplot of NOx by E conditional on C in trellis.fun(3).
Say we want to turn this around and condition on the continuous variable E. In trellis.fun(4) this is done, and we see it does not work that easily. We need a new way to specify the intervals for the conditioning variable E. The best way to do this is to use the equal.count function. Besides the variable itself, it takes as its arguments the number of intervals we want (say 9) and the overlap (say 0.25). The result is a "shingle" object which when printed looks like a 9 by 2 matrix with the endpoints of the intervals. This is used in trellis.fun(5).
Notice the darker area in the bar above the panels. It identifies the interval of E values used in the panel.
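A minimal sketch of conditioning on a shingle. The name EE is chosen here for illustration and need not match the one used in trellis.fun(5).

```r
library(lattice)  # provides equal.count, xyplot and the ethanol data

# 9 intervals of E with (roughly) equal counts, neighbors overlapping by 25%
EE <- equal.count(ethanol$E, number = 9, overlap = 0.25)
print(levels(EE))  # the endpoints of the 9 intervals

# Coplot of NOx vs C, one panel per interval of E
pic <- xyplot(NOx ~ C | EE, data = ethanol)
print(pic)
```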
Say we want to arrange the panels in the same way we did in ethanol1.fun(8) using two rows and 5 columns. In trellis.fun(6) this is done using the argument layout=c(5,2).
In ethanol1.fun(8) we added the loess curve to each panel. In trellis graphics this is achieved using the panel argument, usually together with the prepanel argument, which adjusts the axis limits to the fitted curve. Because this mechanism is much more powerful than the panel argument in coplot it (unfortunately) is also a lot more complicated. See trellis.fun(7) for details.
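One way this could look is sketched below. The span of 1 is an arbitrary choice made here; trellis.fun(7) may use different settings.

```r
library(lattice)

EE <- equal.count(ethanol$E, number = 9, overlap = 0.25)

pic <- xyplot(NOx ~ C | EE, data = ethanol,
              layout = c(5, 2),  # 5 columns, 2 rows
              # prepanel: make sure the axis limits leave room for the curve
              prepanel = function(x, y) prepanel.loess(x, y, span = 1),
              # panel: draw the points first, then the loess curve on top
              panel = function(x, y) {
                panel.xyplot(x, y)
                panel.loess(x, y, span = 1)
              })
print(pic)
```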
The last difference between this coplot and the one drawn by ethanol1.fun(8) is that there we had a separate panel on top identifying the shingles. If we wished we could do the same here, see trellis.fun(8), but in fact I like it better the way it was.

Bivariate Data

In all the datasets so far it made sense to use the predictor-response paradigm, that is to think of one variable as the response which we want to model using the predictor variables. In other words we used the variables asymmetrically. In this section we will have a quick look at data where the two variables should be treated equally. As an example consider the Weather in New York dataset, specifically the data on wind speed and temperature. The goal here is not to determine how the variation in one explains the variation in the other, but to see how wind speed and temperature vary together.
As a first try we can certainly look at the scatterplot, where we arbitrarily put wind speed on the y axis, see env.fun(1).
In env.fun(2) we draw the boxplots, which give some indication that both variables have a normal distribution. This can be checked by a look at the normal quantile plots, see env.fun(3). The temperature appears to be slightly "platykurtic", that is the normal plot has an S-shape indicating a distribution with short tails such as a uniform. The wind speed appears to be skewed a little towards large values, see also the couple of dots higher up in the scatterplot.
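A sketch of these diagnostic plots. It assumes that the environmental data frame in the lattice package (New York measurements of ozone, radiation, temperature and wind) matches the Weather in New York dataset used here, which this page does not confirm.

```r
# Assumption: lattice's `environmental` data frame stands in for the
# Weather in New York dataset of this page.
data(environmental, package = "lattice")

op <- par(mfrow = c(2, 2))
# Boxplots of the two variables
boxplot(environmental$wind, main = "Wind speed")
boxplot(environmental$temperature, main = "Temperature")
# Normal quantile plots with a reference line
qqnorm(environmental$wind, main = "Wind speed")
qqline(environmental$wind)
qqnorm(environmental$temperature, main = "Temperature")
qqline(environmental$temperature)
par(op)
```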
Next we return to the scatterplot and add a local regression fit (loess). Now, though, it makes no more sense to look at loess(wind~temperature) than at loess(temperature~wind), and so we just do both! (env.fun(4)).
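Drawing both fits in one scatterplot could be sketched as follows, again assuming lattice's environmental data frame as a stand-in for the dataset and using the loess defaults.

```r
# Assumption: lattice's `environmental` data frame stands in for the
# Weather in New York dataset of this page.
data(environmental, package = "lattice")

# The two local regression fits, one in each direction
fit_wt <- loess(wind ~ temperature, data = environmental)
fit_tw <- loess(temperature ~ wind, data = environmental)

# Evaluate each fit on a grid of its own predictor
t_grid <- seq(min(environmental$temperature), max(environmental$temperature),
              length.out = 100)
w_grid <- seq(min(environmental$wind), max(environmental$wind),
              length.out = 100)
wind_hat <- predict(fit_wt, data.frame(temperature = t_grid))
temp_hat <- predict(fit_tw, data.frame(wind = w_grid))

plot(wind ~ temperature, data = environmental)
lines(t_grid, wind_hat)            # wind as a function of temperature
lines(temp_hat, w_grid, lty = 2)   # temperature as a function of wind
```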

Bivariate Data with Equal Units

Consider the dataset Ozone levels. Here the two variables are measurements of the ozone levels at two different locations, so the variables have the same units.
As a first look we should of course draw the scatterplot, but because the two variables have the same units it makes sense to change two things from the default graph:
• make the range of the two axes the same
• add a line at 45°
This is done in ozone.fun(1). We see that the two levels are different, especially on days with higher ozone levels.
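The two changes listed above can be sketched like this. The ozone measurements are not reproduced on this page, so the vectors yonkers and stamford below are simulated, purely hypothetical values used only to make the example runnable.

```r
# Hypothetical data: simulated stand-ins for the two ozone series (in ppb)
set.seed(1)
yonkers  <- runif(100, 10, 120)
stamford <- 1.6 * yonkers * 10^rnorm(100, sd = 0.08)

# Same range on both axes, and equal scaling of the two axes
lims <- range(c(yonkers, stamford))
plot(yonkers, stamford, xlim = lims, ylim = lims, asp = 1)

# 45 degree line: points above it have stamford > yonkers
abline(a = 0, b = 1)
```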
As another graph useful for this type of data we can draw Tukey's Mean-Difference plot. This graphs the difference of each pair of observations against their mean, see ozone.fun(2). The graph is made a little more useful by adding a loess fit and a horizontal line at 0. In ozone.fun(3) we use the trellis function tmd to draw the graph with these lines added.
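A sketch of the mean-difference plot, once by hand and once with tmd. The data is simulated (hypothetical), and the custom panel function is one possible way to get the extra lines, not necessarily the one used in ozone.fun(3).

```r
library(lattice)

# Hypothetical data: simulated stand-ins for the two ozone series (in ppb)
set.seed(1)
yonkers  <- runif(100, 10, 120)
stamford <- 1.6 * yonkers * 10^rnorm(100, sd = 0.08)

# By hand: difference of each pair against the mean of the pair
md_mean <- (stamford + yonkers) / 2
md_diff <- stamford - yonkers
plot(md_mean, md_diff, xlab = "mean", ylab = "difference")
abline(h = 0, lty = 2)                 # horizontal line at 0
fit <- loess(md_diff ~ md_mean)        # loess fit of difference on mean
o <- order(md_mean)
lines(md_mean[o], fitted(fit)[o])

# With lattice: tmd transforms an xyplot object into its m-d plot
pic <- tmd(xyplot(stamford ~ yonkers),
           panel = function(x, y, ...) {
             panel.xyplot(x, y)
             panel.abline(h = 0, lty = 2)
             panel.loess(x, y)
           })
print(pic)
```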
The usefulness of Tukey's m-d plot lies in the fact that it concentrates on the differences between the two variables. It is clear that concentrations at Stamford are higher than at Yonkers: on all but 10 days the concentration at Stamford exceeds that at Yonkers. The pollution is of course caused by New York City, and so it is at first surprising that it should be higher at Stamford, which is 20 miles farther from New York City than Yonkers. The reason is that Stamford lies downwind from New York City on most days.
A further analysis of this dataset would be easy if the relationship between the ozone levels were additive, that is if Stamford had a higher concentration across the board. ozone.fun(3) shows an artificial dataset where this is the case. In this situation a model such as m_Stamford = m_Yonkers + a would be appropriate.
Here it is a little more complicated. Instead of additive the relationship appears to be multiplicative, that is we have m_Stamford = r × m_Yonkers. But by taking logs we can turn this into an additive model: log(m_Stamford) = log(m_Yonkers) + log(r). In ozone.fun(6) we plot the m-d graph based on logs, and we see that things are much better, although there is still some upward movement in the graph. This effect is very small: at low concentrations, Stamford exceeds Yonkers by 0.15 log10 ppb, which is a factor of 10^0.15=1.4. At high concentrations the difference is 0.25 log10 ppb, which is a factor of 10^0.25=1.8.
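The log-scale version of the m-d plot and the back-transformation of the differences can be sketched as follows. The data is again simulated (hypothetical); only the two conversions at the end use numbers from the text.

```r
# Hypothetical data: simulated stand-ins for the two ozone series (in ppb)
set.seed(1)
yonkers  <- runif(100, 10, 120)
stamford <- 1.6 * yonkers * 10^rnorm(100, sd = 0.08)

# Mean-difference plot on the log10 scale
log_mean <- (log10(stamford) + log10(yonkers)) / 2
log_diff <- log10(stamford) - log10(yonkers)
plot(log_mean, log_diff, xlab = "mean of log10", ylab = "difference of log10")
abline(h = 0, lty = 2)

# A difference d on the log10 scale corresponds to a ratio of 10^d
ratio_low  <- 10^0.15   # about 1.4
ratio_high <- 10^0.25   # about 1.8
print(c(ratio_low, ratio_high))
```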

Notice we are using log10, which is more common than natural log.

Now that we know the relationship is (almost) multiplicative we can also fit this model parametrically. Of course we want the slope to be 1, so we need to fit the model log(Stamford)=b+log(Yonkers), or log(Stamford)-log(Yonkers)=b, which is done in ozone.fun(7).

Using lm to do the fit is of course a little bit of overkill; we could have just found the mean of the differences, see ozone.fun(7) again.

We find b=0.2, or r=10^0.2=1.6. In ozone.fun(8) we draw the fitted line graph, which seems quite good.
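The equivalence between the lm fit and the simple mean can be sketched like this. The data is simulated (hypothetical), so the estimate printed here will not be the b=0.2 found for the real data; the offset version is shown only as one way to force the slope to be 1.

```r
# Hypothetical data: simulated stand-ins for the two ozone series (in ppb)
set.seed(1)
yonkers  <- runif(100, 10, 120)
stamford <- 1.6 * yonkers * 10^rnorm(100, sd = 0.08)

# Differences on the log10 scale
d <- log10(stamford) - log10(yonkers)

# Intercept-only model for the differences: log(Stamford) - log(Yonkers) = b
fit <- lm(d ~ 1)
b_lm <- coef(fit)[[1]]

# Same model written with the slope of log10(yonkers) fixed at 1 via an offset
fit_offset <- lm(log10(stamford) ~ 1 + offset(log10(yonkers)))
b_offset <- coef(fit_offset)[[1]]

# The "overkill-free" version: just the mean of the differences
b_mean <- mean(d)

# Back to the original scale: Stamford is about r times Yonkers
r <- 10^b_mean
print(c(b_lm = b_lm, b_offset = b_offset, b_mean = b_mean, r = r))
```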