--- title: 'High Dimensional Data Visualization' author: "Wayne Oldford and Zehao Xu" date: "`r Sys.Date()`" bibliography: references.bib fontsize: 12pt link-citations: yes linkcolor: blue output: rmarkdown::html_vignette: toc: true geometry: margin=.75in urlcolor: blue graphics: yes vignette: > %\VignetteIndexEntry{High Dimensional Data Visualization} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} header-includes: - \usepackage{graphicx} - \usepackage{epic} - \usepackage{color} - \usepackage{hyperref} - \usepackage{multimedia} - \PassOptionsToPackage{pdfmark}{hyperref}\RequirePackage{hyperref} - \newcommand{\code}[1]{\texttt{#1}} - \newcommand{\ve}[1]{\mathbf{#1}} - \newcommand{\pop}[1]{\mathcal{#1}} - \newcommand{\samp}[1]{\mathcal{#1}} - \newcommand{\subspace}[1]{\mathcal{#1}} - \newcommand{\sv}[1]{\boldsymbol{#1}} - \newcommand{\sm}[1]{\boldsymbol{#1}} - \newcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}} - \newcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert} - \newcommand{\size}[1]{\left\lvert {#1} \right\rvert} - \newcommand{\norm}[1]{\left|\left|{#1}\right|\right|} - \newcommand{\field}[1]{\mathbb{#1}} - \newcommand{\Reals}{\field{R}} - \newcommand{\Integers}{\field{Z}} - \newcommand{\Naturals}{\field{N}} - \newcommand{\Complex}{\field{C}} - \newcommand{\Rationals}{\field{Q}} - \newcommand{\widebar}[1]{\overline{#1}} - \newcommand{\wig}[1]{\tilde{#1}} - \newcommand{\bigwig}[1]{\widetilde{#1}} - \newcommand{\leftgiven}{~\left\lvert~} - \newcommand{\given}{~\vert~} - \newcommand{\indep}{\bot\hspace{-.6em}\bot} - \newcommand{\notindep}{\bot\hspace{-.6em}\bot\hspace{-0.75em}/\hspace{.4em}} - \newcommand{\depend}{\Join} - \newcommand{\notdepend}{\Join\hspace{-0.9 em}/\hspace{.4em}} - \newcommand{\imply}{\Longrightarrow} - \newcommand{\notimply}{\Longrightarrow \hspace{-1.5em}/ \hspace{0.8em}} - \newcommand*{\intersect}{\cap} - \newcommand*{\union}{\cup} - \DeclareMathOperator*{\argmin}{arg\,min} - \DeclareMathOperator*{\argmax}{arg\,max} - \DeclareMathOperator*{\Ave}{Ave\,} - \newcommand{\permpause}{\pause} - \newcommand{\suchthat}{~:~} - \newcommand{\st}{~:~} --- ```{r setup, include=FALSE, warning=FALSE, message=FALSE} knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.align = "center", fig.width = 7, fig.height = 5, out.width = "60%", collapse = TRUE, comment = "#>", tidy.opts = list(width.cutoff = 65), tidy = FALSE) library(knitr) set.seed(12314159) imageDirectory <- "./images/highDim" dataDirectory <- "./data/highDim" path_concat <- function(path1, ..., sep="/") { # The "/" is standard unix directory separator and so will # work on Macs and Linux. # In windows the separator might have to be sep = "\" or # even sep = "\\" or possibly something else. paste(path1, ..., sep = sep) } library(ggplot2, quietly = TRUE) library(dplyr, quietly = TRUE) ``` ## Serialaxes coordinate Serial axes coordinate is a methodology for visualizing the $p$-dimensional geometry and multivariate data. As the name suggested, all axes are shown in serial. The axes can be a finite $p$ space or transformed to an infinite space (e.g. Fourier transformation). In the finite $p$ space, all axes can be displayed in parallel which is known as the parallel coordinate; also, all axes can be displayed under a polar coordinate that is often known as the radial coordinate or radar plot. In the infinite space, a mathematical transformation is often applied. More details will be explained in the sub-section `Infinite axes` A point in Euclidean $p$-space $R^p$ is represented as a polyline in serial axes coordinate, it is found that a point <--> line duality is induced in the Euclidean plane $R^2$ [@146402]. Before we start, a couple of things should be noticed: - In the serial axes coordinate system, no `x` or `y` (even `group`) are required; but other aesthetics, such as `colour`, `fill`, `size`, etc, are accommodated. - Layer `geom_path` is used to draw the serial lines; layer `geom_histogram`, `geom_quantiles`, and `geom_density` are used to draw the histograms, quantiles (*not `quantile` regression*) and densities. Users can also customize their own layer (i.e. `geom_boxplot`, `geom_violin`, etc) by editing function `add_serialaxes_layers`. ### Finite axes Suppose we are interested in the data set `iris`. A parallel coordinate chart can be created as followings: ```{r serialaxes} library(ggmulti) # parallel axes plot ggplot(iris, mapping = aes( Sepal.Length = Sepal.Length, Sepal.Width = Sepal.Width, Petal.Length = Petal.Length, Petal.Width = Petal.Width, colour = factor(Species))) + geom_path(alpha = 0.2) + coord_serialaxes() -> p p ``` A histogram layer can be displayed by adding layer `geom_histogram` ```{r serialaxes histogram,} p + geom_histogram(alpha = 0.3, mapping = aes(fill = factor(Species))) + theme(axis.text.x = element_text(angle = 30, hjust = 0.7)) ``` A density layer can be drawn by adding layer `geom_density` ```{r serialaxes density} p + geom_density(alpha = 0.3, mapping = aes(fill = factor(Species))) ``` A parallel coordinate can be converted to radial coordinate by setting `axes.layout = "radial"` in function `coord_serialaxes`. ```{r radial, fig.width = 5} p$coordinates$axes.layout <- "radial" p ``` Note that: layers, such as `geom_histogram`, `geom_density`, etc, are not implemented in the radial coordinate yet. ### Infinite axes @andrews1972plots plot is a way to project multi-response observations into a function $f(t)$, by defining $f(t)$ as an inner product of the observed values of responses and orthonormal functions in $t$ \[f_{y_i}(t) = <\ve{y}_i, \ve{a}_t>\] where $\ve{y}_i$ is the $i$th responses and $\ve{a}_t$ is the orthonormal functions under certain interval. Andrew suggests to use the Fourier transformation \[\ve{a}_t = \{\frac{1}{\sqrt{2}}, \sin(t), \cos(t), \sin(2t), \cos(2t), ...\}\] which are orthonormal on interval $(-\pi, \pi)$. In other word, we can project a $p$ dimensional space to an infinite $(-\pi, \pi)$ space. The following figure illustrates how to construct an "Andrew's plot". ```{r andrews} p <- ggplot(iris, mapping = aes(Sepal.Length = Sepal.Length, Sepal.Width = Sepal.Width, Petal.Length = Petal.Length, Petal.Width = Petal.Width, colour = Species)) + geom_path(alpha = 0.2, stat = "dotProduct") + coord_serialaxes() p ``` A quantile layer can be displayed on top ```{r andrews with quantile} p + geom_quantiles(stat = "dotProduct", quantiles = c(0.25, 0.5, 0.75), linewidth = 2, linetype = 2) ``` A couple of things should be noticed: - mapping aesthetics is used to define the $p$ dimensional space, if not provided, all columns in the dataset 'iris' will be transformed. An alternative way to determine the $p$ dimensional space to set parameter `axes.sequence` in each layer or in `coord_serialaxes`. - To construct a dot product serial axes plot, say Fourier transformation, "Andrew's plot", we need to set the parameter `stat` in `geom_path` to "dotProduct". The default transformation function is the Andrew's (function `andrews`). Users can customize their own, for example, Tukey suggests the following projected space \[\ve{a}_t = \{\cos(t), \cos(\sqrt{2}t), \cos(\sqrt{3}t), \cos(\sqrt{5}t), ...\}\] where $t \in [0, k\pi]$ [@gnanadesikan2011methods]. ```{r tukey} tukey <- function(p = 4, k = 50 * (p - 1), ...) { t <- seq(0, p* base::pi, length.out = k) seq_k <- seq(p) values <- sapply(seq_k, function(i) { if(i == 1) return(cos(t)) if(i == 2) return(cos(sqrt(2) * t)) Fibonacci <- seq_k[i - 1] + seq_k[i - 2] cos(sqrt(Fibonacci) * t) }) list( vector = t, matrix = matrix(values, nrow = p, byrow = TRUE) ) } ggplot(iris, mapping = aes(Sepal.Length = Sepal.Length, Sepal.Width = Sepal.Width, Petal.Length = Petal.Length, Petal.Width = Petal.Width, colour = Species)) + geom_path(alpha = 0.2, stat = "dotProduct", transform = tukey) + coord_serialaxes() ``` Note that: Tukey's suggestion, element $\ve{a}_t$ can "cover" more spheres in $p$ dimensional space, but it is not orthonormal. ### An alternative way to create a serial axes plot Rather than calling function `coord_serialaxes`, an alternative way to create a serial axes object is to add a `geom_serialaxes_...` object in our model. For example, Figure 1 to 4 can be created by calling ```{r geom_serialaxes_ objects, eval = FALSE} g <- ggplot(iris, mapping = aes(Sepal.Length = Sepal.Length, Sepal.Width = Sepal.Width, Petal.Length = Petal.Length, Petal.Width = Petal.Width, colour = Species)) g + geom_serialaxes(alpha = 0.2) g + geom_serialaxes(alpha = 0.2) + geom_serialaxes_hist(mapping = aes(fill = Species), alpha = 0.2) g + geom_serialaxes(alpha = 0.2) + geom_serialaxes_density(mapping = aes(fill = Species), alpha = 0.2) # radial axes can be created by # calling `coord_radial()` # this is slightly different, check it out! g + geom_serialaxes(alpha = 0.2) + geom_serialaxes(alpha = 0.2) + coord_radial() ``` Figure 5 and 7 can be created by setting "stat" and "transform" in `geom_serialaxes`; to Figure 6, `geom_serialaxes_quantile` can be added to create a serial axes quantile layer. Some slight difference should be noticed here: * One benefit of calling `coord_serialaxes` rather than `geom_serialaxes_...` is that `coord_serialaxes` can accommodate duplicated axes in mapping aesthetics (e.g. *Eulerian path*, *Hamiltonian path*, etc). However, in `geom_serialaxes_...`, duplicated axes will be omitted. * Meaningful axes labels in `coord_serialaxes` can be created automatically, while in `geom_serialaxes_...`, users have to set axes labels by `ggplot2::scale_x_continuous` or `ggplot2::scale_y_continuous` manually. * As we turn the serial axes into interactive graphics (via package [loon.ggplot](https://great-northern-diver.github.io/loon.ggplot/)), serial axes lines in `coord_serialaxes()` could be turned as interactive but in `geom_serialaxes_...` all objects are static. ```{r benefits of coord_serialaxes, eval=FALSE} # The serial axes is `Sepal.Length`, `Sepal.Width`, `Sepal.Length` # With meaningful labels ggplot(iris, mapping = aes(Sepal.Length = Sepal.Length, Sepal.Width = Sepal.Width, Sepal.Length = Sepal.Length)) + geom_path() + coord_serialaxes() # The serial axes is `Sepal.Length`, `Sepal.Length` # No meaningful labels ggplot(iris, mapping = aes(Sepal.Length = Sepal.Length, Sepal.Width = Sepal.Width, Sepal.Length = Sepal.Length)) + geom_serialaxes() ``` Also, if the dimension of data is large, typing each variate in mapping aesthetics is such a headache. Parameter `axes.sequence` is provided to determine the axes. For example, a `serialaxes` object can be created as ```{r axes.sequence, eval=FALSE} ggplot(iris) + geom_path() + coord_serialaxes(axes.sequence = colnames(iris)[-5]) ``` At very end, please report bugs [here](https://github.com/z267xu/ggmulti/issues). Enjoy the high dimensional visualization! "Don't panic... Just do it in 'serial'" [@inselberg1999don]. ## Reference